Self-Tuning Deep Reinforcement Learning
Tom Zahavy¹, Zhongwen Xu¹, Vivek Veeriah¹, Matteo Hessel¹, Junhyuk Oh¹, Hado van Hasselt¹, David Silver¹, Satinder Singh¹ ²
It is perhaps surprising that we may choose to optimize a different loss function in the inner loop, instead of the outer loss we ultimately care about. However, this is not a new idea. Regularization, for example, is a technique that changes the objective in the inner loop to balance the bias-variance trade-off and avoid the risk of overfitting. In model-based RL, it was shown that the policy found using a smaller discount factor can actually be better than a policy learned with the true discount factor (Jiang et al., 2015). Auxiliary tasks (Jaderberg et al., 2016) are another example, where gradients are taken w.r.t. unsupervised loss functions in order to improve the agent's representation. Finally, it is well known that, in order to maximize the long-term cumulative reward efficiently (the objective of the outer loop), RL agents must explore, i.e., act according to a different objective in the inner loop (accounting, for instance, for uncertainty).

This paper makes the following contributions. First, we show that it is feasible to use metagradients to simultaneously tune many critical hyperparameters (controlling important trade-offs in a reinforcement learning agent), as long as they are differentiable w.r.t. a validation/outer loss. Importantly, we show that this can be done online, within a single lifetime, requiring only 30% additional compute. We demonstrate this by introducing two novel deep RL architectures that extend IMPALA (Espeholt et al., 2018), a distributed actor-critic, by adding additional components with many more new hyperparameters to be tuned. The first agent, referred to as a Self-Tuning Actor Critic (STAC), introduces a leaky V-trace operator that mixes importance sampling (IS) weights with truncated IS weights. The mixing coefficient in leaky V-trace is differentiable (unlike the original V-trace) but similarly balances the variance-contraction trade-off in off-policy learning. The second architecture is STACX (STAC with auXiliary tasks). Inspired by Fedus et al. (2019), STACX augments STAC with parametric auxiliary loss functions (each with its own hyperparameters).

These agents allow us to show empirically, through extensive ablation studies, that performance consistently improves as we expose more hyperparameters to metagradients. In particular, when applied to 57 Atari games (Bellemare et al., 2013), STACX achieved a normalized median score of 364%, a new state of the art for online model-free agents.

2. Background

In the following, we consider three types of parameters:

1. θ – the agent parameters.
2. ζ – the hyperparameters.
3. η ⊂ ζ – the metaparameters.

θ denotes the parameters of the agent and parameterises, for example, the value function and the policy; these parameters are randomly initialised at the beginning of an agent's lifetime and updated using back-propagation on a suitable inner loss function. ζ denotes the hyperparameters, including, for example, the parameters of the optimizer (e.g. the learning rate) or the parameters of the loss function (e.g. the discount factor); these may be tuned over the course of many lifetimes (for instance via random search) to optimize an outer (validation) loss function. In a typical deep RL setup, only these first two types of parameters need to be considered. In metagradient algorithms a third set of parameters must be specified: the metaparameters, denoted η; these are a subset of the differentiable parameters in ζ that start with some initial value (itself a hyperparameter), but that are then adapted during the course of training.

2.1. The metagradient approach

Metagradient RL (Xu et al., 2018) is a general framework for adapting, online, within a single lifetime, the differentiable hyperparameters η. Consider an inner loss that is a function of both the parameters θ and the metaparameters η: $L_{\text{inner}}(\theta; \eta)$. On each step of an inner loop, θ can be optimized with a fixed η to minimize the inner loss $L_{\text{inner}}(\theta; \eta)$:

$$\tilde{\theta}(\eta_t) = \theta_{t+1} = \theta_t - \nabla_\theta L_{\text{inner}}(\theta_t; \eta_t). \qquad (1)$$

In an outer loop, η can then be optimized to minimize the outer loss by taking a metagradient step. As $\tilde{\theta}(\eta)$ is a function of η, this corresponds to updating the η parameters by differentiating the outer loss w.r.t. η:

$$\eta_{t+1} = \eta_t - \nabla_\eta L_{\text{outer}}(\tilde{\theta}(\eta_t)). \qquad (2)$$

The algorithm is general, as it implements a specific case of online cross-validation, and can be applied, in principle, to any differentiable metaparameter η used by the inner loss.
loss functions (each with it’s own hyper-parameters). 2.2. IMPALA
These agents allow us to show empirically, through ex- Specific instantiations of the metagradient RL framework
tensive ablation studies, that performance consistently im- require specification of the inner and outer loss functions.
proves as we expose more hyperparameters to metagradients. Since our agent builds on the IMPALA actor critic agent
In particular, when applied to 57 Atari games (Bellemare (Espeholt et al., 2018), we now provide a brief introduction.
et al., 2013), STACX achieved a normalized median score
of 364%, a new state-of-the-art for online model free agents. IMPALA maintains a policy πθ (at |xt ) and a value function
vθ (xt ) that are parameterized with parameters θ. These
policy and the value function are trained via an actor-critic
2. Background update with entropy regularization; such an update is often
represented (with slight abuse of notation) as the gradient
In the following, we consider three types of parameters:
of the following pseudo-loss function
2
1. θ – the agent parameters. L(θ) =gv · (vs − Vθ (xs ))
−gp ρs · log πθ (as |xs )(rs + γvs+1 − Vθ (xs ))
2. ζ – the hyperparameters. X
−ge · πθ (as |xs ) log πθ (as |xs ), (3)
3. η ⊂ ζ – the metaparameters. a
where g_v, g_p, g_e are suitable loss coefficients. We refer to the policy that generates the data for these updates as the behaviour policy µ(a_t|x_t). In the on-policy case, where µ(a_t|x_t) = π(a_t|x_t) and ρ_s = 1, v_s is the n-step bootstrapped return

$$v_s = \sum_{t=s}^{s+n-1} \gamma^{t-s} r_t + \gamma^n V(x_{s+n}).$$

IMPALA uses a distributed actor-critic architecture that assigns copies of the policy parameters to multiple actors on different machines to achieve higher sample throughput. As a result, the target policy π on the learner machine can be several updates ahead of the actors' policy µ that generated the data used in an update. Such off-policy discrepancy can lead to biased updates, requiring the updates to be weighted with IS weights for stable learning. Specifically, IMPALA (Espeholt et al., 2018) uses truncated IS weights to balance the variance-contraction trade-off in these off-policy updates. This corresponds to instantiating Eq. (3) with

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left( \prod_{i=s}^{t-1} c_i \right) \delta_t V, \qquad (4)$$

where we define $\delta_t V = \rho_t \big(r_t + \gamma V(x_{t+1}) - V(x_t)\big)$ and we set $\rho_t = \min\!\big(\bar{\rho}, \tfrac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\big)$, $c_i = \lambda \min\!\big(\bar{c}, \tfrac{\pi(a_i|x_i)}{\mu(a_i|x_i)}\big)$ for suitable truncation levels $\bar{\rho}$ and $\bar{c}$.
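As an illustration, the following is a minimal JAX sketch of computing the targets of Eq. (4) on a single trajectory via the equivalent backward recursion; the function name, the batching convention, and the default truncation levels are our own choices, not taken from the paper.

```python
import jax
import jax.numpy as jnp

def vtrace_targets(log_is, rewards, values, bootstrap_value,
                   gamma=0.995, lam=1.0, rho_bar=1.0, c_bar=1.0):
    # log_is[t] = log(pi(a_t|x_t) / mu(a_t|x_t)) along one trajectory of length n;
    # values[t] = V(x_t) for t = 0..n-1, and bootstrap_value = V(x_n).
    is_weights = jnp.exp(log_is)
    rho_t = jnp.minimum(rho_bar, is_weights)      # truncated rho_t
    c_t = lam * jnp.minimum(c_bar, is_weights)    # truncated c_i, scaled by lambda
    values_tp1 = jnp.concatenate([values[1:], bootstrap_value[None]])
    deltas = rho_t * (rewards + gamma * values_tp1 - values)   # delta_t V

    # Backward recursion A_s = delta_s V + gamma * c_s * A_{s+1}, which unrolls to the
    # forward sum in Eq. (4), so that v_s = V(x_s) + A_s.
    def step(acc, x):
        delta, c = x
        acc = delta + gamma * c * acc
        return acc, acc

    _, acc = jax.lax.scan(step, jnp.zeros(()), (deltas, c_t), reverse=True)
    return values + acc

# Example on a made-up trajectory of length 5:
key = jax.random.PRNGKey(0)
log_is = 0.1 * jax.random.normal(key, (5,))
print(vtrace_targets(log_is, rewards=jnp.ones(5), values=jnp.zeros(5),
                     bootstrap_value=jnp.array(0.0)))
```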
2.3. Metagradient IMPALA

The metagradient agent in (Xu et al., 2018) uses the metagradient update rules from the previous section with the actor-critic loss function of Espeholt et al. (2018). More specifically, the inner loss is a parameterised version of the IMPALA loss with metaparameters η based on Eq. (3),

$$L(\theta; \eta) = 0.5\,(v_s - V_\theta(x_s))^2 - \rho_s \log \pi_\theta(a_s|x_s)\big(r_s + \gamma v_{s+1} - V_\theta(x_s)\big) - 0.01 \sum_a \pi_\theta(a|x_s) \log \pi_\theta(a|x_s),$$

where η = {γ, λ}. Notice that γ and λ also affect the inner loss through the definition of v_s (Eq. (4)).

The outer loss is defined to be the policy gradient loss

$$L_{\text{outer}}(\eta) = L_{\text{outer}}(\tilde{\theta}(\eta)) = -\rho_s \log \pi_{\tilde{\theta}(\eta)}(a_s|x_s)\big(r_s + \gamma v_{s+1} - V_{\tilde{\theta}(\eta)}(x_s)\big).$$

3. Self-Tuning actor-critic agents

We first consider a slightly extended version of the metagradient IMPALA agent (Xu et al., 2018). Specifically, we allow the metagradient to adapt learning rates for individual components in the loss,

$$L(\theta; \eta) = g_v \cdot (v_s - V_\theta(x_s))^2 - g_p \cdot \rho_s \log \pi_\theta(a_s|x_s)\big(r_s + \gamma v_{s+1} - V_\theta(x_s)\big) - g_e \cdot \sum_a \pi_\theta(a|x_s) \log \pi_\theta(a|x_s). \qquad (5)$$

The outer loss is defined using Eq. (3),

$$L_{\text{outer}}(\eta) = L_{\text{outer}}(\tilde{\theta}(\eta)) = g_v^{\text{outer}} \cdot \big(v_s - V_{\tilde{\theta}(\eta)}(x_s)\big)^2 - g_p^{\text{outer}} \cdot \rho_s \log \pi_{\tilde{\theta}(\eta)}(a_s|x_s)\big(r_s + \gamma v_{s+1} - V_{\tilde{\theta}(\eta)}(x_s)\big) - g_e^{\text{outer}} \cdot \sum_a \pi_{\tilde{\theta}(\eta)}(a|x_s) \log \pi_{\tilde{\theta}(\eta)}(a|x_s) + g_{\text{kl}}^{\text{outer}} \cdot \mathrm{KL}\big(\pi_{\tilde{\theta}(\eta)}, \pi_\theta\big). \qquad (6)$$

Notice the new Kullback–Leibler (KL) term in Eq. (6), motivating the η-update not to change the policy too much. Compared to the work by Xu et al. (2018), which only self-tuned the inner-loss γ and λ (hence η = {γ, λ}), also self-tuning the inner-loss coefficients corresponds to setting η = {γ, λ, g_v, g_p, g_e}. These metaparameters allow for loss-specific learning rates and support dynamically balancing exploration with exploitation by adapting the entropy loss weight.¹ The hyperparameters of STAC include the initialisations of the metaparameters, the hyperparameters of the outer loss {γ^outer, λ^outer, g_v^outer, g_p^outer, g_e^outer}, the KL coefficient g_kl^outer (set to 1), and the learning rate of the ADAM meta optimizer (set to 1e−3).
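The structural difference between the inner loss of Eq. (5) and the outer loss of Eq. (6) can be sketched as follows; the array arguments (logits, chosen actions, truncated IS weights, value estimates, V-trace targets, and policy-gradient advantages) are assumed to be computed elsewhere, and the coefficient dictionary and all names are ours, not the paper's API. The signs mirror Eq. (5) as written above.

```python
import jax
import jax.numpy as jnp

def weighted_actor_critic_loss(coeffs, logits, actions, rho_t, values, v_targets, pg_adv):
    # The three terms of Eq. (5), with the signs as written there, each scaled by its own
    # coefficient; in STAC the inner-loss coefficients g_v, g_p, g_e are part of eta.
    log_probs = jax.nn.log_softmax(logits)                        # log pi(.|x_s)
    log_pi_a = jnp.take_along_axis(log_probs, actions[:, None], axis=-1)[:, 0]
    critic = jnp.mean((v_targets - values) ** 2)
    policy = jnp.mean(rho_t * log_pi_a * pg_adv)
    ent_term = jnp.mean(jnp.sum(jnp.exp(log_probs) * log_probs, axis=-1))
    return coeffs["g_v"] * critic - coeffs["g_p"] * policy - coeffs["g_e"] * ent_term

def stac_outer_loss(outer_coeffs, kl_new_old, *loss_args):
    # Eq. (6): the same form with fixed outer coefficients, plus a KL(pi_new, pi_old)
    # penalty that discourages eta-updates that change the policy too much.
    return (weighted_actor_critic_loss(outer_coeffs, *loss_args)
            + outer_coeffs["g_kl"] * kl_new_old)

# Tiny illustrative call with 4 timesteps and 3 actions (all quantities made up):
T, A = 4, 3
eta = {"g_v": 0.5, "g_p": 1.0, "g_e": 0.01}
args = (jnp.zeros((T, A)), jnp.zeros(T, dtype=jnp.int32), jnp.ones(T),
        jnp.zeros(T), jnp.ones(T), jnp.ones(T))
print(weighted_actor_critic_loss(eta, *args))
```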
Table 1. Parameters in IMPALA and self-tuning IMPALA.

        IMPALA                    Self-Tuning IMPALA
θ       v_θ, π_θ                  v_θ, π_θ
ζ       {γ, λ, g_v, g_p, g_e}     {γ^outer, λ^outer, g_v^outer, g_p^outer, g_e^outer}, initialisations, ADAM parameter, g_kl^outer
η       –                         {γ, λ, g_v, g_p, g_e}

To set the initial values of the metaparameters of self-tuning IMPALA, we use a simple "rule of thumb" and set them to the values of the corresponding parameters in the outer loss (e.g., the initial value of the inner-loss λ is set equal to the outer-loss λ). For outer-loss hyperparameters that are common to IMPALA we default to the IMPALA settings.

In the next two sections, we show how embracing self-tuning via metagradients enables us to augment this agent with a parameterised Leaky V-trace operator and with self-tuned auxiliary loss functions. These ideas are examples of how the ability to self-tune metaparameters via metagradients can be used to introduce novel ideas into RL algorithms without requiring extensive tuning of the new hyperparameters.

¹ There are a few additional subtle differences between this self-tuning IMPALA agent and the metagradient agent from Xu et al. (2018). For example, we do not use the γ embedding used in (Xu et al., 2018). These differences are further discussed in the supplementary material, where we also reproduce the results of Xu et al. (2018) in our code base.
3.2. STAC with auxiliary tasks (STACX)

Next, we introduce a new agent that extends STAC with auxiliary policies, value functions, and respective auxiliary loss functions; this is new because the parameters that define the auxiliary tasks (the discount factors in this case) are self-tuned. As this agent has a new architecture in addition to an extended set of metaparameters, we give it a different acronym and denote it by STACX (STAC with auxiliary tasks). The auxiliary losses have the same parametric form as the main loss in Eq. (5), each with its own metaparameters.

[Figure 1. Block diagrams of STAC and STACX.]

STACX's architecture has a shared representation layer θ_shared, from which it splits into n different heads (Fig. 1). For the shared representation layer we use the deep residual net from (Espeholt et al., 2018). Each head has a policy and a corresponding value function that are represented using a 2-layer MLP with parameters {θ^i}_{i=1}^n. Each one of these heads is trained in the inner loop to minimize a loss function L(θ^i; η^i) parametrised by its own set of metaparameters η^i.
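A minimal sketch of this parameterisation in plain JAX is given below: a shared torso followed by n heads, each a 2-layer MLP with its own policy logits and value output. The layer sizes, the initialisation scale, and the single linear layer standing in for the deep residual torso of (Espeholt et al., 2018) are all illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def init_stacx_params(key, obs_dim, hidden=256, num_actions=18, n_heads=3):
    # A shared torso theta_shared and n heads {theta^i}, each a 2-layer MLP with its own
    # policy and value outputs. A single linear layer stands in for the deep residual torso.
    keys = jax.random.split(key, 1 + n_heads)
    params = {"torso": 0.01 * jax.random.normal(keys[0], (obs_dim, hidden)), "heads": []}
    for k in keys[1:]:
        k1, k2, k3 = jax.random.split(k, 3)
        params["heads"].append({
            "w1": 0.01 * jax.random.normal(k1, (hidden, hidden)),
            "w_pi": 0.01 * jax.random.normal(k2, (hidden, num_actions)),
            "w_v": 0.01 * jax.random.normal(k3, (hidden, 1)),
        })
    return params

def stacx_forward(params, obs):
    # Returns per-head (policy logits, value). Head 0 plays the role of head i=1 in the
    # text: its policy acts in the environment; the remaining heads are auxiliary.
    h = jax.nn.relu(obs @ params["torso"])                # shared representation
    outputs = []
    for head in params["heads"]:
        z = jax.nn.relu(h @ head["w1"])                   # 2-layer MLP head
        outputs.append((z @ head["w_pi"], (z @ head["w_v"])[..., 0]))
    return outputs

params = init_stacx_params(jax.random.PRNGKey(0), obs_dim=16)
heads_out = stacx_forward(params, jnp.ones((2, 16)))      # [(logits [2,18], value [2])] * 3
```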
The policy of the STACX agent is defined to be the policy of a specific head (i = 1). The hyperparameters {η^i}_{i=1}^n are trained in the outer loop to improve the performance of this single head. Thus, the role of the auxiliary heads is to act as auxiliary tasks (Jaderberg et al., 2016) and improve the shared representation θ_shared. Finally, notice that each head has its own policy π^i, but the behaviour policy is fixed to be π^1. Thus, to optimize the auxiliary heads we use (Leaky) V-trace for off-policy corrections.³

The metaparameters for STACX are {γ^i, λ^i, g_v^i, g_p^i, g_e^i, α^i}_{i=1}^3. Since the outer loss is defined only w.r.t. head 1, introducing the auxiliary tasks into STACX does not require new hyperparameters for the outer loss. In addition, we use the same initialisation values for all the auxiliary tasks. Thus, STACX has exactly the same hyperparameters as STAC.

4. Experiments

In all of our experiments (with the exception of the robustness experiments in Section 4.3) we use the IMPALA hyperparameters both for the IMPALA baseline and for the outer loss of the STAC agent, i.e., λ^outer = 1, α^outer = 1, g_{v,p,e}^outer = 1. We use γ = 0.995 as it was found to improve the performance of IMPALA considerably (Xu et al., 2018).

4.1. Atari learning curves

We start by evaluating STAC and STACX in the Arcade Learning Environment (Bellemare et al., 2013, ALE). Fig. 2 presents the normalized median scores⁴ during training. We found STACX to learn faster and achieve higher final performance than STAC. We also compare these agents with versions of them without self-tuning (fixing the metaparameters). The version of STACX with fixed unsupervised auxiliary tasks achieved a normalized median score of 247%, similar to that of UNREAL (Jaderberg et al., 2016) but not much better than IMPALA. In Fig. 3 we report the relative improvement of STACX over IMPALA in the individual levels (an equivalent figure for STAC may be found in the supplementary material).

[Figure 2. Normalized median scores during training (legend: STAC, STACX).]

³ We also considered two extensions of this approach. (1) Random ensemble: the policy head is chosen at random from [1, .., n], and the hyperparameters are differentiated w.r.t. the performance of each one of the heads in the outer loop. (2) Average ensemble: the actor policy is defined to be the average logits of the heads, and we learn one additional head for the value function of this policy. The metagradient in the outer loop is taken with respect to the actor policy, and/or each one of the heads individually. While these extensions seem interesting, in all of our experiments they always led to a small decrease in performance when compared to our auxiliary task agent without these extensions. Similar findings were reported in (Fedus et al., 2019).

⁴ Normalized median scores are computed as follows. For each Atari game, we compute the human-normalized score after 200M frames of training and average this over 3 different seeds; we then report the overall median score over the 57 Atari domains.
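Footnote 4 can be written down directly; the sketch below assumes the standard human-normalized score (score − random) / (human − random), which the excerpt itself does not spell out, and the reference-score arrays are hypothetical inputs.

```python
import jax.numpy as jnp

def normalized_median_score(agent_scores, random_scores, human_scores):
    # agent_scores: [num_games, num_seeds] raw scores after 200M frames;
    # random_scores, human_scores: [num_games] reference scores used for normalization.
    per_seed = (agent_scores - random_scores[:, None]) / (
        human_scores[:, None] - random_scores[:, None])
    per_game = per_seed.mean(axis=1)      # average the human-normalized score over seeds
    return jnp.median(per_game)           # median over the (57) games

# Example with made-up numbers for two games and three seeds:
scores = jnp.array([[120.0, 130.0, 110.0], [15.0, 18.0, 21.0]])
print(normalized_median_score(scores, random_scores=jnp.array([10.0, 5.0]),
                              human_scores=jnp.array([100.0, 20.0])))
```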
[Figure 3. Mean human-normalized scores after 200M frames: relative improvement in percent of STACX over IMPALA on each of the 57 Atari games (x-axis: relative human-normalized score).]

[Figure 4. Normalized median scores for ablative variations of STAC and STACX, in which different subsets of the metaparameters η are self-tuned, together with the IMPALA, Nature DQN, and Xu et al. (2018) baselines.]

In black we plot the baselines⁵, and in green and blue we can see different ablations of STAC and STACX, respectively. In these ablations η corresponds to the subset of the hyperparameters that are being self-tuned, e.g., η = {γ} corresponds to self-tuning a single loss function where only γ is self-tuned (and the other hyperparameters are fixed).

Inspecting Fig. 4, we observe that the performance of STAC and STACX consistently improves as they self-tune more metaparameters. These metaparameters control different trade-offs in reinforcement learning: the discount factor controls the effective horizon, the loss coefficients affect learning rates, and the Leaky V-trace coefficient controls the variance-contraction-bias trade-off in off-policy RL.

We have also experimented with adding more auxiliary losses, i.e., having 5 or 8 auxiliary loss functions. These variations performed better than having a single loss function but slightly worse than having 3. This can be further explained by Fig. 9 (Section 4.4), which shows that the auxiliary heads are self-tuned to similar metaparameters.
[Figure 5. The number of games (y-axis) in which ablative variations of STAC and STACX improve the IMPALA baseline by at least x percent (x-axis: percentage of relative improvement).]

4.3. Robustness

Each hyperparameter in the inner loss that we expose to metagradients still requires an initialization (itself a hyperparameter). Therefore, in this section, we investigate the robustness of STACX to its hyperparameters.

We begin with the hyperparameters of the outer loss. In these experiments we compare the robustness of STACX with that of IMPALA in the following manner. For each hyperparameter (γ, g_v) we select 5 perturbations. For STACX we perturb the hyperparameter in the outer loss (γ^outer, g_v^outer), and for IMPALA we perturb the corresponding hyperparameter (γ, g_v). We randomly selected 5 Atari levels and present the mean and standard deviation across 3 random seeds after 200M frames of training.

Fig. 6 presents the results for the discount factor. We can see that overall (in 72% of the configurations, measured by mean), STACX indeed performs better than IMPALA. Similarly, Fig. 7 shows the robustness of STACX to the critic weight (g_v), where STACX improves over IMPALA in 80% of the configurations.

[Figure 6. Robustness to the discount factor: mean scores of STACX and IMPALA for γ ∈ {0.99, 0.992, 0.995, 0.997, 0.999}; panels include Chopper_command, Jamesbond, Defender, and Crazy_climber.]

[Figure 7. Robustness to the critic weight g_v. Mean and confidence intervals (over 3 seeds) after 200M frames of training, for g_v ∈ {0.25, 0.35, 0.5, 0.75, 1}. Left (blue) bars correspond to STACX and right (red) bars correspond to IMPALA. STACX is better than IMPALA in 80% of the runs measured by mean.]

Next, we investigate the robustness of STACX to the initialisation of the metaparameters in Fig. 8. We selected values that are close to 1, as our design principle is to initialise the metaparameters to be similar to the hyperparameters in the outer loss. We observe that overall, the method is quite robust to different initialisations.
[Figure 8. Robustness to the initialisation of the metaparameters (x-axis: discount init value ∈ {0.99, 0.992, 0.995, 0.997, 0.999}).]

[Figure 9. Adaptivity in Jamesbond: the self-tuned metaparameters (including g_p, g_v, g_e) of head 1, head 2, and head 3 over the course of training.]
Jiang, N., Kulesza, A., Singh, S., and Lewis, R. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181–1189, 2015.

Maas, A. L., Hannun, A. Y., and Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Citeseer, 2013.

Maclaurin, D., Duvenaud, D., and Adams, R. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122, 2015.

Mann, T. A., Penedones, H., Mannor, S., and Hester, T. Adaptive lambda least-squares temporal difference learning. arXiv preprint arXiv:1612.09465, 2016.

Veeriah, V., Hessel, M., Xu, Z., Rajendran, J., Lewis, R. L., Oh, J., van Hasselt, H. P., Silver, D., and Singh, S. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pp. 9306–9317, 2019.

White, M. and White, A. A greedy approach to adapting the trace parameter for temporal difference learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 557–565, 2016.

Xu, B., Wang, N., Chen, T., and Li, M. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.

Xu, Z., van Hasselt, H. P., and Silver, D. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396–2407, 2018.
6. Additional results
[Figure 11. Mean human-normalized scores after 200M frames: relative improvement in percent of STAC over the IMPALA baseline on each of the 57 Atari games (x-axis: relative human-normalized score).]
where the expectation $\mathbb{E}_\mu$ is with respect to the behaviour policy $\mu$ which has generated the trajectory $(x_t)_{t \ge 0}$, i.e., $x_0 = x$, $x_{t+1} \sim P(\cdot \mid x_t, a_t)$, $a_t \sim \mu(\cdot \mid x_t)$. Similar to (Espeholt et al., 2018), we consider the infinite-horizon operator, but very similar results hold for the n-step truncated operator.

Let
$$\mathrm{IS}(x_t) = \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}$$
be the importance sampling weights.

Proof. The proof follows the proof of V-trace from (Espeholt et al., 2018), with adaptations for the leaky V-trace coefficients. We have that
$$\tilde{\mathcal{R}} V_1(x) - \tilde{\mathcal{R}} V_2(x) = \mathbb{E}_\mu \sum_{t \ge 0} \gamma^t \left( \prod_{s=0}^{t-2} \tilde{c}_s \right) \left[ \tilde{\rho}_{t-1} - \tilde{c}_{t-1} \tilde{\rho}_t \right] \big( V_1(x_t) - V_2(x_t) \big).$$
Since $\mathbb{E}_\mu \mathrm{IS}(x_t) = 1$, we have $\mathbb{E}_\mu \tilde{\rho}_t \le 1$. Furthermore, since $\bar{\rho} \ge \bar{c}$ and $\alpha_\rho \ge \alpha_c$, we have that $\tilde{\rho}_t \ge \tilde{c}_t$ for all $t$. Thus, the coefficients $\left[ \tilde{\rho}_{t-1} - \tilde{c}_{t-1} \tilde{\rho}_t \right]$ are non-negative in expectation, since
$$\sum_{t \ge 0} \gamma^t \mathbb{E}_\mu \prod_{s=0}^{t-2} \tilde{c}_s \left[ \tilde{\rho}_{t-1} - \tilde{c}_{t-1} \tilde{\rho}_t \right] = \sum_{t \ge 0} \gamma^t \mathbb{E}_\mu \prod_{s=0}^{t-2} \tilde{c}_s \, \tilde{\rho}_{t-1} - \sum_{t \ge 0} \gamma^t \mathbb{E}_\mu \prod_{s=0}^{t-1} \tilde{c}_s \, \tilde{\rho}_t = \gamma^{-1} - (\gamma^{-1} - 1) \sum_{t \ge 0} \gamma^t \mathbb{E}_\mu \prod_{s=0}^{t-2} \tilde{c}_s \, \tilde{\rho}_{t-1},$$
where Eq. (11) holds since we expanded only the first two elements in the sum, and all the elements in this sum are positive, and Eq. (12) holds by the assumption.

We deduce that $\| \tilde{\mathcal{R}} V_1(x) - \tilde{\mathcal{R}} V_2(x) \|_\infty \le \tilde{\eta} \| V_1(x) - V_2(x) \|_\infty$, with $\tilde{\eta} = 1 - (1 - \gamma)(\alpha_\rho \beta + 1 - \alpha_\rho) < 1$, so $\tilde{\mathcal{R}}$ is a contraction mapping. Furthermore, we can see that the parameter $\alpha_\rho$ controls the contraction rate: for $\alpha_\rho = 1$ we get the contraction rate of V-trace, $\tilde{\eta} = 1 - (1 - \gamma)\beta$, and as $\alpha_\rho$ gets smaller we get better contraction, as with $\alpha_\rho = 0$ we get $\tilde{\eta} = \gamma$.

Thus $\tilde{\mathcal{R}}$ possesses a unique fixed point. Let us now prove that this fixed point is $V^{\pi_{\bar{\rho}, \alpha_\rho}}$, where $\pi_{\bar{\rho}, \alpha_\rho}$ is a policy that mixes the target policy with the V-trace policy. We have that the left-hand side (up to the summation over $b$) of the last equality equals zero, since this is the Bellman equation for $V^{\pi_{\bar{\rho}, \alpha_\rho}}$. We deduce that $\tilde{\mathcal{R}} V^{\pi_{\bar{\rho}, \alpha_\rho}} = V^{\pi_{\bar{\rho}, \alpha_\rho}}$, and thus $V^{\pi_{\bar{\rho}, \alpha_\rho}}$ is the unique fixed point of $\tilde{\mathcal{R}}$.
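The excerpt does not state the exact form of the leaky coefficients $\tilde{\rho}_t$ and $\tilde{c}_t$. The sketch below assumes they linearly mix the truncated and untruncated importance weights, which is consistent with the earlier description of Leaky V-trace as mixing IS weights with truncated IS weights and with the limiting cases $\alpha_\rho = 1$ (V-trace) and $\alpha_\rho = 0$ ($\tilde{\eta} = \gamma$) discussed above; the function name and defaults are ours.

```python
import jax.numpy as jnp

def leaky_importance_weights(log_is, rho_bar=1.0, c_bar=1.0, lam=1.0,
                             alpha_rho=1.0, alpha_c=1.0):
    # Assumed mixing: rho_tilde_t = alpha_rho * min(rho_bar, IS_t) + (1 - alpha_rho) * IS_t,
    # and analogously for c_tilde_t (scaled by lambda). With alpha_rho = alpha_c = 1 this
    # reduces to the truncated V-trace weights; with alpha = 0 the untruncated IS weights
    # are recovered, matching the limiting contraction rates discussed above.
    is_weights = jnp.exp(log_is)
    rho_tilde = alpha_rho * jnp.minimum(rho_bar, is_weights) + (1.0 - alpha_rho) * is_weights
    c_tilde = lam * (alpha_c * jnp.minimum(c_bar, is_weights) + (1.0 - alpha_c) * is_weights)
    return rho_tilde, c_tilde

rho_tilde, c_tilde = leaky_importance_weights(jnp.array([0.5, -0.2, 0.1]),
                                              alpha_rho=0.9, alpha_c=0.9)
```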
8. Reproducibility

Inspecting the results in Fig. 4, one may notice that there are small differences between the results of IMPALA, and of using metagradients to tune only γ and λ, compared to the results reported in (Xu et al., 2018).

We investigated the possible reasons for these differences. First, our method was implemented in a different code base. Our code is written in JAX, compared to the implementation in (Xu et al., 2018) that was written in TensorFlow. This may explain the small difference in final performance between our IMPALA baseline (Fig. 4) and the result of Xu et al., which is slightly higher (257.1).

Second, Xu et al. observed that embedding the γ hyperparameter into the θ network improved their results significantly, reaching a final performance (when learning γ and λ) of 287.7 (see section 1.4 in (Xu et al., 2018) for more details). Our method, on the other hand, only achieved a score of 240 in this ablative study. We further investigated this difference by introducing the γ embedding into our architecture. With γ embedding, our method achieved a score of 280.6, which almost reproduces the results in (Xu et al., 2018). We then introduced the same embedding mechanism to our model with auxiliary loss functions. In this case, for auxiliary loss i we embed γ_i. We experimented with two variants, one that shares the embedding weights across the auxiliary tasks and one that learns a specific embedding for each auxiliary task. Both of these variants performed similarly (306.8 and 307.7, respectively), which is better than our previous result with embedding and without auxiliary losses (280.6). Unfortunately, the auxiliary loss architecture actually performed better without the embedding (353.4), and we therefore ended up not using the γ embedding in our architecture. We leave it to future work to further investigate methods of combining the embedding mechanisms with the auxiliary loss functions.
[Figures 12–21. Meta parameters and reward in each Atari game (and seed) during learning; different colors correspond to different heads, blue is the main (policy) head. Each row shows the self-tuned metaparameters (including g_p, g_v, g_e) and the reward over 2e8 frames, for games including bank_heist, battle_zone, breakout, double_dunk, enduro, freeway, frostbite, gravitar, kangaroo, krull, kung_fu_master, montezuma_revenge, ms_pacman, private_eye, riverraid, robotank, seaquest, skiing, solaris, time_pilot, venture, wizard_of_wor, yars_revenge, and zaxxon.]