
MICo: Improved representations via sampling-based state similarity for Markov decision processes

Pablo Samuel Castro∗ (Google Research, Brain Team), Tyler Kastner∗ (McGill University), Prakash Panangaden (McGill University), Mark Rowland (DeepMind)

arXiv:2106.08229v4 [cs.LG] 21 Jan 2022

Abstract
We present a new behavioural distance over the state space of a Markov deci-
sion process, and demonstrate the use of this distance as an effective means of
shaping the learnt representations of deep reinforcement learning agents. While
existing notions of state similarity are typically difficult to learn at scale due to
high computational cost and lack of sample-based algorithms, our newly-proposed
distance addresses both of these issues. In addition to providing detailed theoretical
analysis, we provide empirical evidence that learning this distance alongside the
value function yields structured and informative representations, including strong
results on the Arcade Learning Environment benchmark.

1 Introduction
The success of reinforcement learning (RL) algorithms in large-scale, complex tasks depends on
forming useful representations of the environment with which the algorithms interact. Feature
selection and feature learning has long been an important subdomain of RL, and with the advent of
deep reinforcement learning there has been much recent interest in understanding and improving the
representations learnt by RL agents.
Much of the work in representation learning has taken place from the perspective of auxiliary
tasks [Jaderberg et al., 2017, Bellemare et al., 2017, Fedus et al., 2019]; in addition to the primary
reinforcement learning task, the agent may attempt to predict and control additional aspects of the
environment. Auxiliary tasks shape the agent’s representation of the environment implicitly, typically
via gradient descent on the additional learning objectives. As such, while auxiliary tasks continue to
play an important role in improving the performance of deep RL algorithms, our understanding of
the effects of auxiliary tasks on representations in RL is still in its infancy.
In contrast to the implicit representation shaping of auxiliary tasks, a separate line of work on be-
havioural metrics, such as bisimulation metrics [Desharnais et al., 1999, 2004, Ferns et al., 2004,
2006], aims to capture structure in the environment by learning a metric measuring behavioral
similarity between states. Recent works have successfully used behavioural metrics to shape the
representations of deep RL agents [Gelada et al., 2019, Zhang et al., 2021, Agarwal et al., 2021a].
However, in practice behavioural metrics are difficult to estimate from both statistical and computa-
tional perspectives, and these works either rely on specific assumptions about transition dynamics to
make the estimation tractable, and as such can only be applied to limited classes of environments, or
are applied to more general classes of environments not covered by theoretical guarantees.
The principal objective of this work is to develop new measures of behavioral similarity that avoid
the statistical and computational difficulties described above, and simultaneously capture richer
information about the environment. We introduce the MICo (Matching under Independent Couplings)
distance, and develop the theory around its computation and estimation, making comparisons with
existing metrics on the basis of computational and statistical efficiency. We demonstrate the usefulness

∗Equal contribution. Correspondence to Pablo Samuel Castro: psc@google.com.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).


[Figure 1 appears here: left panel, IQM human normalized score vs. number of frames (in millions) on the ALE for DQN, Rainbow, QR-DQN, IQN, and M-IQN, each with and without the MICo loss; right panel, IQM normalized score vs. number of frames (×1000) on the DM-Control suite for SAC, MICo, DBC, DBC-Det, and DBC+MICo.]

Figure 1: Interquartile Mean human normalized scores of all the agents and losses on the ALE suite (left) and on the DM-Control suite (right), both run with five independent seeds for each agent and environment. In both suites MICo provides a clear advantage.
of the representations that MICo yields, both through empirical evaluations in small problems (where
we can compute them exactly) as well as in two large benchmark suites: (1) the Arcade Learning
Environment [Bellemare et al., 2013, Machado et al., 2018], in which the performance of a wide
variety of existing value-based deep RL agents is improved by directly shaping representations via
the MICo distance (see Figure 1, left), and (2) the DM-Control suite [Tassa et al., 2018], in which we
demonstrate it can improve the performance of both Soft Actor-Critic [Haarnoja et al., 2018] and the
recently introduced DBC [Zhang et al., 2021] (see Figure 1, right).

2 Background
Before describing the details of our contributions, we give a brief overview of the required background
in reinforcement learning and bisimulation. We provide more extensive background in Appendix A.
Reinforcement learning. We consider a Markov decision process $(\mathcal{X}, \mathcal{A}, \gamma, P, r)$ defined by a finite state space $\mathcal{X}$, finite action space $\mathcal{A}$, transition kernel $P : \mathcal{X} \times \mathcal{A} \to \mathcal{P}(\mathcal{X})$, reward function $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, and discount factor $\gamma \in [0, 1)$. For notational convenience we will write $P_x^a$ and $r_x^a$ for transitions and rewards, respectively. Policies are mappings from states to distributions over actions, $\pi \in \mathcal{P}(\mathcal{A})^{\mathcal{X}}$, and induce a value function $V^\pi : \mathcal{X} \to \mathbb{R}$ defined via the recurrence $V^\pi(x) := \mathbb{E}_{a \sim \pi(x)}\big[ r_x^a + \gamma \mathbb{E}_{x' \sim P_x^a}[V^\pi(x')] \big]$. In RL we are concerned with finding the optimal policy $\pi^* = \arg\max_{\pi \in \mathcal{P}(\mathcal{A})^{\mathcal{X}}} V^\pi$ from interaction with sample trajectories from an MDP, without knowledge of $P$ or $r$ (and sometimes not even $\mathcal{X}$), and the optimal value function $V^*$ induced by $\pi^*$.
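To make the recurrence concrete, the sketch below performs tabular policy evaluation by iterating the Bellman recurrence above until convergence; the array layout, function name, and tolerance are illustrative assumptions rather than anything prescribed by the paper.

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma, tol=1e-8):
    """Iterate V(x) <- E_{a~pi(x)}[ r_x^a + gamma * E_{x'~P_x^a}[V(x')] ] to a fixed point.

    P  : (S, A, S) transition kernel, P[x, a, x'] = P_x^a(x').
    r  : (S, A) rewards, r[x, a] = r_x^a.
    pi : (S, A) policy, pi[x, a] = pi(a|x).
    """
    r_pi = np.einsum("xa,xa->x", pi, r)    # policy-averaged rewards r^pi_x
    P_pi = np.einsum("xa,xay->xy", pi, P)  # policy-averaged transitions P^pi_x(x')
    V = np.zeros(P.shape[0])
    while True:
        V_new = r_pi + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```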
State similarity and bisimulation metrics. Various notions of similarity between states in MDPs have been considered in the RL literature, with applications in policy transfer, state aggregation, and representation learning. The bisimulation metric [Ferns et al., 2004] is of particular relevance for this paper, and defines state similarity in an MDP by declaring two states $x, y \in \mathcal{X}$ to be close if their immediate rewards are similar, and the transition dynamics at each state lead to next states which are also judged to be similar. This self-referential notion is mathematically formalised by defining the bisimulation metric $d^\sim$ as the unique fixed point of the operator $T_K : \mathbb{M}(\mathcal{X}) \to \mathbb{M}(\mathcal{X})$, where $\mathbb{M}(\mathcal{X}) = \{d \in [0, \infty)^{\mathcal{X} \times \mathcal{X}} : d \text{ symmetric and satisfies the triangle inequality}\}$ is the set of pseudometrics on $\mathcal{X}$, given by
$$T_K(d)(x, y) = \max_{a \in \mathcal{A}} \big[ |r_x^a - r_y^a| + \gamma W_d(P_x^a, P_y^a) \big].$$
Here, $W_d$ is the Kantorovich distance (also known as the Wasserstein distance) over the set of distributions $\mathcal{P}(\mathcal{X})$ with base distance $d$, defined by $W_d(\mu, \nu) = \inf_{X \sim \mu, Y \sim \nu} \mathbb{E}[d(X, Y)]$ for all $\mu, \nu \in \mathcal{P}(\mathcal{X})$, where the infimum is taken over all couplings of $(X, Y)$ with the prescribed marginals [Villani, 2008]. The mapping $T_K$ is a $\gamma$-contraction on $\mathbb{M}(\mathcal{X})$ under the $L^\infty$ norm [Ferns et al., 2011], and thus by standard contraction-mapping arguments analogous to those used to study value iteration, it has a unique fixed point, the bisimulation metric $d^\sim$. Ferns et al. [2004] show that this metric bounds differences in the optimal value function, hence its importance in RL:
$$|V^*(x) - V^*(y)| \le d^\sim(x, y) \quad \forall x, y \in \mathcal{X}. \tag{1}$$
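For small finite MDPs, the fixed-point iteration for $d^\sim$ can be sketched directly, with each Kantorovich distance solved as a small linear program; the SciPy-based solver and the helper names `kantorovich` and `bisimulation_metric` below are assumptions for illustration, not the algorithms used in the works cited, and the brute-force cost makes this viable only for very small state spaces.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(mu, nu, d):
    """W_d(mu, nu): min_P sum_ij P_ij d_ij  s.t. rows of P sum to mu, columns to nu."""
    n = len(mu)
    c = d.reshape(-1)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0      # row-sum constraints (marginal mu)
        A_eq[n + i, i::n] = 1.0               # column-sum constraints (marginal nu)
    b_eq = np.concatenate([mu, nu])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

def bisimulation_metric(P, r, gamma, iters=200):
    """Fixed-point iteration of T_K(d)(x,y) = max_a [ |r_x^a - r_y^a| + gamma W_d(P_x^a, P_y^a) ]."""
    S, A = r.shape
    d = np.zeros((S, S))
    for _ in range(iters):
        d_new = np.zeros_like(d)
        for x in range(S):
            for y in range(S):
                d_new[x, y] = max(
                    abs(r[x, a] - r[y, a]) + gamma * kantorovich(P[x, a], P[y, a], d)
                    for a in range(A)
                )
        d = d_new
    return d
```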

Representation learning in RL. In large-scale environments, it is infeasible to express value functions directly as vectors in $\mathbb{R}^{\mathcal{X} \times \mathcal{A}}$. Instead, RL agents must approximate value functions in a more concise manner, by forming a representation of the environment, that is, a feature embedding $\phi : \mathcal{X} \to \mathbb{R}^M$, and predicting state-action values linearly from these features. Representation learning is the problem of finding a useful representation $\phi$. Increasingly, deep RL agents are equipped with
additional losses to aid representation learning. A common approach is to require the agent to make
additional predictions (so-called auxiliary tasks) with its representation, typically with the aid of
extra network parameters, with the intuition that an agent is more likely to learn useful features if it is
required to solve many related tasks. We refer to such methods as implicit representation shaping,
since improved representations are a side-effect of learning to solve auxiliary tasks.
Since bisimulation metrics capture information about the MDP beyond that summarised in value functions, they are a natural candidate for auxiliary tasks in deep reinforcement learning. Gelada et al. [2019], Agarwal et al. [2021a], and Zhang et al. [2021] introduce
auxiliary tasks based on bisimulation metrics, but require additional assumptions on the underlying
MDP in order for the metric to be learnt correctly (Lipschitz continuity, deterministic, and Gaussian
transitions, respectively). The success of these approaches provides motivation in this paper to
introduce a notion of state similarity applicable to arbitrary MDPs, without further restriction. Further,
we learn this state similarity explicitly: that is, without the aid of any additional network parameters.

3 Advantages and limitations of the bisimulation metric

The bisimulation metric $d^\sim$ is a strong notion of distance on the state space of an MDP; it is useful in policy transfer through its bound on optimal value functions [Castro and Precup, 2010], and because it is so stringent, it gives good guarantees for state aggregation [Ferns et al., 2004, Li et al., 2006].
However, it has been difficult to use at scale and compute online, for a variety of reasons that we
summarize below.
(i) Computational complexity. The metric can be computed via fixed-point iteration since the operator $T_K$ is a contraction mapping. The map $T_K$ contracts at rate $\gamma$ with respect to the $L^\infty$ norm on $\mathbb{M}$, and therefore obtaining an $\varepsilon$-approximation of $d^\sim$ under this norm requires $O(\log(1/\varepsilon)/\log(1/\gamma))$ applications of $T_K$ to an initial pseudometric $d_0$. The cost of each application of $T_K$ is dominated by the computation of $|\mathcal{X}|^2|\mathcal{A}|$ Kantorovich distances $W_d$ for distributions over $\mathcal{X}$, each costing $\tilde{O}(|\mathcal{X}|^{2.5})$ in theory [Lee and Sidford, 2014], and $\tilde{O}(|\mathcal{X}|^3)$ in practice [Pele and Werman, 2009, Guo et al., 2020a, Peyré and Cuturi, 2019]. Thus, the overall practical cost is $\tilde{O}(|\mathcal{X}|^5 |\mathcal{A}| \log(\varepsilon)/\log(\gamma))$.
(ii) Bias under sampled transitions. Computing $T_K$ requires access to the transition probability distributions $P_x^a$ for each $(x, a) \in \mathcal{X} \times \mathcal{A}$ which, as mentioned in Section 2, are typically not available; instead, stochastic approximations to the operator of interest are employed. Whilst there has been work studying online, sample-based approximate computation of the bisimulation metric [Ferns et al., 2006, Comanici et al., 2012], these methods are generally biased, in contrast to sample-based estimation of standard RL operators.
(iii) Lack of connection to non-optimal policies. One of the principal behavioural characterisations of the bisimulation metric $d^\sim$ is the upper bound shown in Equation (1). However, in general we do not have $|V^\pi(x) - V^\pi(y)| \le d^\sim(x, y)$ for arbitrary policies $\pi \in \Pi$; a simple example is illustrated in Figure 2. More generally, notions of state similarity that the bisimulation metric encodes may not be closely related to behavioural similarity under an arbitrary policy $\pi$. Thus, learning about $d^\sim$ may not in itself be useful for large-scale reinforcement learning agents.
Property (i) expresses the intrinsic computational difficulty of computing this metric. Property (ii) illustrates the problems associated with attempting to move from operator-based computation to online, sample-based computation of the metric (for example, when the environment dynamics are unknown). Finally, property (iii) shows that even if the metric is computable exactly, the information it yields about the MDP may not be practically useful. Although $\pi$-bisimulation (introduced by Castro [2020] and extended by Zhang et al. [2021]) addresses property (iii), their practical algorithms are limited to MDPs with deterministic transitions [Castro, 2020] or MDPs with Gaussian transition kernels [Zhang et al., 2021]. Taken together, these three properties motivate the search for a metric without these shortcomings, which can be used in combination with deep reinforcement learning.

Figure 2: MDP illustrating that the upper bound for an arbitrary $\pi$ is not generally satisfied. Here, $d^\sim(x, y) = (1 - \gamma)^{-1}$, but for $\pi(b|x) = 1$, $\pi(a|y) = 1$, we have $|V^\pi(x) - V^\pi(y)| = k(1 - \gamma)^{-1}$.

4 The MICo distance
We now present a new notion of distance for state similarity, which we refer to as MICo (Matching
under Independent Couplings), designed to overcome the drawbacks described above.
Motivated by the drawbacks described in Section 3, we make several modifications to the operator $T_K$ introduced above: (i) in order to deal with the prohibitive cost of computing the Kantorovich distance, which optimizes over all couplings of the distributions $P_x^a$ and $P_y^a$, we use the independent coupling; (ii) to deal with the lack of connection to non-optimal policies, we consider an on-policy variant of the metric, pertaining to a chosen policy $\pi \in \mathcal{P}(\mathcal{A})^{\mathcal{X}}$. This leads us to the following definition.

Definition 4.1. Given $\pi \in \mathcal{P}(\mathcal{A})^{\mathcal{X}}$, the MICo update operator $T_M^\pi : \mathbb{R}^{\mathcal{X} \times \mathcal{X}} \to \mathbb{R}^{\mathcal{X} \times \mathcal{X}}$ is given by
$$(T_M^\pi U)(x, y) = |r_x^\pi - r_y^\pi| + \gamma \mathbb{E}_{x' \sim P_x^\pi,\, y' \sim P_y^\pi}[U(x', y')] \tag{2}$$
for all $U : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, with $r_x^\pi = \sum_{a \in \mathcal{A}} \pi(a|x) r_x^a$ and $P_x^\pi = \sum_{a \in \mathcal{A}} \pi(a|x) P_x^a(\cdot)$ for all $x \in \mathcal{X}$.

As with the bisimulation operator, this can be thought of as encoding desired properties of a notion of
similarity between states in a self-referential manner; the similarity of two states x, y ∈ X should be
determined by the similarity of the rewards and the similarity of the states they lead to.
Proposition 4.2. The operator $T_M^\pi$ is a contraction mapping on $\mathbb{R}^{\mathcal{X} \times \mathcal{X}}$ with respect to the $L^\infty$ norm.

Proof. See Appendix B.

The following corollary now follows immediately from Banach's fixed-point theorem and the completeness of $\mathbb{R}^{\mathcal{X} \times \mathcal{X}}$ under the $L^\infty$ norm.

Corollary 4.3. The MICo operator $T_M^\pi$ has a unique fixed point $U^\pi \in \mathbb{R}^{\mathcal{X} \times \mathcal{X}}$, and repeated application of $T_M^\pi$ to any initial function $U \in \mathbb{R}^{\mathcal{X} \times \mathcal{X}}$ converges to $U^\pi$.
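A minimal tabular sketch of Corollary 4.3, assuming the same array layout as the earlier policy-evaluation sketch: repeatedly apply $T_M^\pi$ starting from $U \equiv 0$ until the iterates stop changing. The helper name and tolerance are assumptions.

```python
import numpy as np

def mico_distance(P, r, pi, gamma, tol=1e-8):
    """Fixed point of (T^pi_M U)(x,y) = |r^pi_x - r^pi_y| + gamma E_{x'~P^pi_x, y'~P^pi_y}[U(x',y')]."""
    r_pi = np.einsum("xa,xa->x", pi, r)      # r^pi_x
    P_pi = np.einsum("xa,xay->xy", pi, P)    # P^pi_x(x')
    reward_gap = np.abs(r_pi[:, None] - r_pi[None, :])
    U = np.zeros_like(reward_gap)
    while True:
        # Independent coupling: E[U(x', y')] = sum_{x', y'} P^pi_x(x') P^pi_y(y') U(x', y')
        U_new = reward_gap + gamma * P_pi @ U @ P_pi.T
        if np.max(np.abs(U_new - U)) < tol:
            return U_new
        U = U_new
```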

Having defined a new operator, and shown that it has a corresponding fixed-point, there are two
questions to address: Does this new notion of distance overcome the drawbacks of the bisimulation
metric described above; and what does this new object tell us about the underlying MDP?

4.1 Addressing the drawbacks of the bisimulation metric

We introduced the MICo distance as a means of overcoming some of the shortcomings associated
with the bisimulation metric, described in Section 3. In this section, we provide a series of results that
show that the newly-defined notion of distance addresses each of these shortcomings. The proofs of
these results rely on the following lemma, connecting the MICo operator to a lifted MDP. This result
is crucial for much of the analysis that follows, so we describe the proof in full detail.
Lemma 4.4 (Lifted MDP). The MICo operator $T_M^\pi$ is the Bellman evaluation operator for an auxiliary MDP.

Proof. Given the MDP specified by the tuple $(\mathcal{X}, \mathcal{A}, P, R)$, we construct an auxiliary MDP $(\widetilde{\mathcal{X}}, \widetilde{\mathcal{A}}, \widetilde{P}, \widetilde{R})$ by taking the state space to be $\widetilde{\mathcal{X}} = \mathcal{X}^2$, the action space to be $\widetilde{\mathcal{A}} = \mathcal{A}^2$, the transition dynamics to be given by $\widetilde{P}_{(u,v)}^{(a,b)}((x, y)) = P_u^a(x) P_v^b(y)$ for all $(x, y), (u, v) \in \mathcal{X}^2$, $a, b \in \mathcal{A}$, and the action-independent rewards to be $\widetilde{R}_{(x,y)} = |r_x^\pi - r_y^\pi|$ for all $x, y \in \mathcal{X}$. The Bellman evaluation operator $\widetilde{T}^{\tilde{\pi}}$ for this auxiliary MDP at discount rate $\gamma$ under the policy $\tilde{\pi}(a, b | x, y) = \pi(a|x)\pi(b|y)$ is given by (for all $U \in \mathbb{R}^{\mathcal{X} \times \mathcal{X}}$ and $(x, y) \in \mathcal{X} \times \mathcal{X}$):
$$(\widetilde{T}^{\tilde{\pi}} U)(x, y) = \widetilde{R}_{(x,y)} + \gamma \sum_{(x', y') \in \mathcal{X}^2} \sum_{a, b \in \mathcal{A}} \tilde{\pi}(a, b | x, y)\, \widetilde{P}_{(x,y)}^{(a,b)}((x', y'))\, U(x', y')$$
$$= |r_x^\pi - r_y^\pi| + \gamma \sum_{(x', y') \in \mathcal{X}^2} P_x^\pi(x') P_y^\pi(y')\, U(x', y') = (T_M^\pi U)(x, y)\,.$$

Remark 4.5. Ferns and Precup [2014] noted that the bisimulation metric can be interpreted as the optimal value function in a related MDP, and that the operator $T_K$ can be interpreted as a Bellman optimality operator. However, their proof was non-constructive, the related MDP being characterised via the solution of an optimal transport problem. In contrast, the connection described above is constructive, and will be useful in understanding many of the theoretical properties of MICo. Ferns and Precup [2014] also note that the $W_d$ distance in the definition of $T_K$ can be upper-bounded by taking a restricted class of couplings of the transition distributions. The MICo metric can be viewed as restricting the coupling class precisely to the singleton containing the independent coupling.
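Lemma 4.4 also suggests a direct way to compute $U^\pi$ in the tabular setting: the lifted MDP has transition matrix $P^\pi \otimes P^\pi$ and rewards $|r_x^\pi - r_y^\pi|$, so its policy-evaluation fixed point can be obtained by one linear solve. The sketch below assumes the same tabular inputs as the earlier sketches; for small state spaces it agrees with the fixed-point iteration above and can serve as a ground-truth check.

```python
import numpy as np

def mico_via_lifted_mdp(P, r, pi, gamma):
    """Solve U^pi = (I - gamma * P^pi (x) P^pi)^{-1} R_tilde, where (x) is the Kronecker product."""
    S = P.shape[0]
    r_pi = np.einsum("xa,xa->x", pi, r)
    P_pi = np.einsum("xa,xay->xy", pi, P)
    R_tilde = np.abs(r_pi[:, None] - r_pi[None, :]).reshape(-1)   # rewards of the lifted MDP
    P_tilde = np.kron(P_pi, P_pi)                                 # transitions of the lifted MDP
    U = np.linalg.solve(np.eye(S * S) - gamma * P_tilde, R_tilde)
    return U.reshape(S, S)
```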

With Lemma 4.4 established, we can now address each of the points (i), (ii), and (iii) from Section 3.

(i) Computational complexity. The key result regarding the computational complexity of computing the MICo distance is as follows.

Proposition 4.6 (MICo computational complexity). The computational complexity of computing an $\varepsilon$-approximation in $L^\infty$ to the MICo metric is $O(|\mathcal{X}|^4 \log(\varepsilon)/\log(\gamma))$.

Proof. Since, by Proposition 4.2, the operator $T_M^\pi$ is a $\gamma$-contraction under $L^\infty$, we require $O(\log(\varepsilon)/\log(\gamma))$ applications of the operator to obtain an $\varepsilon$-approximation in $L^\infty$. Each iteration of value iteration updates $|\mathcal{X}|^2$ table entries, and the cost of each update is $O(|\mathcal{X}|^2)$, leading to an overall cost of $O(|\mathcal{X}|^4 \log(\varepsilon)/\log(\gamma))$.

In contrast to the bisimulation metric, this represents a computational saving of $O(|\mathcal{X}|)$, which arises from the lack of a need to solve optimal transport problems over the state space in computing the MICo distance. There is a further saving of $O(|\mathcal{A}|)$ that arises since MICo focuses on an individual policy $\pi$, and so does not require the max over actions in the bisimulation operator definition.
(ii) Online approximation. Due to the interpretation of the MICo operator $T_M^\pi$ as the Bellman evaluation operator in an auxiliary MDP, established in Lemma 4.4, algorithms and associated proofs of correctness for computing the MICo distance online can be straightforwardly derived from standard online algorithms for policy evaluation. We describe a straightforward approach, based on the TD(0) algorithm, and also note that the wide range of online policy evaluation methods incorporating off-policy corrections and multi-step returns, as well as techniques for applying such methods at scale, may also be used.
Given a current estimate $U_t$ of the fixed point of $T_M^\pi$ and a pair of observations $(x, a, r, x')$, $(y, b, \tilde{r}, y')$ generated under $\pi$, we can define a new estimate $U_{t+1}$ via
$$U_{t+1}(x, y) \leftarrow (1 - \epsilon_t(x, y))\, U_t(x, y) + \epsilon_t(x, y)\big(|r - \tilde{r}| + \gamma U_t(x', y')\big) \tag{3}$$
and $U_{t+1}(\tilde{x}, \tilde{y}) = U_t(\tilde{x}, \tilde{y})$ for all other state pairs $(\tilde{x}, \tilde{y}) \neq (x, y)$, for some sequence of stepsizes $\{\epsilon_t(x, y) \mid t \ge 0, (x, y) \in \mathcal{X}^2\}$. Sufficient conditions for convergence of this algorithm can be deduced straightforwardly from corresponding conditions for TD(0). We state one such result below. An important caveat is that the correctness of this particular algorithm depends on rewards depending only on state; one can switch to state-action metrics if this hypothesis is not satisfied.
Proposition 4.7. Suppose rewards depend only on state, and consider the sequence of estimates $(U_t)_{t \ge 0}$, with $U_0$ initialised arbitrarily, and $U_{t+1}$ updated from $U_t$ via a pair of transitions $(x_t, a_t, r_t, x_t')$, $(y_t, b_t, \tilde{r}_t, y_t')$ as in Equation (3). If all state pairs are updated infinitely often, and the stepsizes for these updates satisfy the Robbins-Monro conditions, then $U_t \to U^\pi$ almost surely.

Proof. Under the assumptions of the proposition, the update described is exactly a TD(0) update in the lifted MDP described in Lemma 4.4. We can therefore appeal to Proposition 4.5 of Bertsekas and Tsitsiklis [1996] to obtain the result.

Thus, in contrast to the Kantorovich metric, convergence to the exact MICo metric is possible with an
online algorithm that uses sampled transitions.
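The sketch below illustrates the update in Equation (3) on a tabular estimate; the `sample_transition` callback and the constant step size are assumptions for illustration (Proposition 4.7 requires Robbins-Monro step sizes and all state pairs to be updated infinitely often for exact convergence).

```python
import numpy as np

def mico_td_update(U, x, y, r_x, r_y, x_next, y_next, gamma, step_size):
    """One sample-based update of Equation (3) on a tabular estimate U."""
    target = abs(r_x - r_y) + gamma * U[x_next, y_next]
    U[x, y] = (1.0 - step_size) * U[x, y] + step_size * target
    return U

def learn_mico_online(sample_transition, num_states, gamma,
                      num_steps=100_000, step_size=0.1):
    """Illustrative loop: `sample_transition` is a hypothetical callback returning a
    transition (x, r, x') from a trajectory generated under the policy pi."""
    U = np.zeros((num_states, num_states))
    for _ in range(num_steps):
        x, r_x, x_next = sample_transition()
        y, r_y, y_next = sample_transition()
        U = mico_td_update(U, x, y, r_x, r_y, x_next, y_next, gamma, step_size)
    return U
```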
(iii) Relationship to underlying policy. In contrast to the bisimulation metric, we have the following on-policy guarantee for the MICo metric.

Proposition 4.8. For any $\pi \in \mathcal{P}(\mathcal{A})^{\mathcal{X}}$ and states $x, y \in \mathcal{X}$, we have $|V^\pi(x) - V^\pi(y)| \le U^\pi(x, y)$.

Proof. We apply a coinductive argument [Kozen, 2007] to show that if $|V^\pi(x) - V^\pi(y)| \le U(x, y)$ for all $x, y \in \mathcal{X}$, for some $U \in \mathbb{R}^{\mathcal{X} \times \mathcal{X}}$ symmetric in its two arguments, then we also have $|V^\pi(x) - V^\pi(y)| \le (T_M^\pi U)(x, y)$ for all $x, y \in \mathcal{X}$. Since the hypothesis holds for the constant function $U(x, y) = 2\max_{z, a}|r(z, a)|/(1 - \gamma)$, and $T_M^\pi$ contracts around $U^\pi$, the conclusion then follows. Therefore, suppose the hypothesis holds. Then we have
$$V^\pi(x) - V^\pi(y) = r_x^\pi - r_y^\pi + \gamma \sum_{x' \in \mathcal{X}} P_x^\pi(x') V^\pi(x') - \gamma \sum_{y' \in \mathcal{X}} P_y^\pi(y') V^\pi(y')$$
$$\le |r_x^\pi - r_y^\pi| + \gamma \sum_{x', y' \in \mathcal{X}} P_x^\pi(x') P_y^\pi(y') \big(V^\pi(x') - V^\pi(y')\big)$$
$$\le |r_x^\pi - r_y^\pi| + \gamma \sum_{x', y' \in \mathcal{X}} P_x^\pi(x') P_y^\pi(y')\, U(x', y') = (T_M^\pi U)(x, y)\,.$$
By symmetry, $V^\pi(y) - V^\pi(x) \le (T_M^\pi U)(x, y)$, as required.

4.2 Diffuse metrics

To characterize the nature of the fixed point U π , we introduce the notion of a diffuse metric.
Definition 4.9. Given a set X , a function d : X × X → R is a diffuse metric if the following
axioms hold: (i) d(x, y) ≥ 0 for any x, y ∈ X ; (ii) d(x, y) = d(y, x) for any x, y ∈ X ; (iii)
d(x, y) ≤ d(x, z) + d(y, z) ∀x, y, z ∈ X .

These differ from the standard metric axioms in the first point: we no longer require that a point
has zero self-distance, and two distinct points may have zero distance. Notions of this kind are
increasingly common in machine learning as researchers develop more computationally tractable
versions of distances, as with entropy-regularised optimal transport distances [Cuturi, 2013], which
also do not satisfy the axiom of zero self-distance.
An example of a diffuse metric is the Łukaszyk–Karmowski distance [Łukaszyk, 2004], which is used in the MICo metric as the operator between the next-state distributions. Given a diffuse metric space $(\mathcal{X}, \rho)$, the Łukaszyk–Karmowski distance $d_{LK}^\rho$ is a diffuse metric on probability measures on $\mathcal{X}$ given by $d_{LK}^\rho(\mu, \nu) = \mathbb{E}_{x \sim \mu, y \sim \nu}[\rho(x, y)]$. This example demonstrates the origin of the name diffuse metrics: the non-zero self-distances arise from a point being spread across a probability distribution. In terms of the Łukaszyk–Karmowski distance, the MICo distance can be written as the fixed point $U^\pi(x, y) = |r_x^\pi - r_y^\pi| + d_{LK}(U^\pi)(P_x^\pi, P_y^\pi)$. This characterisation leads to the following result.
Proposition 4.10. The MICo distance is a diffuse metric.

Proof. Non-negativity and symmetry of $U^\pi$ are clear, so it remains to check the triangle inequality. To do this, we define a sequence of iterates $(U_k)_{k \ge 0}$ in $\mathbb{R}^{\mathcal{X} \times \mathcal{X}}$ by $U_0(x, y) = 0$ for all $x, y \in \mathcal{X}$, and $U_{k+1} = T_M^\pi U_k$ for each $k \ge 0$. Recall from Corollary 4.3 that $U_k \to U^\pi$. We will show by induction that each $U_k$ satisfies the triangle inequality; by taking limits on either side of the inequality, we will then recover that $U^\pi$ itself satisfies the triangle inequality. The base case of the inductive argument is clear from the choice of $U_0$. For the inductive step, assume that for some $k \ge 0$, $U_k(x, y) \le U_k(x, z) + U_k(z, y)$ for all $x, y, z \in \mathcal{X}$. Now for any $x, y, z \in \mathcal{X}$, we have
$$U_{k+1}(x, y) = |r_x^\pi - r_y^\pi| + \gamma \mathbb{E}_{X' \sim P_x^\pi, Y' \sim P_y^\pi}[U_k(X', Y')]$$
$$\le |r_x^\pi - r_z^\pi| + |r_z^\pi - r_y^\pi| + \gamma \mathbb{E}_{X' \sim P_x^\pi, Y' \sim P_y^\pi, Z' \sim P_z^\pi}[U_k(X', Z') + U_k(Z', Y')]$$
$$= U_{k+1}(x, z) + U_{k+1}(z, y)\,.$$

It is interesting to note that a state $x \in \mathcal{X}$ has zero self-distance iff the Markov chain induced by $\pi$ initialised at $x$ is deterministic, and the magnitude of a state's self-distance is indicative of the amount of "dispersion" in the distribution. Hence, in general, we have $U^\pi(x, x) > 0$, and $U^\pi(x, x) \neq U^\pi(y, y)$ for distinct states $x, y \in \mathcal{X}$. See the appendix for further discussion of diffuse metrics and related constructions.
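A small numeric sketch of the self-distance behaviour discussed above, using the Łukaszyk–Karmowski distance on a three-point space; the support and distributions are arbitrary choices for illustration.

```python
import numpy as np

def lk_distance(mu, nu, rho):
    """d_LK^rho(mu, nu) = E_{x~mu, y~nu}[rho(x, y)] for finite distributions (rho is a matrix)."""
    return float(mu @ rho @ nu)

rho = np.abs(np.subtract.outer(np.arange(3.0), np.arange(3.0)))  # base metric on {0, 1, 2}
point_mass = np.array([1.0, 0.0, 0.0])
spread = np.array([0.5, 0.0, 0.5])

print(lk_distance(point_mass, point_mass, rho))  # 0.0: a point mass has zero self-distance
print(lk_distance(spread, spread, rho))          # 1.0: dispersion yields non-zero self-distance
```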

5 The MICo loss


The impetus of our work is the development of principled mechanisms for directly shaping the representations used by RL agents so as to improve their learning. In this section we present a novel loss based on the MICo update operator $T_M^\pi$ given in Equation (2) that can be incorporated into any RL agent. Given the fact that MICo is a diffuse metric that can admit non-zero self-distances, special care needs to be taken in how these distances are learnt; indeed, traditional mechanisms for measuring distances between representations (e.g. Euclidean and cosine distances) are geometrically-based and enforce zero self-distances.

Figure 3: Left: Illustration of the network architecture for learning MICo; Right: The projection of MICo distances onto representation space.
We assume an RL agent learning an estimate $Q_{\xi,\omega}$ defined by the composition of two function approximators $\psi$ and $\phi$ with parameters $\xi$ and $\omega$, respectively: $Q_{\xi,\omega}(x, \cdot) = \psi_\xi(\phi_\omega(x))$ (note that this can be the critic in an actor-critic algorithm such as SAC). We will refer to $\phi_\omega(x)$ as the representation of state $x$ and aim to make distances between representations match the MICo distance; we refer to $\psi_\xi$ as the value approximator. We define the parameterized representation distance $U_\omega$ as an approximant to $U^\pi$:
$$U^\pi(x, y) \approx U_\omega(x, y) := \frac{\|\phi_\omega(x)\|_2^2 + \|\phi_\omega(y)\|_2^2}{2} + \beta\,\theta(\phi_\omega(x), \phi_\omega(y))$$
where $\theta(\phi_\omega(x), \phi_\omega(y))$ is the angle between the vectors $\phi_\omega(x)$ and $\phi_\omega(y)$ and $\beta$ is a scalar (in our results we use $\beta = 0.1$, but present results with other values of $\beta$ in the appendix).
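A sketch of the parameterisation $U_\omega$ in NumPy; the `arccos`-with-clipping implementation of the angle is an assumption for numerical stability and may differ from the formulation used in the released code.

```python
import numpy as np

def angular_distance(u, v, eps=1e-8):
    """theta(u, v): angle between representation vectors u and v."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def parameterized_mico(phi_x, phi_y, beta=0.1):
    """U_omega(x, y) = (||phi(x)||^2 + ||phi(y)||^2) / 2 + beta * theta(phi(x), phi(y))."""
    norm_term = 0.5 * (np.dot(phi_x, phi_x) + np.dot(phi_y, phi_y))
    return norm_term + beta * angular_distance(phi_x, phi_y)
```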
Based on Equation (2), our learning target is then $T_{\bar\omega}^U(r_x, x', r_y, y') = |r_x - r_y| + \gamma U_{\bar\omega}(x', y')$, where $\bar\omega$ is a separate copy of the network parameters that is synchronised with $\omega$ at infrequent intervals. This is a common practice that was introduced by Mnih et al. [2015] (and in fact, we use the same update schedule they propose). The loss for this learning target is
$$\mathcal{L}_{\mathrm{MICo}}(\omega) = \mathbb{E}_{\langle x, r_x, x'\rangle, \langle y, r_y, y'\rangle}\Big[\big(T_{\bar\omega}^U(r_x, x', r_y, y') - U_\omega(x, y)\big)^2\Big]$$
where $\langle x, r_x, x'\rangle$ and $\langle y, r_y, y'\rangle$ are pairs of transitions sampled from the agent's replay buffer. We can combine $\mathcal{L}_{\mathrm{MICo}}$ with the temporal-difference loss $\mathcal{L}_{\mathrm{TD}}$ of any RL agent as $(1 - \alpha)\mathcal{L}_{\mathrm{TD}} + \alpha\mathcal{L}_{\mathrm{MICo}}$, where $\alpha \in (0, 1)$. Each sampled mini-batch is used for both the MICo and TD losses. Figure 3 (left) illustrates the network architecture used for learning.
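A minimal sketch of $\mathcal{L}_{\mathrm{MICo}}$ over a sampled mini-batch, reusing `parameterized_mico` from the sketch above; pairing each sample with a rolled copy of the batch, and using a plain squared error (rather than the Huber loss adopted in Section 6), are simplifying assumptions and not necessarily how the released agents form the pairs.

```python
import numpy as np

def mico_loss(phi_online, rewards, phi_next_target, gamma, beta=0.1):
    """Squared MICo loss over a batch; each state is paired with a rolled copy of the batch.

    phi_online      : (B, M) online representations phi_omega(x)
    rewards         : (B,)   rewards r_x
    phi_next_target : (B, M) target-network representations phi_omega_bar(x')
    """
    roll = lambda a: np.roll(a, shift=1, axis=0)   # pair each sample with another from the batch
    u_online = np.array([parameterized_mico(u, v, beta)
                         for u, v in zip(phi_online, roll(phi_online))])
    u_next = np.array([parameterized_mico(u, v, beta)
                       for u, v in zip(phi_next_target, roll(phi_next_target))])
    target = np.abs(rewards - roll(rewards)) + gamma * u_next
    return float(np.mean((target - u_online) ** 2))
```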
Although the loss $\mathcal{L}_{\mathrm{MICo}}$ is designed to learn the MICo diffuse metric $U^\pi$, the values of the metric itself are parametrised through $U_\omega$ defined above, which is constituted by several distinct terms. This appears to leave a question as to how the representations $\phi_\omega(x)$ and $\phi_\omega(y)$, as Euclidean vectors, are related to one another when the MICo loss is minimised. Careful inspection of the form of $U_\omega(x, y)$ shows that the (scaled) angular distance between $\phi_\omega(x)$ and $\phi_\omega(y)$ can be recovered from $U_\omega$ by subtracting the learnt approximations to the self-distances $U^\pi(x, x)$ and $U^\pi(y, y)$ (see Figure 3, right). We therefore define the reduced MICo distance $\Pi U^\pi$, which encodes the distances enforced between the representation vectors $\phi_\omega(x)$ and $\phi_\omega(y)$, by
$$\beta\,\theta(\phi_\omega(x), \phi_\omega(y)) \approx \Pi U^\pi(x, y) = U^\pi(x, y) - \tfrac{1}{2} U^\pi(x, x) - \tfrac{1}{2} U^\pi(y, y)\,.$$
In the following section we investigate two questions: (1) How informative of $V^\pi$ is $\Pi U^\pi$? and (2) How useful are the features induced by $\Pi U^\pi$ for policy evaluation? We conduct these investigations on tabular environments where we can compute the metrics exactly, which helps clarify the behaviour of our loss when combined with deep networks in Section 6.

5.1 Value bound gaps

Although Proposition 4.8 states that we have $|V^\pi(x) - V^\pi(y)| \le U^\pi(x, y)$, we do not, in general, have the same upper bound for $\Pi U^\pi(x, y)$, as demonstrated by the following result.

[Figure 4 appears here: left panel, average $d(x, y) - |V^\pi(x) - V^\pi(y)|$ versus number of states; right panel, average linear regression error versus number of features; legend entries include $U^\pi$, $\Pi U^\pi$, PVF, and RF.]

Figure 4: Left: The gap between the difference in values and the various distances for Garnet MDPs
with varying numbers of actions (represented by circle sizes); Right: Average error when performing
linear regression on varying numbers of features in the four-rooms GridWorld, averaged over 10 runs;
shaded areas represent 95% confidence intervals.
Lemma 5.1. There exists an MDP with $x, y \in \mathcal{X}$ and $\pi \in \Pi$ where $|V^\pi(x) - V^\pi(y)| > \Pi U^\pi(x, y)$.

Proof. Consider a single-action MDP with two states ($x$ and $y$) where $y$ is absorbing, $x$ transitions with equal probability to $x$ and $y$, and a reward of 1 is received only upon taking an action from state $x$. There is only one policy for this MDP, which yields the value function $V(x) \approx 1.8$ and $V(y) = 0$. The MICo distance gives $U(x, x) \approx 1.06$, $U(x, y) \approx 1.82$, and $U(y, y) = 0$, while the reduced MICo distance yields $\Pi U(x, x) = \Pi U(y, y) = 0$ and $\Pi U(x, y) \approx 1.29 < |V(x) - V(y)| = 1.8$.

Despite this negative result, it is worth evaluating how often in practice this inequality is violated and
by how much, as this directly impacts the utility of this distance for learning representations.
To do so, we make use of Garnet MDPs, a class of randomly generated MDPs [Archibald et al., 1995, Piot et al., 2014]. Given a specified number of states $n_{\mathcal{X}}$ and number of actions $n_{\mathcal{A}}$, Garnet$(n_{\mathcal{X}}, n_{\mathcal{A}})$ is generated as follows: (1) the branching factor $b_{x,a}$ of each transition $P_x^a$ is sampled uniformly from $[1 : n_{\mathcal{X}}]$; (2) $b_{x,a}$ states are picked uniformly at random from $\mathcal{X}$ and assigned a random value in $[0, 1]$; these values are then normalized to produce a proper distribution $P_x^a$; (3) each $r_x^a$ is sampled uniformly in $[0, 1]$.

For each Garnet$(n_{\mathcal{X}}, n_{\mathcal{A}})$ we sample 100 stochastic policies $\{\pi_i\}$ and compute the average gap
$$\frac{1}{100 |\mathcal{X}|^2} \sum_i \sum_{x, y} d(x, y) - |V^{\pi_i}(x) - V^{\pi_i}(y)|\,,$$
where $d$ stands for any of the considered metrics. Note we are measuring the signed difference, as we are interested in the frequency with which the upper bound is violated. As seen in Figure 4 (left), our metric does on average provide an upper bound on the difference in values that is also tighter than the bounds provided by $U^\pi$ and $\pi$-bisimulation. This suggests that the resulting representations remain informative of value similarities.
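The generation procedure and the signed gap above can be sketched as follows, reusing `policy_evaluation` and `mico_distance` from the earlier sketches; sampling the stochastic policies from a uniform Dirichlet is an assumption, as the paper does not specify the sampling scheme.

```python
import numpy as np

def garnet(num_states, num_actions, rng):
    """Generate a Garnet(n_X, n_A) MDP as described above."""
    P = np.zeros((num_states, num_actions, num_states))
    for x in range(num_states):
        for a in range(num_actions):
            b = rng.integers(1, num_states + 1)                  # branching factor b_{x,a}
            support = rng.choice(num_states, size=b, replace=False)
            weights = rng.random(b)
            P[x, a, support] = weights / weights.sum()
    r = rng.random((num_states, num_actions))                    # r_x^a ~ U[0, 1]
    return P, r

def average_gap(P, r, gamma, d_fn, num_policies=100, rng=None):
    """(1 / (num_policies * |X|^2)) * sum_i sum_{x,y} [ d(x,y) - |V^{pi_i}(x) - V^{pi_i}(y)| ]."""
    rng = np.random.default_rng(0) if rng is None else rng
    S, A = r.shape
    total = 0.0
    for _ in range(num_policies):
        pi = rng.dirichlet(np.ones(A), size=S)                   # a random stochastic policy
        V = policy_evaluation(P, r, pi, gamma)
        d = d_fn(P, r, pi, gamma)                                # e.g. mico_distance
        total += np.sum(d - np.abs(V[:, None] - V[None, :]))
    return total / (num_policies * S * S)
```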

5.2 State features

In order to investigate the usefulness of the representations produced by $\Pi U^\pi$, we construct state features directly by using the computed distances to project the states into a lower-dimensional space with the UMAP dimensionality reduction algorithm [McInnes et al., 2018].² We then apply linear regression of the true value function $V^\pi$ against the features to compute $\hat{V}^\pi$ and measure the average error across the state space. As baselines we compare against random features (RF), Proto-Value Functions (PVF) [Mahadevan and Maggioni, 2007], and the features produced by $\pi$-bisimulation [Castro, 2020]. We present our results on the well-known four-rooms GridWorld [Sutton et al., 1999] in Figure 4 (right) and provide results on more environments in the appendix. Despite the independent couplings, $\Pi U^\pi$ performs on par with $\pi$-bisimulation, which optimizes over all couplings.

²Note that since UMAP expects a metric, it is ill-defined with the diffuse metric $U^\pi$.
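A sketch of this evaluation protocol, assuming the umap-learn and scikit-learn packages; the distance matrix passed in (for example one built from $\Pi U^\pi$ or from the $\pi$-bisimulation metric) is treated as precomputed, and the helper name is an assumption.

```python
import numpy as np
import umap                                   # umap-learn package
from sklearn.linear_model import LinearRegression

def feature_regression_error(distance_matrix, V, num_features):
    """Embed states with UMAP from a precomputed distance matrix, then regress V^pi on the features."""
    embedding = umap.UMAP(n_components=num_features,
                          metric="precomputed").fit_transform(distance_matrix)
    V_hat = LinearRegression().fit(embedding, V).predict(embedding)
    return float(np.mean(np.abs(V_hat - V)))
```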

6 Large-scale empirical evaluation


Having developed a greater understanding of the properties inherent to the representations produced by the MICo loss, we evaluate it on the Arcade Learning Environment [Bellemare et al., 2013]. We added the MICo loss to all of the JAX agents provided in the Dopamine library [Castro et al., 2018]: DQN [Mnih et al., 2015], Rainbow [Hessel et al., 2018], QR-DQN [Dabney et al., 2018b], and IQN [Dabney et al., 2018a], using a mean squared error loss to minimize $\mathcal{L}_{\mathrm{TD}}$ for DQN (as
suggested by Obando-Ceron and Castro [2021]). Given the state-of-the-art results demonstrated by the Munchausen-IQN (M-IQN) agent [Vieillard et al., 2020], we also evaluated incorporating our loss into M-IQN.³ For all experiments we used the hyperparameter settings provided with Dopamine. We found that a value of $\alpha = 0.5$ worked well with the quantile-based agents (QR-DQN, IQN, and M-IQN), while a value of $\alpha = 0.01$ worked well with DQN and Rainbow. We hypothesise that the difference in scale of the quantile, categorical, and non-distributional loss functions concerned leads to these distinct values of $\alpha$ performing well. We found it important to use the Huber loss [Huber, 1964] to minimize $\mathcal{L}_{\mathrm{MICo}}$, as this emphasizes greater accuracy for smaller distances as opposed to larger distances. We experimented with the MSE loss but found that larger distances tended to overwhelm the optimization process, thereby degrading performance.

³Given that the authors of M-IQN implemented their agent in TensorFlow (whereas our agents are in JAX), we reimplemented M-IQN in JAX and ran 5 independent runs (in contrast to the 3 run by Vieillard et al. [2020]).
We evaluated on all 60 Atari 2600 games over 5 seeds and report the results in Figure 1 (left), using the interquartile mean (IQM), proposed by Agarwal et al. [2021b] as a more robust and reliable alternative to the mean and median (which are reported in Figure 6). The fact that the MICo loss provides consistent improvements over a wide range of baseline agents of varying complexity suggests that it can help learn better representations for control.
Additionally, we evaluated the MICo loss on twelve of the from-pixels environments in the DM-Control suite [Tassa et al., 2018]. As a base agent we used Soft Actor-Critic (SAC) [Haarnoja et al., 2018] with the convolutional auto-encoder described by Yarats et al. [2019]. We applied the MICo loss on the output of the auto-encoder (with $\alpha = 10^{-5}$) and left all other parameters untouched. Recently, Zhang et al. [2021] introduced DBC, which learns a dynamics and reward model on the output of the auto-encoder; their bisimulation loss uses the learned dynamics model in the computation of the Kantorovich distance between the next-state transitions. We consider two variants of their algorithm: one which learns a stochastic dynamics model (DBC), and one which learns a deterministic dynamics model (DBC-Det). We replaced their bisimulation loss with the MICo loss (which, importantly, does not require a dynamics model) and kept all other parameters untouched. As Figure 1 (right) illustrates, the best performance is achieved by SAC augmented with the MICo loss; additionally, replacing the bisimulation loss of DBC with the MICo loss recovers the performance of DBC, matching that of SAC.
Additional details and results are provided in the appendix.

7 Related Work

Bisimulation metrics were introduced for MDPs by Ferns et al. [2004], and have been extended in a
number of directions [Ferns et al., 2005, 2006, Taylor, 2008, Taylor et al., 2009, Ferns et al., 2011,
Comanici et al., 2012, Bacci et al., 2013a,b, Abate, 2013, Ferns and Precup, 2014, Castro, 2020], with
applications including policy transfer [Castro and Precup, 2010, Santara et al., 2019], representation
learning [Ruan et al., 2015, Comanici et al., 2015], and state aggregation [Li et al., 2006].
A range of other notions of similarity in MDPs have also been considered, such as action sequence
equivalence [Givan et al., 2003], temporally extended metrics [Amortila et al., 2019], MDP homomor-
phisms [Ravindran and Barto, 2003], utile distinction [McCallum, 1996], and policy irrelevance [Jong
and Stone, 2005], as well as notions of policy similarity [Pacchiano et al., 2020, Moskovitz et al.,
2021]. Li et al. [2006] review different notions of similarity applied to state aggregation. Recently, Le
Lan et al. [2021] performed an exhaustive analysis of the continuity properties, relative to functions
of interest in RL, of a number of existing metrics in the literature.
The notion of non-zero self-distances, central to the diffuse metrics defined in this paper, is increasingly encountered in machine learning applications involving approximation of losses. Of particular note is entropy-regularised optimal transport [Cuturi, 2013] and related quantities [Genevay et al., 2018, Fatras et al., 2020, Chizat et al., 2020, Fatras et al., 2021].
More broadly, many approaches to representation learning in deep RL have been considered, such as those based on auxiliary tasks (see e.g. [Sutton et al., 2011, Jaderberg et al., 2017, Bellemare et al., 2017, François-Lavet et al., 2019, Gelada et al., 2019, Guo et al., 2020b]), and other approaches such as successor features [Dayan, 1993, Barreto et al., 2017].

8 Conclusion
In this paper, we have introduced the MICo distance, a notion of state similarity that can be learnt
at scale and from samples. We have studied the theoretical properties of MICo, and proposed
a new loss to make the non-zero self-distances of this diffuse metric compatible with function
approximation, combining it with a variety of deep RL agents to obtain strong performance on
the Arcade Learning Environment. In contrast to auxiliary losses that implicitly shape an agent’s
representation, MICo directly modifies the features learnt by a deep RL agent; our results indicate that
this helps improve performance. To the best of our knowledge, this is the first time directly shaping
the representation of RL agents has been successfully applied at scale. We believe this represents an
interesting new approach to representation learning in RL; continuing to develop theory, algorithms
and implementations for direct representation shaping in deep RL is an important and promising
direction for future work.
Broader impact statement
This work lies in the realm of “foundational RL” in that it contributes to the fundamental understanding
and development of reinforcement learning algorithms and theory. As such, although we agree on the importance of this discussion, our work is quite far removed from ethical issues and potential societal consequences.

9 Acknowledgements
The authors would like to thank Gheorghe Comanici, Rishabh Agarwal, Nino Vieillard, and Matthieu
Geist for their valuable feedback on the paper and experiments. Pablo Samuel Castro would like to
thank Roman Novak and Jascha Sohl-Dickstein for their help in getting angular distances to work
stably! Thanks to Hongyu Zang for pointing out that the x-axis labels for the SAC experiments needed
to be fixed. Finally, the authors would like to thank the reviewers (both ICML’21 and NeurIPS’21)
for helping make this paper better.

References
Alessandro Abate. Approximation metrics based on probabilistic bisimulations for general state-space
Markov processes: A survey. Electr. Notes Theor. Comput. Sci., 297:3–25, 2013.
Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, and Marc G. Bellemare. Contrastive
behavioral similarity embeddings for generalization in reinforcement learning. In International
Conference on Learning Representations (ICLR), 2021a.
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G Bellemare.
Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Information
Processing Systems (NeurIPS), 2021b.
Philip Amortila, Marc G Bellemare, Prakash Panangaden, and Doina Precup. Temporally extended
metrics for Markov decision processes. In AAAI Workshop on Safe AI, 2019.
T. W. Archibald, K. I. M. McKinnon, and L. C. Thomas. On the generation of Markov decision
processes. The Journal of the Operational Research Society, 46(3):354–361, 1995.
Giorgio Bacci, Giovanni Bacci, Kim G Larsen, and Radu Mardare. Computing behavioral distances,
compositionally. In International Symposium on Mathematical Foundations of Computer Science
(MFCS), 2013a.
Giorgio Bacci, Giovanni Bacci, Kim G Larsen, and Radu Mardare. On-the-fly exact computation of
bisimilarity distances. In International Conference on Tools and Algorithms for the Construction
and Analysis of Systems (TACAS), 2013b.
André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado Van Hasselt, and
David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural
Information Processing Systems (NIPS), 2017.
Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning
Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research,
47:253–279, June 2013.
Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement
learning. In International Conference on Machine Learning (ICML), 2017.
Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
Pablo Samuel Castro. Scalable methods for computing state similarity in deterministic Markov
Decision Processes. In AAAI Conference on Artificial Intelligence, 2020.
Pablo Samuel Castro and Doina Precup. Using bisimulation for policy transfer in MDPs. In
International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2010.
Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare.
Dopamine: A Research Framework for Deep Reinforcement Learning. 2018.
Lenaic Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré. Faster
wasserstein distance estimation with the sinkhorn divergence. Advances in Neural Information
Processing Systems (NeurIPS), 2020.
Gheorghe Comanici, Prakash Panangaden, and Doina Precup. On-the-fly algorithms for bisimulation
metrics. In International Conference on Quantitative Evaluation of Systems (QEST), 2012.
Gheorghe Comanici, Doina Precup, and Prakash Panangaden. Basis refinement strategies for linear
value function approximation in MDPs. In Advances in Neural Information Processing Systems
(NIPS), 2015.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural
Information Processing Systems (NIPS), 2013.
Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for
distributional reinforcement learning. In International Conference on Machine Learning (ICML),
2018a.

Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement
learning with quantile regression. In AAAI Conference on Artificial Intelligence, 2018b.

Peter Dayan. Improving generalization for temporal difference learning: The successor representation.
Neural Comput., 5(4):613–624, July 1993.

Josée Desharnais, Vineet Gupta, Radhakrishnan Jagadeesan, and Prakash Panangaden. Metrics for
labeled Markov systems. In International Conference on Concurrency Theory (CONCUR), 1999.

Josée Desharnais, Vineet Gupta, Radhakrishnan Jagadeesan, and Prakash Panangaden. A metric for
labelled Markov processes. Theoretical Computer Science, 318(3):323–354, June 2004.

Kilian Fatras, Younes Zine, Rémi Flamary, Rémi Gribonval, and Nicolas Courty. Learning with
minibatch Wasserstein: asymptotic and gradient properties. In International Conference on
Artificial Intelligence and Statistics (AISTATS), 2020.

Kilian Fatras, Younes Zine, Szymon Majewski, Rémi Flamary, Rémi Gribonval, and Nicolas Courty.
Minibatch optimal transport distances; analysis and applications. arXiv, 2021.

William Fedus, Carles Gelada, Yoshua Bengio, Marc G Bellemare, and Hugo Larochelle. Hyperbolic
discounting and learning over multiple horizons. arXiv, 2019.

Norm Ferns and Doina Precup. Bisimulation metrics are optimal value functions. In Conference on
Uncertainty in Artificial Intelligence (UAI), 2014.

Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes.
In Conference on Uncertainty in Artificial Intelligence (UAI), 2004.

Norm Ferns, Pablo Samuel Castro, Doina Precup, and Prakash Panangaden. Methods for comput-
ing state similarity in Markov decision processes. In Conference on Uncertainty in Artificial
Intelligence (UAI), 2006.

Norm Ferns, Prakash Panangaden, and Doina Precup. Bisimulation metrics for continuous Markov
decision processes. SIAM Journal on Computing, 40(6):1662–1714, 2011.

Norman Ferns, Prakash Panangaden, and Doina Precup. Metrics for Markov decision processes with
infinite state spaces. In Conference on Uncertainty in Artificial Intelligence (UAI), 2005.

Vincent François-Lavet, Yoshua Bengio, Doina Precup, and Joelle Pineau. Combined reinforcement
learning via abstract representations. In AAAI Conference on Artificial Intelligence, 2019.

Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G Bellemare. DeepMDP:
Learning continuous latent space models for representation learning. In International Conference
on Machine Learning (ICML), 2019.

Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with sinkhorn diver-
gences. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

Robert Givan, Thomas Dean, and Matthew Greig. Equivalence notions and model minimization in
Markov decision processes. Artificial Intelligence, 147(1-2):163–223, 2003.

Wenshuo Guo, Nhat Ho, and Michael I. Jordan. Fast algorithms for computational optimal transport
and Wasserstein barycenter. In International Conference on Artificial Intelligence and Statistics
(AISTATS), 2020a.

Zhaohan Daniel Guo, Bernardo Avila Pires, Bilal Piot, Jean-Bastien Grill, Florent Altché, Rémi
Munos, and Mohammad Gheshlaghi Azar. Bootstrap latent-predictive representations for multitask
reinforcement learning. In International Conference on Machine Learning (ICML), 2020b.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy
maximum entropy deep reinforcement learning with a stochastic actor. In International Conference
on Machine Learning (ICML), 2018.

Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan
Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining Improvements in
Deep Reinforcement learning. In AAAI Conference on Artificial Intelligence, 2018.

Peter J. Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics,
35(1):73 – 101, 1964.

Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David
Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In
International Conference on Learning Representations (ICLR), 2017.

Nicholas K Jong and Peter Stone. State abstraction discovery from irrelevant state variables. In
International Joint Conference on Artificial Intelligence (IJCAI), 2005.

Dexter Kozen. Coinductive proof principles for stochastic processes. Logical methods in computer
science, 3, November 2007.

Kim G Larsen and Arne Skou. Bisimulation through probabilistic testing. Information and Computation, 94:1–28, 1991.

Charline Le Lan, Marc G. Bellemare, and Pablo Samuel Castro. Metrics and continuity in reinforce-
ment learning. In AAAI Conference on Artificial Intelligence, 2021.

Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in $\tilde{O}(\sqrt{\mathrm{rank}})$ iterations and faster algorithms for maximum flow. In IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2014.

Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for
MDPs. In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2006.

Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael
Bowling. Revisiting the Arcade Learning Environment: Evaluation protocols and open problems
for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.

Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning
representation and control in Markov decision processes. Journal of Machine Learning Research,
8:2169–2231, December 2007.

Steve Matthews. Partial metric topology. Annals of the New York Academy of Sciences, 728(1):
183–197, 1994.

Andrew Kachites McCallum. Reinforcement Learning with Selective Perception and Hidden State.
PhD thesis, The University of Rochester, 1996.

Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. UMAP: Uniform manifold
approximation and projection. The Journal of Open Source Software, 3(29):861, 2018.

R. Milner. Communication and Concurrency. Prentice-Hall, 1989.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,
Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles
Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane
Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature,
518(7540):529–533, 2015.

Ted Moskovitz, Michael Arbel, Ferenc Huszar, and Arthur Gretton. Efficient Wasserstein natural
gradients for reinforcement learning. In International Conference on Learning Representations
(ICLR), 2021.

Johan S Obando-Ceron and Pablo Samuel Castro. Revisiting rainbow: Promoting more insightful and
inclusive deep reinforcement learning research. In International Conference on Machine Learning
(ICML), 2021.

Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Krzysztof Choromanski, Anna Choromanska,
and Michael Jordan. Learning to score behaviors for guided policy optimization. In International
Conference on Machine Learning (ICML), 2020.
Ofir Pele and Michael Werman. Fast and robust earth mover’s distances. In IEEE International
Conference on Computer Vision (ICCV), 2009.
Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends® in
Machine Learning, 11(5-6):355–607, 2019.
Bilal Piot, Matthieu Geist, and Olivier Pietquin. Difference of convex functions programming for
reinforcement learning. In Advances in Neural Information Processing Systems (NIPS). 2014.
Balaraman Ravindran and Andrew G. Barto. SMDP homomorphisms: An algebraic approach to
abstraction in semi-Markov decision processes. In International Joint Conference on Artificial
Intelligence (IJCAI), 2003.
Sherry Shanshan Ruan, Gheorghe Comanici, Prakash Panangaden, and Doina Precup. Representation
discovery for MDPs using bisimulation metrics. In AAAI Conference on Artificial Intelligence,
2015.
Anirban Santara, Rishabh Madan, Balaraman Ravindran, and Pabitra Mitra. ExTra: Transfer-guided
exploration. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS),
2019.
Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White,
and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsuper-
vised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent
Systems (AAMAS), 2011.
R.S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal
abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden,
Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller.
Deepmind control suite. arXiv, 2018.
Jonathan Taylor. Lax probabilistic bisimulation. Master’s thesis, McGill University, 2008.
Jonathan Taylor, Doina Precup, and Prakash Panagaden. Bounding performance loss in approximate
MDP homomorphisms. In Advances in Neural Information Processing Systems (NIPS), 2009.
Nino Vieillard, Olivier Pietquin, and Matthieu Geist. Munchausen reinforcement learning. In
Advances in Neural Information Processing Systems (NeurIPS), 2020.
Cédric Villani. Optimal Transport. Springer-Verlag Berlin Heidelberg, 2008.
Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, and Rob Fergus. Improving
sample efficiency in model-free reinforcement learning from images. arXiv, 2019.
Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Invariant rep-
resentations for reinforcement learning without reconstruction. In International Conference on
Learning Representations (ICLR), 2021.
Szymon Łukaszyk. A new concept of probability metric and its applications in approximation of
scattered data sets. Computational Mechanics, 33:299–304, 03 2004.

Supplementary Material:
MICo: Improved representations via sampling-based state similarity for
Markov decision processes

A Extended background material

In this section we provide a more extensive background review.

A.1 Reinforcement learning

In this section we give a slightly more expansive overview of relevant key concepts in reinforcement
learning, without the space constraints of the main paper. Denoting by P(S) the set of probability
distributions on a set S, we define a Markov decision process (X , A, γ, P, r) as:
• A finite state space X ;
• A finite action space A;
• A transition kernel P : X × A → P(X );
• A reward function r : X × A → R;
• A discount factor γ ∈ [0, 1).
For notational convenience we introduce the notation $P_x^a \in \mathcal{P}(\mathcal{X})$ for the next-state distribution given state-action pair $(x, a)$, and $r_x^a$ for the corresponding immediate reward. Policies are mappings from states to distributions over actions, $\pi \in \mathcal{P}(\mathcal{A})^{\mathcal{X}}$, and induce a value function $V^\pi : \mathcal{X} \to \mathbb{R}$ defined via the recurrence
$$V^\pi(x) := \mathbb{E}_{a \sim \pi(x)}\big[ r_x^a + \gamma \mathbb{E}_{x' \sim P_x^a}[V^\pi(x')] \big]\,.$$

It can be shown that this recurrence uniquely defines $V^\pi$ through a contraction-mapping argument [Bertsekas and Tsitsiklis, 1996].

The control problem is concerned with finding the optimal policy
$$\pi^* = \arg\max_{\pi \in \mathcal{P}(\mathcal{A})^{\mathcal{X}}} V^\pi\,.$$
It can be shown that while the optimisation problem above appears to have multiple objectives (one for each coordinate of $V^\pi$), there is in fact a policy $\pi^* \in \mathcal{P}(\mathcal{A})^{\mathcal{X}}$ that simultaneously maximises all coordinates of $V^\pi$, and that this policy can be taken to be deterministic; that is, for each $x \in \mathcal{X}$, $\pi^*(\cdot|x) \in \mathcal{P}(\mathcal{A})$ attributes probability 1 to a single action. In reinforcement learning in particular, we are often interested in finding, or approximating, $\pi^*$ from direct interaction with the MDP in question via sample trajectories, without knowledge of $P$ or $r$ (and sometimes not even $\mathcal{X}$).

A.2 Metrics

A metric d on a set X is a function d : X × X → [0, ∞) respecting the following axioms for any
x, y, z ∈ X:
1. Identity of indiscernibles: d(x, y) = 0 ⇐⇒ x = y;
2. Symmetry: d(x, y) = d(y, x);
3. Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y).
A pseudometric is similar, but the "identity of indiscernibles" axiom is weakened:
1. x = y =⇒ d(x, y) = 0;
2. d(x, y) = d(y, x);
3. d(x, y) ≤ d(x, z) + d(z, y).
Note that the weakened first condition does allow one to have $d(x, y) = 0$ when $x \neq y$.
A (pseudo)metric space (X, d) is defined as a set X together with a (pseudo)metric d defined on X.

A.3 State similarity and bisimulation metrics

Bisimulation is a fundamental notion of behavioural equivalence introduced by Park and Milner [Mil-
ner, 1989] in the early 1980s in the context of nondeterministic transition systems. The probabilistic
analogue was introduced by Larsen and Skou [1991]. The notion of an equivalence relation is not
suitable to capture the extent to which quantitative systems may resemble each other in behaviour.
To provide a quantitative notion, bisimulation metrics were introduced by Desharnais et al. [1999,
2004] in the context of probabilistic transition systems without rewards. In reinforcement learning
the reward is an important ingredient, accordingly the bisimulation metric for states of MDPs was
introduced by Ferns et al. [2004]. Much work followed this initial introduction of bisimulation metrics
into RL, as described in the main paper. We briefly reviewed the bisimulation metric in Section 2,
and now provide additional detail around some of the key associated mathematical concepts.
Central to the definition of the bisimulation metric is the operator $T_K : \mathbb{M}(\mathcal{X}) \to \mathbb{M}(\mathcal{X})$, defined over $\mathbb{M}(\mathcal{X})$, the space of pseudometrics on $\mathcal{X}$. Pseudometrics were explored in more detail in Section A.2. We now turn to the definition of the operator itself, given by
$$T_K(d)(x, y) = \max_{a \in \mathcal{A}} \big[ |r_x^a - r_y^a| + \gamma W_d(P_x^a, P_y^a) \big]\,,$$
for each $d \in \mathbb{M}(\mathcal{X})$ and each $x, y \in \mathcal{X}$. It can be verified that the function $T_K(d) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfies the properties of a pseudometric, so under this definition $T_K$ does indeed map $\mathbb{M}(\mathcal{X})$ into itself.
The other central mathematical concept underpinning the operator T_k is the Wasserstein distance
W_d using base metric d. W_d is formally a pseudometric over the set of probability distributions
P(X), defined as the solution to an optimisation problem. Specifically, the problem is formulated as
finding an optimal coupling between the two input probability distributions that minimises a notion
of transport cost associated with d. Mathematically, for two probability distributions µ, µ' ∈ P(X),
we have

W_d(µ, µ') = min_{(Z, Z') : Z∼µ, Z'∼µ'} E[d(Z, Z')].

Note that the pair of random variables (Z, Z') attaining the minimum in the above expression will in
general not be independent. That the minimum is actually attained, in the case of a finite set X, can be
seen by expressing the optimisation problem as a linear program. Minima are attained in much more
general settings too; see Villani [2008].
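For concreteness, the coupling formulation on a finite set can be solved directly as a linear program; the sketch below is an illustrative (unoptimised) implementation added here, assuming SciPy's linprog is available and that distributions over the n states are given as length-n probability vectors.

import numpy as np
from scipy.optimize import linprog

def wasserstein(d, mu, mu_prime):
    """W_d(mu, mu'): minimise sum_{i,j} d[i, j] * coupling[i, j] over couplings of (mu, mu')."""
    n = len(mu)
    cost = d.reshape(-1)                      # coupling matrix flattened row-major
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0      # row marginals: sum_j coupling[i, j] = mu[i]
    for j in range(n):
        A_eq[n + j, j::n] = 1.0               # column marginals: sum_i coupling[i, j] = mu_prime[j]
    b_eq = np.concatenate([mu, mu_prime])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method='highs')
    return res.fun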
Finally, the operator T_k can be analysed in a similar way to standard operators in dynamic program-
ming for reinforcement learning. It can be shown that it is a contraction mapping with respect to the
L∞ metric over M(X), and that M(X) is a complete metric space with respect to the same metric
[Ferns et al., 2011]. Thus, by Banach's fixed point theorem, T_k has a unique fixed point in M(X),
and repeated application of T_k to any initial pseudometric will converge to this fixed point.
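Putting the two pieces together, the fixed point can be computed on a small MDP by naive iteration; the sketch below is again only illustrative (it reuses the wasserstein helper and the P, r arrays from the earlier sketches, and scales poorly with the size of the state space).

def bisimulation_metric(P, r, gamma=0.9, tol=1e-6):
    """Iterate T_k(d)(x, y) = max_a [ |r_x^a - r_y^a| + gamma * W_d(P_x^a, P_y^a) ] to its fixed point."""
    num_actions, num_states, _ = P.shape
    d = np.zeros((num_states, num_states))
    while True:
        d_new = np.zeros_like(d)
        for x in range(num_states):
            for y in range(num_states):
                d_new[x, y] = max(
                    abs(r[x, a] - r[y, a]) + gamma * wasserstein(d, P[a, x], P[a, y])
                    for a in range(num_actions))
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new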

A.4 Further details on diffuse and partial metrics

The notion of a distance function having non-zero self-distance was first introduced by Matthews
[1994], who called it a partial metric. We define it below:
Definition A.1. Given a set X , a function d : X × X → R is a partial metric if the following axioms
hold: (i) x = y ⇐⇒ d(x, x) = d(y, y) = d(x, y) for any x, y ∈ X ; (ii) d(x, x) ≤ d(y, x) for
any x, y ∈ X ; (iii) d(x, y) = d(y, x) for any x, y ∈ X ; (iv) d(x, y) ≤ d(x, z) + d(y, z) − d(z, z)
∀x, y, z ∈ X .

This definition was introduced to recover a proper metric from the distance function: that is, given a
partial metric d, one is guaranteed that d̃(x, y) = d(x, y) − (1/2)(d(x, x) + d(y, y)) is a proper metric.
The above definition is still too stringent for the Łukaszyk–Karmowski distance (and hence the MICo
distance), since it fails axiom (iv), as shown in the following counterexample.
Example A.2. The Łukaszyk–Karmowski distance does not satisfy the modified triangle inequality:
let X be [0, 1], and ρ be the Euclidean distance |·|. Let µ, ν be Dirac measures concentrated at 0 and
1, and let η be (1/2)(δ_0 + δ_1). Then one can calculate that d_LK(ρ)(µ, ν) = 1, while d_LK(ρ)(µ, η) +
d_LK(ρ)(ν, η) − d_LK(ρ)(η, η) = 1/2, breaking the inequality.
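On finite supports the Łukaszyk–Karmowski distance is simply an expectation of ρ under the independent coupling, so the numbers in Example A.2 can be checked directly; the snippet below is an illustrative verification added here.

import numpy as np

def d_lk(mu, nu, support):
    """Lukaszyk-Karmowski distance with Euclidean base metric, for distributions on a finite support."""
    costs = np.abs(support[:, None] - support[None, :])
    return float(mu @ costs @ nu)

support = np.array([0.0, 1.0])
mu, nu = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # Dirac measures at 0 and 1
eta = np.array([0.5, 0.5])                            # (1/2)(delta_0 + delta_1)

print(d_lk(mu, nu, support))                                                       # 1.0
print(d_lk(mu, eta, support) + d_lk(nu, eta, support) - d_lk(eta, eta, support))   # 0.5 < 1.0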

This naturally leads us to the notion of diffuse metrics defined in the main paper.

B Proof of Proposition 4.2


Proposition 4.2. The operator T_M^π is a contraction mapping on R^{X×X} with respect to the L∞ norm.

Proof. Let U, U' ∈ R^{X×X}. Then, since the reward terms cancel in the difference,

|(T_M^π U)(x, y) − (T_M^π U')(x, y)| = γ | Σ_{a,b∈A} Σ_{x',y'∈X} π(a|x)π(b|y) P_x^a(x') P_y^b(y') (U − U')(x', y') | ≤ γ ‖U − U'‖_∞

for any x, y ∈ X, as required.
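Since the operator is a γ-contraction, its fixed point U^π can be computed on a small MDP by direct iteration. The sketch below is illustrative only (it reuses the P, r, pi arrays from the earlier policy-evaluation sketch, and assumes the operator takes the form |r_x^π − r_y^π| + γ E_{x'∼P_x^π, y'∼P_y^π}[U(x', y')], consistent with the learning target of Appendix C.2).

import numpy as np

def mico_distance(P, r, pi, gamma=0.9, tol=1e-8):
    """Iterate U(x, y) <- |r_x^pi - r_y^pi| + gamma * E_{x'~P_x^pi, y'~P_y^pi}[U(x', y')]."""
    r_pi = np.sum(pi * r, axis=1)                 # r_x^pi = sum_a pi(a|x) r_x^a
    P_pi = np.einsum('xa,axy->xy', pi, P)         # P_x^pi(x') = sum_a pi(a|x) P_x^a(x')
    reward_diff = np.abs(r_pi[:, None] - r_pi[None, :])
    U = np.zeros_like(reward_diff)
    while True:
        U_new = reward_diff + gamma * P_pi @ U @ P_pi.T   # independent-coupling expectation of U
        if np.max(np.abs(U_new - U)) < tol:
            return U_new
        U = U_new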

C Experimental details
We will first describe the regular network and training setup for these agents so as to facilitate the
description of our loss.

C.1 Baseline network and loss description

The networks used by Dopamine for the ALE consist of 3 convolutional layers followed by two
fully-connected layers (the output of the networks depends on the agent). We denote the output of the
convolutional layers by φω with parameters ω, and the remaining fully connected layers by ψξ with
parameters ξ. Thus, given an input state x (e.g. a stack of 4 Atari frames), the output of the network
is Qξ,ω (x, ·) = ψξ (φω (x)). Two copies of this network are maintained: an online network and a
target network; we will denote the parameters of the target network by ξ¯ and ω̄. During learning, the
parameters of the online network are updated every 4 environment steps, while the target network
parameters are synced with the online network parameters every 8000 environment steps. We refer to
the loss used by the various agents considered as L_TD; for example, for DQN this would be:

L_TD(ξ, ω) := E_{(x,a,r,x')∼D}[ ρ( r + γ max_{a'∈A} Q_{ξ̄,ω̄}(x', a') − Q_{ξ,ω}(x, a) ) ],

where D is a replay buffer with a capacity of 1M transitions, and ρ is the Huber loss.
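For reference when the MICo loss is combined with it below, a minimal NumPy sketch of this loss (illustrative only; q_online and q_target stand in for Q_{ξ,ω}(x, ·) and Q_{ξ̄,ω̄}(x', ·) evaluated on a sampled batch, and terminal-state masking is omitted):

import numpy as np

def huber(delta, kappa=1.0):
    abs_delta = np.abs(delta)
    return np.where(abs_delta <= kappa, 0.5 * delta ** 2, kappa * (abs_delta - 0.5 * kappa))

def td_loss(q_online, q_target, actions, rewards, gamma=0.99):
    """Huber loss between Q(x, a) and the bootstrapped target r + gamma * max_a' Q_target(x', a')."""
    chosen = q_online[np.arange(len(actions)), actions]
    targets = rewards + gamma * np.max(q_target, axis=1)
    return np.mean(huber(targets - chosen))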

C.2 MICo loss description

We will be applying the MICo loss to φω (x). As described in Section 5, we express the distance
between two states as:
U_ω(x, y) = ( ‖φ_ω(x)‖²₂ + ‖φ_ω̄(y)‖²₂ ) / 2 + β θ(φ_ω(x), φ_ω̄(y)),
where θ(φω (x), φω̄ (y)) is the angle between vectors φω (x) and φω̄ (y) and β is a scalar. Note that we
are using the target network for the y representations; this was done for learning stability. We used
β = 0.1 for the results in the main paper, but present some results with different values of β below.
In order to get a numerically stable operation, we implement the angular distance between representa-
tions φω (x) and φω (y) according to the calculations
CS(φ_ω(x), φ_ω(y)) = ⟨φ_ω(x), φ_ω(y)⟩ / ( ‖φ_ω(x)‖ ‖φ_ω(y)‖ ),

θ(φ_ω(x), φ_ω(y)) = arctan2( √(1 − CS(φ_ω(x), φ_ω(y))²), CS(φ_ω(x), φ_ω(y)) ).
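A direct transcription of these two formulas (an illustrative NumPy sketch added here; the released implementation linked in footnote 4 may differ in details such as where numerical-stability constants are placed):

import numpy as np

def cosine_similarity(x, y, eps=1e-9):
    norms = np.linalg.norm(x) * np.linalg.norm(y)
    return np.dot(x, y) / np.maximum(norms, eps)

def angular_distance(x, y):
    cs = np.clip(cosine_similarity(x, y), -1.0, 1.0)
    # arctan2(sqrt(1 - cs^2), cs) behaves better than arccos(cs) when cs is near +/- 1.
    return np.arctan2(np.sqrt(np.maximum(1.0 - cs ** 2, 0.0)), cs)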

Based on Equation (2), our learning target is then (note the target network is used for both representa-
tions here):

T^U_ω̄(r_x, x', r_y, y') = |r_x − r_y| + γ U_ω̄(x', y'),

and the loss is

L_MICo(ω) = E_{⟨x,r_x,x'⟩, ⟨y,r_y,y'⟩ ∼ D}[ ( T^U_ω̄(r_x, x', r_y, y') − U_ω(x, y) )² ].

As mentioned in Section 5, we use the same mini-batch sampled for LTD for computing LMICo .
Specifically, we follow the method introduced by Castro [2020] for constructing new matrices that
allow us to compute the distances between all pairs of sampled states (see code for details on matrix
operations). Our combined loss is then L_α(ξ, ω) = (1 − α) L_TD(ξ, ω) + α L_MICo(ω).
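Shapes aside, the construction can be sketched as follows (illustrative NumPy only, reusing angular_distance from the sketch above; phi_online and phi_target denote φ_ω and φ_ω̄ applied to a batch of m states, and the released code linked in footnote 4 performs the equivalent computation with vectorised matrix operations):

import numpy as np

def all_pairs_distance(phi_online, phi_target, beta=0.1):
    """U_omega(x, y) for all m^2 ordered pairs in a batch: [m, m] matrix from [m, d] representations."""
    sq_norms = 0.5 * (np.sum(phi_online ** 2, axis=1)[:, None] +
                      np.sum(phi_target ** 2, axis=1)[None, :])
    angles = np.array([[angular_distance(u, v) for v in phi_target] for u in phi_online])
    return sq_norms + beta * angles

def mico_loss(U_online, rewards, U_target_next, gamma=0.99):
    """Squared error between U_omega(x, y) and |r_x - r_y| + gamma * U_target(x', y') over all pairs."""
    reward_diffs = np.abs(rewards[:, None] - rewards[None, :])
    targets = reward_diffs + gamma * U_target_next
    return np.mean((targets - U_online) ** 2)

Here U_online would be computed from the current states with the online and target encoders, and U_target_next from the next states with the target encoder only, matching the target T^U_ω̄ above.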

C.3 Hyperparameters for soft actor-critic

We re-implemented the DBC algorithm from Zhang et al. [2021] on top of the Soft Actor-Critic
algorithm [Haarnoja et al., 2018] provided by the Dopamine library [Castro et al., 2018]. We
compared the following algorithms, using the same hyperparameters for all4 :

1. SAC: This is Soft Actor-Critic [Haarnoja et al., 2018] with the convolutional encoder described
by Yarats et al. [2019].
2. DBC: This is DBC, based on SAC, as described by Zhang et al. [2021].
3. DBC-Det: In the code provided by Zhang et al. [2021], the default setting was to assume
deterministic transitions (which is an easier dynamics model to learn), so we decided to compare
against this version as well. It is interesting to note that the performance is roughly the same as
for DBC.
4. MICo: This is a modified version of SAC, adding the MICo loss to the output of the encoder. Note
that the encoder output is the same one used by DBC for their dynamics and reward models.
5. DBC+MICo: Instead of using the bisimulation loss of Zhang et al. [2021], which relies on the
learned dynamics model, we use our MICo loss. We kept all other components untouched (so a
dynamics and reward model were still being learned).

It is worth noting that some of the hyperparameters we used differ from those listed in the code
provided by Zhang et al. [2021]; in our experiments the hyperparameters for all agents are based on
the default SAC hyperparameters in the Dopamine library [Castro et al., 2018].
For the ALE experiments we used the “squaring” of the sampled batches introduced by Castro [2020]
(where all pairs of sampled states are considered). However, the implementation provided by Zhang
et al. [2021] instead created a copy of the sampled batch of transitions and shuffled them; we chose
to follow this setup for the SAC-based experiments. Thus, while in the ALE experiments we are
comparing m² pairs of states (where m is the batch size) at each training step, in the SAC-based
experiments we are only comparing m pairs of states.
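The two pairing schemes differ only in how the second element of each pair is chosen; schematically (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
m = 32
indices = np.arange(m)

# ALE experiments: "squared" batch, all m^2 ordered pairs of sampled states.
pairs_squared = [(i, j) for i in indices for j in indices]

# SAC-based experiments: pair each transition with one from a shuffled copy, m pairs in total.
pairs_shuffled = list(zip(indices, rng.permutation(indices)))

assert len(pairs_squared) == m ** 2 and len(pairs_shuffled) == m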
The aggregate results are displayed in Figure 1, and per-environment results in Figure 14 below.

D Additional experimental results

D.1 Additional state feature results

The results shown in Figure 4 are on the well-known four-rooms GridWorld [Sutton et al., 1999]. We
provide extra experiments in Figure 5.

D.2 Complete ALE experiments

We additionally provide complete results for all the agents in Figure 8, Figure 9, Figure 10, Figure 11,
and Figure 12.

4 See https://github.com/google-research/google-research/tree/master/mico for all hyperparameter settings.

[Figure 5 plots omitted: two panels of linear regression average error vs. number of features, with curves for the metrics d, U, PVF, and RF.]
Figure 5: Average error when performing linear regression on varying numbers of features on the
mirrored rooms introduced by Castro [2020] (left) and the grid task introduced by Dayan [1993]
(right). Averaged over 10 runs; shaded areas represent 95% confidence intervals.

[Figure 6 plots omitted: mean and median human normalized scores vs. Frames (M), comparing DQN, Rainbow, QR-DQN, IQN, and M-IQN with the regular loss and with + MICo.]

Figure 6: Mean (left) and median (right) human normalized scores across 60 Atari 2600 games,
averaged over 5 independent runs.
D.3 Sweep over α and β values

In Figure 13 we demonstrate the performance of the MICo loss when added to Rainbow over a
number of different values of α and β. For each agent, we ran a similar hyperparameter sweep over
α and β on the same six games displayed in Figure 13 to determine settings to be used in the full
ALE experiments.

D.4 Complete DM-Control results

Full per-environment results are provided in Figure 14.

D.5 Compute time and infrastructure

For Figure 4, each run took approximately 10 minutes. For Figure 4 and Figure 5, the running time
varied per environment and per metric, but a conservative estimate is 30 minutes per run. All
GPU experiments were run on NVIDIA Tesla P100 GPUs. Each Atari game takes approximately 5
days (300 hours) to run for 200M frames.

[Figure 7 bar plots omitted. Panel titles, from top to bottom:
Human normalized DQN + MICo improvement over DQN (26.51 avg. improvement, 41/60 games improved);
Human normalized Rainbow + MICo improvement over Rainbow (30.73 avg. improvement, 41/60 games improved);
Human normalized QR-DQN + MICo improvement over QR-DQN (71.81 avg. improvement, 40/60 games improved);
Human normalized IQN + MICo improvement over IQN (7.88 avg. improvement, 33/60 games improved);
Human normalized M-IQN + MICo improvement over M-IQN (12.73 avg. improvement, 33/60 games improved).]
Figure 7: From top to bottom, percentage improvement in returns (averaged over the last 5 iterations)
when adding L_MICo to DQN, Rainbow, QR-DQN, IQN, and M-IQN. The results are averaged
over 5 independent runs.
[Figure 8 plots omitted: per-game training curves (Return vs. Frames (x1M)) comparing DQN and DQN + MICo on each of the 60 Atari 2600 games.]

Figure 8: Training curves for DQN agents. The results for all games and agents are over 5 independent
runs, and shaded regions report 75% confidence intervals.

[Figure 9 plots omitted: per-game training curves (Return vs. Frames (x1M)) comparing Rainbow and Rainbow + MICo on each of the 60 Atari 2600 games.]

Figure 9: Training curves for Rainbow agents. The results for all games and agents are over 5
independent runs, and shaded regions report 75% confidence intervals.

[Figure 10 plots omitted: per-game training curves (Return vs. Frames (x1M)) comparing QR-DQN and QR-DQN + MICo on each of the 60 Atari 2600 games.]

Figure 10: Training curves for QR-DQN agents. The results for all games and agents are over 5
independent runs, and shaded regions report 75% confidence intervals.

[Figure 11 plots omitted: per-game training curves (Return vs. Frames (x1M)) comparing IQN and IQN + MICo on each of the 60 Atari 2600 games.]

Figure 11: Training curves for IQN agents. The results for all games and agents are over 5 independent
runs, and shaded regions report 75% confidence intervals.

[Figure 12 plots omitted: per-game training curves (Return vs. Frames (x1M)) comparing M-IQN and M-IQN + MICo on each of the 60 Atari 2600 games.]

Figure 12: Training curves for M-IQN agents. The results for all games and agents are over 5
independent runs, and shaded regions report 75% confidence intervals.

[Figure 13 plots omitted: training curves (Return vs. Frames (M)) on Asterix, Breakout, Frostbite, MsPacman, Seaquest, and YarsRevenge for Rainbow + MICo with α ∈ {0.01, 0.1, 0.5, 0.9} and β ∈ {0.1, 1.0, 2.0, 10.0}.]

Figure 13: Sweeping over various values of α and β when adding the MICo loss to Rainbow. The
grey line represents regular Rainbow.

[Figure 14 plots omitted: training curves (Return vs. Environment steps (x 1K)) for SAC, DBC, DBC-Det, DBC+MICo, and MICo on cartpole_balance, cartpole_balance_sparse, cartpole_swingup, cheetah_run, finger_spin, hopper_stand, hopper_hop, pendulum_swingup, point_mass_easy, walker_stand, walker_walk, and walker_run.]

Figure 14: Comparison of all agents on twelve environments from the DM-Control suite. Each algorithm and
environment was run for five independent seeds, and the shaded areas report 75% confidence
intervals.
