
An Optimistic Perspective on Offline Reinforcement Learning

Rishabh Agarwal 1 * Dale Schuurmans 1 2 Mohammad Norouzi 1

arXiv:1907.04543v3 [cs.LG] 6 Mar 2020

Abstract

Off-policy reinforcement learning (RL) using a fixed offline dataset of logged interactions is an important consideration in real world applications. This paper studies offline RL using the DQN replay dataset comprising the entire replay experience of a DQN agent on 60 Atari 2600 games. We demonstrate that recent off-policy deep RL algorithms, even when trained solely on this replay dataset, outperform the fully trained DQN agent. To enhance generalization in the offline setting, we present Random Ensemble Mixture (REM), a robust Q-learning algorithm that enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. Offline REM trained on the DQN replay dataset surpasses strong RL baselines. The results here present an optimistic view that robust RL algorithms trained on sufficiently large and diverse offline datasets can lead to high quality policies. To enable offline optimization of RL algorithms on a common ground and to reproduce our results, the DQN replay dataset is released at offline-rl.github.io.

1 Introduction

One of the main reasons behind the success of deep learning (LeCun et al., 2015) is the availability of large and diverse datasets such as ImageNet (Deng et al., 2009) to train expressive deep neural networks. By contrast, most reinforcement learning (RL) algorithms (Sutton and Barto, 2018) assume that an agent interacts with an online environment or simulator and learns from its own collected experience. This limits online RL's applicability to complex real world problems, where active data collection means gathering large amounts of diverse experiences from scratch per experiment, which can be expensive, unsafe, or require a high-fidelity simulator that is often difficult to build (Dulac-Arnold et al., 2019).

Offline RL (Lange et al., 2012) concerns the problem of learning a policy from a fixed dataset of trajectories, without any further interactions with the environment. This setting can leverage the vast amount of existing logged interactions for real world decision-making problems such as robotics (Cabi et al., 2019; Dasari et al., 2019), autonomous driving (Yu et al., 2018), recommendation systems (Strehl et al., 2010; Bottou et al., 2013), and healthcare (Shortreed et al., 2011). The effective use of such datasets would not only make real-world RL more practical, but would also enable better generalization by incorporating diverse prior experiences.

In offline RL, an agent does not receive any new corrective feedback from the online environment and needs to generalize from a fixed dataset of interactions to new online interactions during evaluation. In principle, off-policy algorithms can learn from data collected by any policy; however, recent work (Fujimoto et al., 2019b; Kumar et al., 2019; Wu et al., 2019; Siegel et al., 2020) presents a discouraging view that standard off-policy deep RL algorithms diverge or otherwise yield poor performance in the offline setting. Such papers propose remedies by regularizing the learned policy to stay close to the training dataset of offline trajectories. Furthermore, Zhang and Sutton (2017) assert that a large replay buffer can even hurt the performance of off-policy algorithms due to its "off-policyness".

By contrast, this paper presents an optimistic perspective on offline RL: with sufficiently large and diverse datasets, robust RL algorithms, without an explicit correction for distribution mismatch, can result in high quality policies. The contributions of this paper can be summarized as:

• An offline RL setup is proposed for evaluating algorithms on Atari 2600 games (Bellemare et al., 2013), based on the logged replay data of a DQN agent (Mnih et al., 2015) comprising 50 million (observation, action, reward, next observation) tuples per game. This setup reduces the computation cost of the experiments considerably and helps improve reproducibility by standardizing training using a fixed dataset. The DQN replay dataset and our code1 are released to enable offline optimization of RL algorithms on a common ground.

*Work done as part of the Google AI Residency Program. 1Google Research, Brain Team. 2University of Alberta. Correspondence to: Rishabh Agarwal <rishabhagarwal@google.com>, Mohammad Norouzi <mnorouzi@google.com>.

1Refer to github.com/google-research/batch_rl for our code.

[Figure 1: two panels vs. training iteration, (a) median normalized scores and (b) number of games superior to DQN, for offline REM, offline QR-DQN, offline DQN (Nature), online DQN, and online C51.]
Figure 1: Offline RL on Atari. (a) Median normalized evaluation scores averaged over 5 runs (shown as traces) across stochastic version of 60 Atari 2600 games of offline agents trained using the DQN replay dataset. (b) Number of games where an offline agent achieves a higher score than fully-trained DQN (Nature) as a function of training iterations. Each iteration corresponds to 1 million training frames. Offline REM outperforms offline QR-DQN and DQN (Nature). The comparison with online C51 gives the reader a sense of the magnitude of the improvement from offline agents over the best policy in the entire DQN replay dataset.

• Contrary to recent work, we show that recent off-policy RL algorithms trained solely on offline data can be successful. For instance, offline QR-DQN (Dabney et al., 2018) trained on the DQN replay dataset outperforms the best policy in the DQN replay dataset. This discrepancy could be attributed to the differences in offline dataset size and composition as well as choice of RL algorithm.

• A robust Q-learning algorithm called Random Ensemble Mixture (REM) is presented, which enforces optimal Bellman consistency on random convex combinations of multiple Q-value estimates. Offline REM shows strong generalization performance in the offline setting, and outperforms offline QR-DQN. The comparison with online C51 (Bellemare et al., 2017), a strong RL baseline, illustrates the relative size of the gains from exploitation of the logged DQN data with REM.

2 Off-policy Reinforcement Learning

An interactive environment in reinforcement learning (RL) is typically described as a Markov decision process (MDP) (S, A, R, P, γ) (Puterman, 1994), with a state space S, an action space A, a stochastic reward function R(s, a), transition dynamics P(s′ | s, a) and a discount factor γ ∈ [0, 1). A stochastic policy π(· | s) maps each state s ∈ S to a distribution (density) over actions.

For an agent following the policy π, the action-value function, denoted Qπ(s, a), is defined as the expectation of cumulative discounted future rewards, i.e.,

    Qπ(s, a) := E[ Σ_{t=0}^∞ γ^t R(s_t, a_t) ],    (1)
    s_0 = s, a_0 = a, s_t ∼ P(· | s_{t−1}, a_{t−1}), a_t ∼ π(· | s_t).

The goal of RL is to find an optimal policy π* that attains maximum expected return, for which Qπ*(s, a) ≥ Qπ(s, a) for all π, s, a. The Bellman optimality equations (Bellman, 1957) characterize the optimal policy in terms of the optimal Q-values, denoted Q* = Qπ*, via:

    Q*(s, a) = E[ R(s, a) + γ E_{s′∼P} max_{a′∈A} Q*(s′, a′) ].    (2)

To learn a policy from interaction with the environment, Q-learning (Watkins and Dayan, 1992) iteratively improves an approximate estimate of Q*, denoted Qθ, by repeatedly regressing the LHS of (2) to target values defined by samples from the RHS of (2). For large and complex state spaces, approximate Q-values are obtained using a neural network as the function approximator. To further stabilize optimization, a target network Qθ′ with frozen parameters may be used for computing the learning target (Mnih et al., 2013). The target network parameters θ′ are updated to the current Q-network parameters θ after a fixed number of time steps.

DQN (Mnih et al., 2013; 2015) parameterizes Qθ with a convolutional neural network (LeCun et al., 1998) and uses Q-learning with a target network while following an ε-greedy policy with respect to Qθ for data collection. DQN minimizes the temporal difference (TD) error ∆θ using the loss L(θ) on mini-batches of the agent's past experience tuples, (s, a, r, s′), sampled from an experience replay buffer D (Lin, 1992) collected during training:

    L(θ) = E_{s,a,r,s′∼D}[ ℓλ(∆θ(s, a, r, s′)) ],    (3)
    ∆θ(s, a, r, s′) = Qθ(s, a) − r − γ max_{a′} Qθ′(s′, a′),

where ℓλ is the Huber loss (Huber, 1964) given by

    ℓλ(u) = (1/2) u²              if |u| ≤ λ,
            λ(|u| − (1/2) λ)      otherwise.    (4)

Q-learning is an off-policy algorithm (Sutton and Barto, 2018) since the learning target can be computed without any consideration of how the experience was generated.
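To make (3) and (4) concrete, the following minimal NumPy sketch computes the Huber-smoothed TD error on a mini-batch of (s, a, r, s′) tuples. It is an illustration only: the q_online and q_target callables stand in for the online and frozen target Q-networks, and the terminal-state mask is an implementation detail not written out in (3).

```python
import numpy as np

def huber(u, lam=1.0):
    """Huber loss l_lambda(u) from Eq. (4), applied elementwise."""
    return np.where(np.abs(u) <= lam, 0.5 * u ** 2, lam * (np.abs(u) - 0.5 * lam))

def dqn_loss(q_online, q_target, batch, gamma=0.99, lam=1.0):
    """Mini-batch estimate of the DQN loss in Eq. (3).

    q_online(s), q_target(s): arrays of shape [batch_size, num_actions]
    giving Q_theta(s, .) and the frozen target Q_theta'(s, .);
    a is an integer action array, terminal a 0/1 float array.
    """
    s, a, r, s_next, terminal = batch
    q_sa = q_online(s)[np.arange(len(a)), a]             # Q_theta(s, a)
    bootstrap = q_target(s_next).max(axis=1)             # max_a' Q_theta'(s', a')
    td_error = q_sa - (r + gamma * (1.0 - terminal) * bootstrap)  # Delta_theta
    return huber(td_error, lam).mean()
```

In DQN the gradient of this loss is taken only through Qθ(s, a); the bootstrapped target is treated as a constant.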

A family of recent off-policy deep RL algorithms, which serve as a strong baseline in this paper, include Distributional RL (Bellemare et al., 2017; Jaquette, 1973) methods. Such algorithms estimate a density over returns for each state-action pair, denoted Zπ(s, a), instead of directly estimating the mean Qπ(s, a). Accordingly, one can express a form of distributional Bellman optimality as

    Z*(s, a) =_D r + γ Z*(s′, argmax_{a′∈A} Q*(s′, a′)),  where r ∼ R(s, a), s′ ∼ P(· | s, a),    (5)

and =_D denotes distributional equivalence and Q*(s′, a′) is estimated by taking an expectation with respect to Z*(s′, a′). C51 (Bellemare et al., 2017) approximates Z*(s, a) by using a categorical distribution over a set of pre-specified anchor points, and distributional QR-DQN (Dabney et al., 2018) approximates the return density by using a uniform mixture of K Dirac delta functions, i.e.,

    Zθ(s, a) := (1/K) Σ_{i=1}^K δ_{θ_i(s,a)},    Qθ(s, a) = (1/K) Σ_{i=1}^K θ_i(s, a).

QR-DQN outperforms C51 and DQN and obtains state-of-the-art results on Atari 2600 games, among agents that do not exploit n-step updates (Sutton, 1988) and prioritized replay (Schaul et al., 2016). This paper avoids using n-step updates and prioritized replay to keep the empirical study simple and focused on deep Q-learning algorithms.

3 Offline Reinforcement Learning

Modern off-policy deep RL algorithms (as discussed above) perform remarkably well on common benchmarks such as the Atari 2600 games (Bellemare et al., 2013) and continuous control MuJoCo tasks (Todorov et al., 2012). Such off-policy algorithms are considered "online" because they alternate between optimizing a policy and using that policy to collect more data. Typically, these algorithms keep a sliding window of the most recent experiences in a finite replay buffer (Lin, 1992), throwing away stale data to incorporate the freshest (on-policy) experiences.

Offline RL, in contrast to online RL, describes the fully off-policy setting of learning using a fixed dataset of experiences, without any further interaction with the environment. We advocate the use of offline RL to help isolate an RL algorithm's ability to exploit experience and generalize vs. its ability to explore effectively. The offline RL setting removes design choices related to the replay buffer and exploration; therefore, it is simpler to experiment with and reproduce than the typical off-policy setting.

Offline RL is considered challenging due to the distribution mismatch between the current policy and the offline data collection policy, i.e., when the policy being learned takes a different action than the data collection policy, we don't know the reward it would have gotten. This paper revisits offline RL and investigates whether off-policy deep RL agents trained solely on offline data can be successful without correcting for distribution mismatch.

4 Developing Robust Offline RL Algorithms

In an online RL setting, an agent can acquire on-policy data from the environment, which ensures a virtuous cycle where the agent chooses actions that it thinks will lead to high rewards and then receives feedback to correct its errors. Since it is not possible to collect additional data in the offline RL setting, it is necessary to reason about generalization using the fixed dataset. We investigate whether one can design robust RL algorithms with an emphasis on improving generalization in the offline setting. Ensembling is commonly used in supervised learning to improve generalization. In this paper, we study two deep Q-learning algorithms, Ensemble-DQN and REM, which adopt ensembling to improve stability.

4.1 Ensemble-DQN

Ensemble-DQN is a simple extension of DQN that approximates the Q-values via an ensemble of parameterized Q-functions (Faußer and Schwenker, 2015; Osband et al., 2016; Anschel et al., 2017). Each Q-value estimate, denoted Q^k_θ(s, a), is trained against its own target Q^k_θ′(s, a), similar to Bootstrapped-DQN (Osband et al., 2016). The Q-functions are optimized using identical mini-batches in the same order, starting from different parameter initializations. The loss L(θ) takes the form

    L(θ) = (1/K) Σ_{k=1}^K E_{s,a,r,s′∼D}[ ℓλ(∆^k_θ(s, a, r, s′)) ],    (6)
    ∆^k_θ(s, a, r, s′) = Q^k_θ(s, a) − r − γ max_{a′} Q^k_θ′(s′, a′),

where ℓλ is the Huber loss. While Bootstrapped-DQN uses one of the Q-value estimates in each episode to improve exploration, in the offline setting we are only concerned with the ability of Ensemble-DQN to exploit better, and we use the mean of the Q-value estimates for evaluation.

Figure 2: Neural network architectures for DQN, distributional QR-DQN and the proposed expected RL variants, i.e., Ensemble-DQN
and REM, with the same multi-head architecture as QR-DQN. The individual Q-heads share all of the neural network layers except the
final fully connected layer.

4.2 Random Ensemble Mixture (REM)

Increasing the number of models used for ensembling typically improves the performance of supervised learning models (Shazeer et al., 2017). This raises the question whether one can use an ensemble over an exponential number of Q-estimates in a computationally efficient manner. Inspired by dropout (Srivastava et al., 2014), we propose Random Ensemble Mixture for off-policy RL.

Random Ensemble Mixture (REM) uses multiple parameterized Q-functions to estimate the Q-values, similar to Ensemble-DQN. The key insight behind REM is that one can think of a convex combination of multiple Q-value estimates as a Q-value estimate itself. This is especially true at the fixed point, where all of the Q-value estimates have converged to an identical Q-function. Using this insight, we train a family of Q-function approximators defined by mixing probabilities on a (K − 1)-simplex.

Specifically, for each mini-batch, we randomly draw a categorical distribution α, which defines a convex combination of the K estimates to approximate the optimal Q-function. This approximator is trained against its corresponding target to minimize the TD error. The loss L(θ) takes the form

    L(θ) = E_{s,a,r,s′∼D}[ E_{α∼P∆}[ ℓλ(∆^α_θ(s, a, r, s′)) ] ],    (7)
    ∆^α_θ = Σ_k α_k Q^k_θ(s, a) − r − γ max_{a′} Σ_k α_k Q^k_θ′(s′, a′),

where P∆ represents a probability distribution over the standard (K − 1)-simplex ∆^{K−1} = {α ∈ R^K : α_1 + α_2 + · · · + α_K = 1, α_k ≥ 0, k = 1, . . . , K}.

REM considers Q-learning as a constraint satisfaction problem based on Bellman optimality constraints (2), and L(θ) can be viewed as an infinite set of constraints corresponding to different mixture probability distributions. For action selection, we use the average of the K value estimates as the Q-function, i.e., Q(s, a) = Σ_k Q^k_θ(s, a)/K. REM is easy to implement and analyze (see Proposition 1), and can be viewed as a simple regularization technique for value-based RL. In our experiments, we use a very simple distribution P∆: we first draw a set of K values i.i.d. from Uniform(0, 1) and normalize them to get a valid categorical distribution, i.e., α′_k ∼ U(0, 1) followed by α_k = α′_k / Σ_i α′_i.

Proposition 1. Consider the assumptions: (a) The distribution P∆ has full support over the entire (K − 1)-simplex. (b) Only a finite number of distinct Q-functions globally minimize the loss in (3). (c) Q* is defined in terms of the MDP induced by the data distribution D. (d) Q* lies in the family of our function approximation. Then, at the global minimum of L(θ) (7) for a multi-head Q-network:

(i) Under assumptions (a) and (b), all the Q-heads represent identical Q-functions.
(ii) Under assumptions (a)–(d), the common global solution is Q*.

The proof of (ii) follows from (i) and the fact that (7) is lower bounded by the TD error attained by Q*. The proof of part (i) can be found in the supplementary material.
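The following minimal sketch illustrates the REM update in (7) under the simple P∆ described above, drawing a single α per mini-batch as a one-sample estimate of the inner expectation; as in the earlier sketches, the per-head callables and the terminal mask are our own illustrative conventions.

```python
import numpy as np

def sample_alpha(k, rng):
    """Draw alpha from P_Delta: K i.i.d. Uniform(0, 1) draws, normalized."""
    alpha = rng.uniform(size=k)
    return alpha / alpha.sum()

def rem_loss(q_heads, q_target_heads, batch, rng, gamma=0.99, lam=1.0):
    """One-sample (in alpha) estimate of the REM loss in Eq. (7)."""
    s, a, r, s_next, terminal = batch
    alpha = sample_alpha(len(q_heads), rng)               # one alpha per mini-batch
    # Convex combinations of the K online and target Q-heads.
    q_mix = sum(w * q(s) for w, q in zip(alpha, q_heads))
    q_mix_next = sum(w * q(s_next) for w, q in zip(alpha, q_target_heads))
    q_sa = q_mix[np.arange(len(a)), a]
    target = r + gamma * (1.0 - terminal) * q_mix_next.max(axis=1)
    u = q_sa - target                                     # Delta^alpha_theta
    return np.where(np.abs(u) <= lam,
                    0.5 * u ** 2, lam * (np.abs(u) - 0.5 * lam)).mean()
```

For action selection and evaluation, the uniform average of the K heads is used, i.e., Q(s, a) = Σ_k Q^k_θ(s, a)/K.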

[Figure 3: average online score vs. training iteration on Asterix, Breakout, Pong, Q*Bert and Seaquest for offline DQN (Nature) and offline C51.]
Figure 3: Offline C51 vs. DQN (Nature). Average online scores of C51 and DQN (Nature) agents trained offline on stochastic version of 5 Atari 2600 games using the DQN replay dataset for the same number of gradient steps as the online DQN agent. The scores are averaged over 5 runs (shown as traces) and smoothed over a sliding window of 5 iterations, and error bands show standard deviation. The horizontal line shows the performance of fully-trained DQN.
[Figure 4: two panels of per-game normalized % improvement over online DQN (log scale) across 60 Atari 2600 games, for (a) offline DQN (Nature) and (b) offline QR-DQN.]
Figure 4: Offline QR-DQN vs. DQN (Nature). Normalized performance improvement (in %) over fully-trained online DQN (Nature), per game, of (a) offline DQN (Nature) and (b) offline QR-DQN trained using the DQN replay dataset for the same number of gradient updates as online DQN. The normalized online score for each game is 100% and 0% for online DQN and random agents respectively.

5 Offline RL on Atari 2600 Games

To facilitate a study of offline RL, we train several instances of DQN (Nature) agents (Mnih et al., 2015) on stochastic version of 60 Atari 2600 games for 200 million frames each, with a frame skip of 4 (standard protocol) and sticky actions enabled (Machado et al., 2018). Sticky actions induce stochasticity since there is some probability that the previous action will be repeated instead of the current action chosen by the agent. On each game, we train 5 different agents with random initialization, and store all of the (observation, action, reward, next observation) tuples encountered during training into 5 replay datasets of 50 million tuples each. In terms of number of images, each replay dataset is approximately 3.5 times larger than ImageNet (Deng et al., 2009), a large-scale deep learning dataset. Additionally, each replay dataset includes samples from all of the intermediate (diverse) policies seen during the optimization of the online DQN (Nature) agent.

Experiment Setup. The DQN replay dataset is used for training off-policy RL agents, offline, without any interaction with the environment during training. We use the hyperparameters provided in the Dopamine baselines (Castro et al., 2018) for a standardized comparison (Section A.4) and report game scores using a normalized scale (Section A.3). As the DQN replay dataset is collected using a diverse mixture of policies, we compare the performance of offline agents against the best performing policy in this mixture (i.e., fully-trained online DQN). The evaluation of the offline agents is done online a limited number of times (at intervals of 1 million training frames), and we report the best evaluation score for each agent, averaged over 5 runs (one run per replay dataset) for each game.

5.1 Can standard off-policy RL algorithms with no environment interactions succeed?

Given the logged replay data of the DQN agent, it is natural to ask how well an offline variant of DQN trained solely on this dataset would perform, and whether more recent off-policy algorithms are able to exploit the DQN replay dataset more effectively than offline DQN. To investigate these questions, we train DQN (Nature) and QR-DQN agents, offline, on the DQN replay dataset for the same number of gradient updates as online DQN.

Figure 4 shows that offline DQN underperforms fully-trained online DQN on all except a few games, where it achieves much higher scores than online DQN with the same amount of data and gradient updates. Offline QR-DQN, on the other hand, outperforms offline DQN and online DQN on most of the games (refer to Figure A.4 for learning curves). Offline C51 trained using the DQN replay dataset also considerably improves upon offline DQN (Figure 3). These results demonstrate that it is possible to optimize strong Atari agents offline using standard deep RL algorithms on the DQN replay dataset without constraining the learned policy to stay close to the training dataset of offline trajectories. Furthermore, the disparity between the performance of offline QR-DQN/C51 and DQN (Nature) indicates the difference in their ability to exploit off-policy data.

5.2 Asymptotic performance of offline RL agents

In supervised learning, asymptotic performance matters more than performance within a fixed budget of gradient updates. Similarly, for a given sample complexity, we prefer RL algorithms that perform the best as long as the number of gradient updates is feasible. Since the sample efficiency for an offline dataset is fixed, we train offline agents for five times as many gradient updates as DQN.
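Concretely, offline training in this setup amounts to repeatedly sampling mini-batches from the fixed replay dataset and taking gradient steps on one of the losses above, with periodic target-network syncs and periodic online evaluation. The sketch below is schematic: the `agent` and `dataset` interfaces are our own abstractions rather than the Dopamine API, while the iteration, step, target-update and batch-size values follow Table 2.

```python
def train_offline(agent, dataset, num_iterations=200,
                  steps_per_iteration=250_000, target_update_period=2_000):
    """Schematic offline training loop: no environment interaction while training.

    dataset.sample(batch_size) returns a mini-batch of logged
    (s, a, r, s', terminal) tuples; agent.train_step applies one gradient
    update of the chosen loss (DQN, Ensemble-DQN or REM).
    """
    scores = []
    for iteration in range(num_iterations):
        for step in range(steps_per_iteration):
            batch = dataset.sample(batch_size=32)
            agent.train_step(batch)
            if (iteration * steps_per_iteration + step) % target_update_period == 0:
                agent.sync_target()              # copy theta -> theta'
        scores.append(agent.evaluate())          # online evaluation only
    return scores
```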

Table 1: Asymptotic performance of offline agents. Median normalized scores (averaged over 5 runs) across 60 stochastic Atari 2600 games, and number of games where an offline agent trained using the DQN replay dataset achieves better scores than a fully-trained DQN (Nature) agent. Offline DQN (Adam) significantly outperforms DQN (Nature) and needs further investigation. Offline REM, by exploiting the logged DQN data effectively, surpasses the gains (119%) from fully-trained C51, a strong online agent.

Offline agent            Median    > DQN
DQN (Adam)               111.9%    41
Ensemble-DQN             111.0%    39
Averaged Ensemble-DQN    112.1%    43
QR-DQN                   118.9%    45
REM                      123.8%    49

Comparison with QR-DQN. QR-DQN modifies the DQN (Nature) architecture to output K values for each action using a multi-head Q-network and replaces RMSProp (Tieleman and Hinton, 2012) with Adam (Kingma and Ba, 2015) for optimization. To ensure a fair comparison with QR-DQN, we use the same multi-head Q-network as QR-DQN with K = 200 heads (Figure 2), where each head represents a Q-value estimate for REM and Ensemble-DQN. We also use Adam for optimization.

Additional Baselines. To isolate the gains due to Adam in QR-DQN and our proposed variants, we compare against a DQN baseline which uses Adam. We also evaluate Averaged Ensemble-DQN, a variant of Ensemble-DQN proposed by Anschel et al. (2017), which uses the average of the predicted target Q-values as the Bellman target for training each parameterized Q-function. This baseline determines whether the random combinations of REM provide any significant benefit over simply using an ensemble of predictors to stabilize the Bellman target.

Table 1 shows the comparison of baselines with REM and Ensemble-DQN. Surprisingly, DQN with Adam noticeably bridges the gap in asymptotic performance between QR-DQN and DQN (Nature) in the offline setting (see Figure 1 and Table 3 in the appendix). Offline Ensemble-DQN does not improve upon this strong DQN baseline, showing that its naive ensembling approach is inadequate. Furthermore, Averaged Ensemble-DQN performs only slightly better than Ensemble-DQN. In contrast, REM exploits offline data more effectively than other agents, including QR-DQN, when trained for more gradient updates. The gains from REM over Averaged Ensemble-DQN suggest that the effectiveness of REM is due to the noise from randomly ensembling Q-value estimates leading to more robust training, analogous to dropout. Consistent with this hypothesis, we also find that offline REM with separate Q-networks (with more variation in individual Q-estimates) performs better asymptotically and learns faster than a multi-head Q-network (Figure A.3).

Figure 5: Online RL. Median normalized evaluation scores averaged over 5 runs (shown as traces) across stochastic version of 60 Atari 2600 games of online agents trained for 200 million frames (standard protocol). Online REM with 4 Q-networks performs comparably to online QR-DQN. Please refer to Figure A.6 for learning curves.

5.3 Does REM work in the online setting?

In online RL, learning and data generation are tightly coupled, i.e., an agent that learns faster also collects more relevant data. We ran online REM with K separate Q-networks (with K = 4 for computational efficiency) because of the better convergence speed over multi-head REM in the offline setting. For data collection, we use ε-greedy with a randomly sampled Q-estimate from the simplex for each episode, similar to Bootstrapped DQN. We follow the standard online RL protocol on Atari and use a fixed replay buffer of 1M frames.

To estimate the gains from the REM objective (7) in the online setting, we also evaluate Bootstrapped-DQN with identical modifications (e.g., separate Q-networks) as online REM. Figure 5 shows that REM performs on par with QR-DQN and considerably outperforms Bootstrapped-DQN. This shows that we can use the insights gained from the offline setting with appropriate design choices (e.g., exploration, replay buffer) to create effective online methods.

6 Important Factors in Offline RL

Dataset Size and Composition. Our offline learning results indicate that 50 million tuples per game from DQN (Nature) are sufficient to obtain good online performance on most of the Atari 2600 games. We hypothesize that the size of the fixed replay and its composition (De Bruin et al., 2015) play a key role in the success of standard RL algorithms trained offline on the DQN replay dataset.

To study the role of the replay dataset size, we perform an ablation experiment with variable replay size. We train offline QR-DQN and REM with reduced data obtained via randomly subsampling entire trajectories from the logged DQN experiences, thereby maintaining the same data distribution. Figure 6 presents the performance of the offline REM and QR-DQN agents with N% of the tuples in the DQN replay dataset, where N ∈ {1, 10, 20, 50, 100}.
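A minimal sketch of the trajectory-level subsampling used for this ablation; the `logged_episodes` name and helper are our own, and keeping a fraction of whole episodes only approximately yields N% of the tuples.

```python
import numpy as np

def subsample_trajectories(episodes, fraction, seed=0):
    """Keep a random `fraction` of whole trajectories (episodes) so that the
    underlying data distribution is preserved, as in the dataset-size ablation.

    `episodes` is a list where each element is the list of transitions
    recorded for one episode of the logged DQN run.
    """
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(fraction * len(episodes))))
    keep = rng.choice(len(episodes), size=n_keep, replace=False)
    return [episodes[i] for i in keep]

# e.g., an N = 10% subset as used in Figure 6 (illustrative only):
# replay_10pct = subsample_trajectories(logged_episodes, fraction=0.10)
```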

[Figure 6: normalized score vs. fraction of data (1%, 10%, 20%, 50%, 100%) on Asterix, Breakout, Pong, Q*Bert and Seaquest for online DQN, offline QR-DQN and offline REM.]
Figure 6: Effect of Offline Dataset Size. Normalized scores (averaged over 5 runs) of QR-DQN and multi-head REM trained offline on stochastic version of 5 Atari 2600 games for 5X gradient steps using only a fraction of the entire DQN replay dataset (200M frames) obtained via randomly subsampling trajectories. With only 10% of the entire replay dataset, REM and QR-DQN approximately recover the performance of fully-trained DQN.

[Figure 7: scores vs. training iteration on Asterix, Breakout, Pong, Q*Bert and Seaquest for offline QR-DQN and offline REM trained on the first 20M frames of logged DQN data.]
Figure 7: Offline RL with Lower Quality Dataset. Normalized scores (averaged over 3 runs) of QR-DQN and multi-head REM trained offline on stochastic version of 5 Atari 2600 games for 5X gradient steps using logged data from online DQN trained only for 20M frames (20 iterations). The horizontal line shows the performance of the best policy found during DQN training for 20M frames, which is significantly worse than fully-trained DQN. We observe qualitatively similar results to the offline setting with the entire replay dataset.

As expected, performance tends to increase as the fraction of data increases. With N ≥ 10%, REM and QR-DQN still perform comparably to online DQN on most of these games. However, the performance deteriorates drastically for N = 1%.

To see the effect of the quality of the offline dataset, we perform another ablation where we train offline REM and QR-DQN on the first 20 million frames in the DQN replay dataset. This reduced dataset roughly approximates exploration data with suboptimal returns. Similar to the offline results with the entire replay dataset, on most Atari games, offline REM and QR-DQN perform comparably and outperform the best policy amongst the mixture of policies which collected the first 20 million frames (Figure 7 and Figure A.7).

Algorithm Choice. In contrast to our offline results on Atari games with discrete actions (Section 5.1), Fujimoto et al. (2019b) find that standard off-policy RL agents are not effective on continuous control problems when trained offline, even with large and diverse replay datasets. The results of Fujimoto et al. (2019b) are based on the evaluation of a standard continuous control agent, called DDPG (Lillicrap et al., 2015); other more recent continuous control algorithms such as TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018) are not considered in their study.

Motivated by the so-called final buffer setting in Fujimoto et al. (2019b) (Section A.2), we train a DDPG agent on continuous control MuJoCo tasks (Todorov et al., 2012) for 1 million time steps and store all of the experienced transitions. Using this dataset, we train standard off-policy agents including TD3 and DDPG completely offline. Consistent with our offline results on Atari games, offline TD3 significantly outperforms the data-collecting DDPG agent and offline DDPG (Figure 8). Offline TD3 also performs comparably to BCQ (Fujimoto et al., 2019b), an algorithm designed specifically to learn from offline data.

7 Related work and Discussion

Our work is mainly related to batch RL2 (Lange et al., 2012). Similar to (Ernst et al., 2005; Riedmiller, 2005; Jaques et al., 2019), we investigate batch off-policy RL, which requires learning a good policy given a fixed dataset of interactions. In our offline setup, we only assume access to samples from the behavior policy and focus on Q-learning methods without any form of importance correction, as opposed to (Swaminathan and Joachims, 2015; Liu et al., 2019).

Recent work (Fujimoto et al., 2019b; Kumar et al., 2019; Wu et al., 2019; Siegel et al., 2020) reports that standard off-policy methods trained on fixed datasets fail on continuous control environments.

2To avoid confusion with batch vs. mini-batch optimization, we refer to batch RL as offline RL in this paper.

[Figure 8: average return vs. time steps (in 1e6) on HalfCheetah-v1, Hopper-v1 and Walker2d-v1 for offline TD3, offline DDPG, offline BCQ and the data-collecting DDPG.]
Figure 8: Offline Continuous Control Experiments. We examine the performance of DDPG, TD3 and BCQ (Fujimoto et al., 2019b) trained using identical offline data on three standard MuJoCo environments. We plot the mean performance (evaluated without exploration noise) across 5 runs. The shaded area represents a standard deviation. The bold black line measures the average return of episodes contained in the offline data collected using the DDPG agent (with exploration noise). Similar to our results on Atari 2600 games, we find that recent off-policy algorithms perform quite well in the offline setting with the entire DDPG replay data.

Fujimoto et al. (2019a) also observe that standard RL algorithms fail on Atari 2600 games when trained offline using trajectories collected by a single partially-trained DQN policy. The majority of recent papers focus on the offline RL setting with dataset(s) collected using a single data collection policy (e.g., random, expert, etc.) and propose remedies by regularizing the learned policy to stay close to training trajectories. These approaches improve stability in the offline setting; however, they introduce additional regularization terms and hyper-parameters, the selection of which is not straightforward (Wu et al., 2019). Fujimoto et al. (2019b) find that standard algorithms fail in the offline setting even when large and diverse logged datasets are available on continuous control tasks. Zhang and Sutton (2017) find that large replay buffers can hurt the performance of simple Q-learning methods with weak function approximators; they attribute this to the "off-policyness" of a large buffer, which might delay important transitions.

This paper focuses on the offline RL setting on Atari 2600 games with data collected from a large mixture of policies seen during the optimization of a DQN agent, rather than a single Markovian behavior policy. Our batch RL results on MuJoCo tasks (Figure 8) and Atari games demonstrate that recent off-policy deep RL algorithms (e.g., TD3 (Fujimoto et al., 2018), QR-DQN (Dabney et al., 2018)) are effective in the offline setting when sufficiently large and diverse datasets are available. Such algorithms do not explicitly correct for the distribution mismatch between the learned policy and the offline dataset. Recently, Cabi et al. (2019) also show success of standard RL algorithms in the offline setting on large-scale robotic datasets. We suspect that the suboptimal performance of DQN (Nature) in the offline setting is related to the notion of off-policyness of large buffers established by Zhang and Sutton (2017). However, robust deep Q-learning algorithms such as REM are able to effectively exploit logged DQN data, given a sufficient number of gradient updates. Our experiments show that dataset size and diversity, as well as the choice of RL algorithm, significantly affect the results in the offline setting, which can explain some of the discrepancy in recent work.

Our results emphasize the need for a rigorous characterization of the role of generalization due to deep neural nets when learning from offline data collected from a large mixture of (diverse) policies. Furthermore, they indicate the potential of offline RL for creating a data-driven RL paradigm where one could pretrain RL agents with large amounts of existing diverse datasets before further collecting new data via exploration, thus creating sample-efficient agents that can be deployed and continually learn in the real world.

8 Future Work

Since the (observation, action, next observation, reward) tuples in the DQN replay dataset are stored in the order they were experienced by online DQN during training, various data collection strategies for benchmarking offline RL can be induced by subsampling the replay dataset containing 200 million frames. For example, the first k million frames from the DQN replay dataset emulate exploration data with suboptimal returns (e.g., Figure 7), while the last k million frames are analogous to near-expert data with stochasticity. Another option is to randomly subsample the entire dataset to create smaller offline datasets (e.g., Figure 6). Based on the popularity and ease of experimentation on Atari 2600 games, the DQN replay dataset can be used for benchmarking offline RL, in addition to continuous control setups such as BRAC (Wu et al., 2019).

Experience replay-based algorithms can be more sample efficient than model-based approaches (Van Hasselt et al., 2019), and using the DQN replay dataset on Atari 2600 games for designing non-parametric replay models (Pan et al., 2018) and parametric world models (Kaiser et al., 2019) is another promising direction for improving sample efficiency in RL. We also leave further investigation of the exploitation ability of distributional RL to future work.

9 Conclusions

This paper investigates offline deep RL on Atari 2600 games based on logged experiences of a DQN agent. We demonstrate that it is possible to learn policies with high returns that outperform the data collection policies using recent standard off-policy RL algorithms. We develop REM, a robust RL algorithm that can effectively exploit off-policy data. The DQN replay dataset can serve as a testbed for offline RL. Overall, we present an optimistic view that robust RL algorithms can be developed which can effectively learn from large-scale off-policy datasets.

10 Acknowledgments

We thank Pablo Samuel Castro for help in understanding and debugging certain issues with the Dopamine codebase and for reviewing an early draft of the paper. We thank Robert Dadashi, Carles Gelada and Liam Fedus for helpful discussions. We also acknowledge Marc Bellemare, Zafarali Ahmed, Ofir Nachum, George Tucker, Archit Sharma, Aviral Kumar and William Chan for their review of an initial draft of the paper.

References

Oron Anschel, Nir Baram, and Nahum Shimkin. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. ICML, 2017.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013.

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. ICML, 2017.

Richard Bellman. Dynamic Programming. Princeton University Press, 1957.

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. JMLR, 2013.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Żołna, Yusuf Aytar, David Budden, Mel Vecerik, et al. A framework for data-driven robotics. arXiv preprint arXiv:1909.12200, 2019.

Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.

Will Dabney, Mark Rowland, Marc G Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. AAAI, 2018.

Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. CoRL, 2019.

Tim De Bruin, Jens Kober, Karl Tuyls, and Robert Babuška. The importance of experience replay database composition in deep reinforcement learning. Deep Reinforcement Learning Workshop, NIPS, 2015.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. CVPR, 2009.

Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901, 2019.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. JMLR, 2005.

Stefan Faußer and Friedhelm Schwenker. Neural network ensembles in reinforcement learning. Neural Processing Letters, 2015.

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. ICML, 2018.

Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau. Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708, 2019a.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. ICML, 2019b.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

PJ Huber. Robust estimation of a location parameter. Ann. Math. Stat., 1964.

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.

Stratton C Jaquette. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, 1973.

Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model-based reinforcement learning for Atari. arXiv preprint arXiv:1903.00374, 2019.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.

Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. NeurIPS, 2019.

Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. Reinforcement Learning, 2012.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015.

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 1992.

Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019.

Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. NeurIPS, 2016.

Yangchen Pan, Muhammad Zaheer, Adam White, Andrew Patterson, and Martha White. Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains. IJCAI, 2018.

Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.

Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, 2005.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. ICLR, 2016.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

Susan M Shortreed, Eric Laber, Daniel J Lizotte, T Scott Stroup, Joelle Pineau, and Susan A Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 2011.

Noah Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. ICLR, 2020.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 2014.

Alexander L. Strehl, John Langford, Lihong Li, and Sham Kakade. Learning from logged implicit exploration data. NeurIPS, 2010.

Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. JMLR, 2015.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. IROS, 2012.

Hado Van Hasselt, Matteo Hessel, and John Aslanides. When to use parametric models in reinforcement learning? NeurIPS, 2019.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 1992.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. CVPR, 2018.

Shangtong Zhang and Richard S. Sutton. A deeper look at experience replay. arXiv preprint arXiv:1712.01275, 2017.
Supplementary Material for An Optimistic Perspective on Offline Reinforcement Learning

A Appendix

A.1 Proofs

Proposition 1. Consider the assumptions: (a) The distribution P∆ has full support over the entire (K − 1)-simplex. (b) Only a finite number of distinct Q-functions globally minimize the loss in (3). (c) Q* is defined in terms of the MDP induced by the data distribution D. (d) Q* lies in the family of our function approximation. Then, at the global minimum of L(θ) (7) for a multi-head Q-network:

(i) Under assumptions (a) and (b), all the Q-heads represent identical Q-functions.
(ii) Under assumptions (a)–(d), the common convergence point is Q*.

Proof. Part (i): Under assumptions (a) and (b), we prove by contradiction that each Q-head must be identical to minimize the REM loss L(θ) (7). Note that we consider two Q-functions to be distinct only if they differ on any state s in D.

The REM loss is L(θ) = E_{α∼P∆}[L(α, θ)], where L(α, θ) is given by

    L(α, θ) = E_{s,a,r,s′∼D}[ ℓλ(∆^α_θ(s, a, r, s′)) ],    (8)
    ∆^α_θ = Σ_k α_k Q^k_θ(s, a) − r − γ max_{a′} Σ_k α_k Q^k_θ′(s′, a′).

If the heads Q^i_θ and Q^j_θ do not converge to identical Q-values at the global minimum of L(θ), it can be deduced using Lemma 1 that all the Q-functions given by the convex combinations α_i Q^i_θ + α_j Q^j_θ such that α_i + α_j = 1 minimize the loss in (3). This contradicts the assumption that only a finite number of distinct Q-functions globally minimize the loss in (3). Hence, all Q-heads represent an identical Q-function at the global minimum of L(θ).

Lemma 1. Assuming that the distribution P∆ has full support over the entire (K − 1)-simplex ∆^{K−1}, then at any global minimum of L(θ), the Q-function heads Q^k_θ for k = 1, . . . , K minimize L(α, θ) for any α ∈ ∆^{K−1}.

Proof. Let Q_{α*,θ*} = Σ_{k=1}^K α*_k Q^k_{θ*}(s, a), corresponding to the convex combination α* = (α*_1, · · · , α*_K), represent one of the global minima of L(α, θ) (8), i.e., L(α*, θ*) = min_{α,θ} L(α, θ) where α ∈ ∆^{K−1}. Any global minimum of L(θ) attains a value of L(α*, θ*) or higher since

    L(θ) = E_{α∼P∆}[L(α, θ)] ≥ E_{α∼P∆}[L(α*, θ*)] ≥ L(α*, θ*).    (9)

Let Q^k_{θ*}(s, a) = w^k_{θ*} · f_{θ*}(s, a), where f_{θ*}(s, a) ∈ R^D represents the shared features among the Q-heads and w^k_{θ*} ∈ R^D represents the weight vector in the final layer corresponding to the k-th head. Note that Q_{α*,θ*} can also be represented by each of the individual Q-heads using a weight vector given by the convex combination α* of the weight vectors (w^1_{θ*}, · · · , w^K_{θ*}), i.e., Q(s, a) = Σ_{k=1}^K α*_k w^k_{θ*} · f_{θ*}(s, a).

Let θ_I be such that Q^k_{θ_I} = Q_{α*,θ*} for all Q-heads. By definition of Q_{α*,θ*}, for all α ∼ P∆, L(α, θ_I) = L(α*, θ*), which implies that L(θ_I) = L(α*, θ*). Hence, θ_I corresponds to one of the global minima of L(θ), and any global minimum of L(θ) attains a value of L(α*, θ*).

Since L(α, θ) ≥ L(α*, θ*) for any α ∈ ∆^{K−1}, any θ_M such that L(θ_M) = L(α*, θ*) satisfies L(α, θ_M) = L(α*, θ*) for any α ∼ P∆. Therefore, at any global minimum of L(θ), the Q-function heads Q^k_θ for k = 1, . . . , K minimize L(α, θ) for any α ∈ ∆^{K−1}.

A.2 Offline continuous control experiments

We replicated the final buffer setup as described by Fujimoto et al. (2019b): We train a DDPG (Lillicrap et al., 2015) agent for 1 million time steps on three standard MuJoCo continuous control environments in OpenAI Gym (Todorov et al., 2012; Brockman et al., 2016), adding N(0, 0.5) Gaussian noise to actions for high exploration, and store all experienced transitions. This collection procedure creates a dataset with a diverse set of states and actions, with the aim of sufficient coverage. Similar to Fujimoto et al. (2019b), we train DDPG across 15 seeds, and select the 5 top performing seeds for dataset collection.

Using this logged dataset, we train standard continuous control off-policy actor-critic methods, namely DDPG and TD3 (Fujimoto et al., 2018), completely offline without any exploration. We also train a Batch-Constrained deep Q-learning (BCQ) agent, proposed by Fujimoto et al. (2019b), which restricts the action space to force the offline agent towards behaving close to on-policy w.r.t. a subset of the given data. We use the open source code generously provided by the authors at https://github.com/sfujim/BCQ and https://github.com/sfujim/TD3. We use the hyperparameters mentioned in (Fujimoto et al., 2018; 2019b), except offline TD3, which uses a learning rate of 0.0005 for both the actor and critic.

Figure 8 shows that offline TD3 significantly outperforms the behavior policy which collected the offline data as well as the offline DDPG agent. Noticeably, offline TD3 also performs comparably to BCQ, an algorithm designed specifically to learn from arbitrary, fixed offline data. While Fujimoto et al. (2019b) attribute the failure to learn in the offline setting to extrapolation error (i.e., the mismatch between the offline dataset and true state-action visitation of the current policy), our results suggest that failure to learn from diverse offline data may be linked to extrapolation error only for weak exploitation agents such as DDPG.
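A minimal sketch of this final buffer collection procedure, assuming the pre-0.26 Gym step API (reset returning an observation, step returning a 4-tuple) and a `policy(obs) -> action` callable for the DDPG actor; both are our assumptions for illustration, not the code used in the experiments.

```python
import numpy as np

def collect_final_buffer(env, policy, num_steps=1_000_000, noise_std=0.5, seed=0):
    """Roll out `policy` with N(0, noise_std) Gaussian action noise and log
    every experienced transition, as in the final buffer setup above."""
    rng = np.random.default_rng(seed)
    buffer = []
    obs = env.reset()
    for _ in range(num_steps):
        action = policy(obs) + rng.normal(0.0, noise_std, size=env.action_space.shape)
        action = np.clip(action, env.action_space.low, env.action_space.high)
        next_obs, reward, done, _ = env.step(action)
        buffer.append((obs, action, reward, next_obs, float(done)))
        obs = env.reset() if done else next_obs
    return buffer
```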

A.3 Score Normalization policy agents, step size and optimizer were taken as pub-
lished. Offline DQN (Adam) and all the offline agents
The improvement in normalized performance of an of-
with multi-head Q-network (Figure 2) use the Adam op-
fline agent, expressed as a percentage, over an online
timizer (Kingma and Ba, 2015) with same hyperparameters
DQN (Nature) (Mnih et al., 2015) agent is calculated as:
as online QR-DQN (Dabney et al., 2018) (lr = 0.00005,
100 × (Scorenormalized − 1) where:
Adam = 0.01/32). Note that scaling the loss has the same
ScoreAgent − Scoremin effect as inversely scaling Adam when using Adam.
Scorenormalized = , (10)
Scoremax − Scoremin Online Agents. For online REM shown in Figure 1b, we
Scoremin = min(ScoreDQN , ScoreRandom ) performed hyper-parameter tuning over Adam in (0.01/32,
Scoremax = max(ScoreDQN , ScoreRandom ) 0.005/32, 0.001/32) over 5 training games (Asterix, Break-
out, Pong, Q*Bert, Seaquest) and evaluated on the full set of
Here, ScoreDQN , ScoreRandom and ScoreAgent are the 60 Atari 2600 games using the best setting (lr = 0.00005,
mean evaluation scores averaged over 5 runs. We chose Adam = 0.001/32). Online REM uses 4 Q-value esti-
not to measure performance in terms of percentage of online mates calculated using separate Q-networks where each
DQN scores alone because a tiny difference relative to the network has the same architecture as originally used by
random agent on some games can translate into hundreds of online DQN (Nature). Similar to REM, our version of
percent in DQN score difference. Additionally, the max is Bootstrapped-DQN also uses 4 separate Q-networks and
needed since DQN performs worse than a random agent on Adam optimizer with identical hyperaparmeters (lr =
the games Solaris and Skiing. 0.00005, Adam = 0.001/32).

A.4 Hyperparameters & Experiment Details

In our experiments, we used the hyperparameters provided in Dopamine baselines (Castro et al., 2018) and report them for completeness and ease of reproducibility in Table 2. As noted in Dopamine's GitHub repository, changing these parameters can significantly affect performance, without necessarily being indicative of an algorithmic difference. We will also open source our code to further aid in reproducing our results.

The Atari environments (Bellemare et al., 2013) used in our experiments are stochastic due to sticky actions (Machado et al., 2018), i.e., there is a 25% chance at every time step that the environment executes the agent's previous action again instead of the agent's new action. All agents (online or offline) are compared using the best evaluation score (averaged over 5 runs) achieved during training, where the evaluation is done online every training iteration using an ε-greedy policy with ε = 0.001. We report offline training results with the same hyperparameters over 5 random seeds of the DQN replay data collection, game simulator and network initialization.
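For illustration only, the following minimal Python sketch shows how sticky actions can be emulated around any environment exposing reset() and step(action); the wrapper class, the assumed no-op action 0, and the interface are our own simplifications (the ALE implements this behaviour natively through its repeat-action probability setting).

import random

class StickyActionEnv:
    """Minimal sketch of sticky actions (Machado et al., 2018)."""

    def __init__(self, env, sticky_prob=0.25):
        self.env = env               # any object with reset() and step(action)
        self.sticky_prob = sticky_prob
        self.last_action = 0         # assumes 0 is a valid no-op action

    def reset(self):
        self.last_action = 0
        return self.env.reset()

    def step(self, action):
        # With probability sticky_prob, execute the previous action instead of
        # the newly selected one, then remember whichever action was executed.
        if random.random() < self.sticky_prob:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)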
DQN replay dataset collection. For collecting the offline data used in our experiments, we use online DQN (Nature) (Mnih et al., 2015) with the RMSprop (Tieleman and Hinton, 2012) optimizer. The DQN replay dataset, B_DQN, consists of approximately 50 million experience tuples for each run per game, which corresponds to 200 million frames due to a frame skip of four, i.e., repeating a selected action for four consecutive frames. Note that the total dataset size is approximately 15 billion tuples (50 million tuples per agent × 5 agents per game × 60 games).

Optimizer related hyperparameters. For existing off-policy agents, step size and optimizer were taken as published. Offline DQN (Adam) and all the offline agents with a multi-head Q-network (Figure 2) use the Adam optimizer (Kingma and Ba, 2015) with the same hyperparameters as online QR-DQN (Dabney et al., 2018) (lr = 0.00005, ε_Adam = 0.01/32). Note that scaling the loss has the same effect as inversely scaling ε_Adam when using Adam (a short derivation is sketched at the end of this subsection).

Online Agents. For online REM shown in Figure 1b, we performed hyper-parameter tuning over ε_Adam in (0.01/32, 0.005/32, 0.001/32) on 5 training games (Asterix, Breakout, Pong, Q*Bert, Seaquest) and evaluated on the full set of 60 Atari 2600 games using the best setting (lr = 0.00005, ε_Adam = 0.001/32). Online REM uses 4 Q-value estimates computed using separate Q-networks, where each network has the same architecture as originally used by online DQN (Nature). Similar to REM, our version of Bootstrapped-DQN also uses 4 separate Q-networks and the Adam optimizer with identical hyperparameters (lr = 0.00005, ε_Adam = 0.001/32).

Wall-clock time for offline experiments. The offline experiments are approximately 3X faster than the online experiments for the same number of gradient steps on a P100 GPU. In Figure 4 and ??(a), the offline agents are trained for 5X gradient steps; thus, these experiments are 1.67X slower than running online DQN for 200 million frames (standard protocol). Furthermore, since the offline experiments do not require any data generation, tricks from supervised learning such as batch sizes much larger than 32 with TPUs / multiple GPUs would lead to a significant speed up.
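To see why loss scaling and ε_Adam scaling are interchangeable, consider a sketch of the standard Adam update with step size α, first-moment estimate m_t and second-moment estimate v_t of the gradients (bias correction is omitted since a constant factor passes through the argument unchanged):

\[
\Delta\theta_t \;=\; -\alpha\,\frac{m_t}{\sqrt{v_t}+\epsilon}
\;\;\xrightarrow{\;\text{loss}\,\times\,c\;}\;\;
-\alpha\,\frac{c\,m_t}{\sqrt{c^2 v_t}+\epsilon}
\;=\; -\alpha\,\frac{m_t}{\sqrt{v_t}+\epsilon/c}.
\]

Multiplying the loss by a constant c > 0 multiplies the gradients, hence m_t, by c and v_t by c^2, so the update is identical to the unscaled one with ε replaced by ε/c.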


A.5 Additional Plots & Tables

Table 2: The hyperparameters used by the offline and online RL agents in our experiments.

Hyperparameter                                   Setting (for both variations)

Sticky actions                                   Yes
Sticky action probability                        0.25
Grey-scaling                                     True
Observation down-sampling                        (84, 84)
Frames stacked                                   4
Frame skip (Action repetitions)                  4
Reward clipping                                  [-1, 1]
Terminal condition                               Game Over
Max frames per episode                           108K
Discount factor                                  0.99
Mini-batch size                                  32
Target network update period                     every 2000 updates
Training steps per iteration                     250K
Update period                                    every 4 steps
Evaluation ε                                     0.001
Evaluation steps per iteration                   125K
Q-network: channels                              32, 64, 64
Q-network: filter size                           8 × 8, 4 × 4, 3 × 3
Q-network: stride                                4, 2, 1
Q-network: hidden units                          512
Multi-head Q-network: number of Q-heads          200
Hardware                                         Tesla P100 GPU

Hyperparameter                                   Online                 Offline

Min replay size for sampling                     20,000                 -
Training ε (for ε-greedy exploration)            0.01                   -
ε-decay schedule                                 250K steps             -
Fixed Replay Memory                              No                     Yes
Replay Memory size                               1,000,000 steps        50,000,000 steps
Replay Scheme                                    Uniform                Uniform
Training Iterations                              200                    200 or 1000

Table 3: Median normalized scores (Section A.3) across the stochastic version of 60 Atari 2600 games, measured as percentages, and the number of games where an agent achieves better scores than a fully trained online DQN (Nature) agent. All the offline agents below are trained using the DQN replay dataset. Entries without any suffix report training results with five times as many gradient steps as online DQN, while entries with the suffix (1x) indicate the same number of gradient steps as the online DQN agent. All the offline agents except DQN use the same multi-head architecture as QR-DQN.

Offline agent                  Median    > DQN      Offline agent                  Median    > DQN

DQN (Nature) (1x)              74.4%     10         DQN (Nature)                   83.4%     17
DQN (Adam) (1x)                104.6%    39         DQN (Adam)                     111.9%    41
Ensemble-DQN (1x)              92.5%     26         Ensemble-DQN                   111.0%    39
Averaged Ensemble-DQN (1x)     88.6%     24         Averaged Ensemble-DQN          112.1%    43
QR-DQN (1x)                    115.0%    44         QR-DQN                         118.9%    45
REM (1x)                       103.7%    35         REM                            123.8%    49
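The two summary statistics in Table 3 can be reproduced from per-game scores with the sketch below; the three dictionaries are assumed inputs (game name mapped to mean evaluation score), not data shipped with the paper.

import numpy as np

def summarize(agent_scores, dqn_scores, random_scores):
    """Median normalized score (in %) and count of games beating online DQN."""
    normalized, games_above_dqn = [], 0
    for game, score in agent_scores.items():
        lo = min(dqn_scores[game], random_scores[game])
        hi = max(dqn_scores[game], random_scores[game])
        normalized.append(100.0 * (score - lo) / (hi - lo))
        games_above_dqn += int(score > dqn_scores[game])
    return float(np.median(normalized)), games_above_dqn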
[Per-game bar charts and online learning curves appeared here; only the recovered captions are kept below.]

Figure A.1: Normalized Performance improvement (in %) over online DQN (Nature), per game, of (a) offline Ensemble-DQN and (b) offline REM trained using the DQN replay dataset for the same number of gradient steps as online DQN. The normalized online score for each game is 0.0 and 1.0 for the worse and better performing agent among fully trained online DQN and random agents respectively.

Figure A.2: Normalized Performance improvement (in %) over online DQN (Nature), per game, of (a) offline QR-DQN (5X) and (b) offline REM (5X) trained using the DQN replay dataset for five times as many gradient steps as online DQN. The normalized online score for each game is 0.0 and 1.0 for the worse and better performing agent among fully trained online DQN and random agents respectively.

[Figure, number not recoverable] Online REM vs. baselines. Scores for online agents trained for 200 million ALE frames. Scores are averaged over 3 runs (shown as traces) and smoothed over a sliding window of 5 iterations and error bands show standard deviation.

[Figure A.3 panels omitted: average online scores (Average Scores vs. Iteration) on Asterix, Breakout, Pong, Q*Bert, Seaquest and Space Invaders for Offline Multi-network REM, Offline Multi-head REM and Offline QR-DQN; (a) REM with 4 Q-value estimates (K = 4), (b) REM with 16 Q-value estimates (K = 16).]
Figure A.3: REM with Separate Q-networks. Average online scores of offline REM variants with different architectures and QR-DQN
trained on stochastic version of 6 Atari 2600 games for 500 iterations using the DQN replay dataset. The scores are averaged over 5 runs
(shown as traces) and smoothed over a sliding window of 5 iterations and error bands show standard deviation. The multi-network REM
and the multi-head REM employ K Q-value estimates computed using separate Q-networks and Q-heads of a multi-head Q-network
respectively and are optimized with identical hyperparameters. Multi-network REM improves upon the multi-head REM indicating that
the more diverse Q-estimates provided by the separate Q-networks improve performance of REM over Q-estimates provided by the
multi-head Q-network with shared features.
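To make the comparison in Figure A.3 concrete, the sketch below contrasts the two ways of obtaining K Q-value estimates and forms the random convex combination of estimates that REM trains on; the array shapes, the stand-in random outputs, and the simplex sampler (normalized i.i.d. uniform weights) are illustrative assumptions rather than the released implementation.

import numpy as np

rng = np.random.default_rng(0)
K, num_actions = 4, 18   # K estimates; 18 actions assumed (full Atari action set)

# Multi-head REM: one shared torso outputs K heads at once, giving a
# [K, num_actions] array of Q-value estimates for a state.
# Multi-network REM: K independent Q-networks each output [num_actions];
# stacking them yields the same shape, but without shared features the
# estimates tend to be more diverse.
q_estimates = rng.normal(size=(K, num_actions))   # stand-in for network outputs

# Random convex combination of the K estimates (weights on the simplex),
# resampled during training.
alpha = rng.uniform(size=K)
alpha /= alpha.sum()

q_combined = alpha @ q_estimates                  # shape: [num_actions]
greedy_action = int(np.argmax(q_combined))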

[Figure A.4 panels omitted: per-game learning curves (Average Score vs. Iteration) for Online DQN (Nature), Offline DQN (Nature) and Offline QR-DQN on all 60 games.]
Figure A.4: Average evaluation scores across stochastic version of 60 Atari 2600 games for online DQN, offline DQN and offline QR-DQN
trained for 200 iterations. The offline agents are trained using the DQN replay dataset. The scores are averaged over 5 runs (shown
as traces) and smoothed over a sliding window of 5 iterations and error bands show standard deviation. The horizontal line shows the
performance of the best policy (averaged over 5 runs) found during training of online DQN.
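Several captions in this appendix refer to averaging over runs, smoothing with a sliding window of 5 iterations, and standard-deviation error bands; a minimal sketch of that post-processing is given below (the array shape and the choice to smooth each run before aggregating are assumptions).

import numpy as np

def curves_for_plotting(scores, window=5):
    """scores: array of shape [num_runs, num_iterations] of evaluation scores.

    Returns the per-iteration mean and standard deviation after smoothing
    each run with a moving average over `window` iterations."""
    kernel = np.ones(window) / window
    smoothed = np.stack(
        [np.convolve(run, kernel, mode="valid") for run in scores])
    return smoothed.mean(axis=0), smoothed.std(axis=0)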

[Figure A.5 panels omitted: per-game learning curves (Average Score vs. Iteration, x-axis spanning 1000 iterations) for Offline DQN (Adam), Offline Ensemble-DQN, Offline QR-DQN and Offline REM on all 60 games.]

Figure A.5: Average evaluation scores across stochastic version of 60 Atari 2600 games of DQN (Adam), Ensemble-DQN, QR-DQN and REM agents trained offline using the DQN replay dataset. The horizontal line for online DQN shows the best evaluation performance it obtains during training. All the offline agents except DQN use the same multi-head architecture with K = 200 heads. The scores are averaged over 5 runs (shown as traces) and smoothed over a sliding window of 5 iterations and error bands show standard deviation.

[Figure A.6 panels omitted: per-game learning curves (Average Score vs. Iteration) for online DQN (Nature), C51, Bootstrapped-DQN, QR-DQN and REM on all 60 games.]

Figure A.6: Online results. Average evaluation scores across stochastic version of 60 Atari 2600 games of DQN, C51, QR-DQN,
Bootstrapped-DQN and REM agents trained online for 200 million game frames (standard protocol). The scores are averaged over 5
runs (shown as traces) and smoothed over a sliding window of 5 iterations and error bands show standard deviation.

[Figure A.7 panels omitted: per-game learning curves (Average Score vs. Iteration, x-axis spanning 100 iterations) for Offline QR-DQN and Offline REM on all 60 games.]

Figure A.7: Effect of Dataset Quality. Normalized scores (averaged over 3 runs) of QR-DQN and multi-head REM trained offline on the stochastic version of 60 Atari 2600 games for 5X gradient steps using logged data from online DQN trained only for 20M frames (20 iterations). The horizontal line shows the performance of the best policy found during DQN training for 20M frames, which is significantly worse than fully-trained DQN.
