Professional Documents
Culture Documents
Maximilian Igl 1 Luisa Zintgraf 1 Tuan Anh Le 1 Frank Wood 2 Shimon Whiteson 1
Abstract
Many real-world sequential decision making prob-
lems are partially observable by nature, and the
arXiv:1806.02426v1 [cs.LG] 6 Jun 2018
trajectories since distinct histories can lead to similar or ing to the transition distribution, st+1 ∼ F (st+1 |st , at ).
identical beliefs. Subsequently, the agent receives a noisy or partially oc-
cluded observation ot+1 ∈ O according to the distribution
The premise of this work is that deep policy learning for
ot+1 ∼ U (ot+1 |st+1 , at ), and a reward rt+1 ∈ R according
POMDP s can be improved by taking less of a black box
to the distribution rt+1 ∼ R(rt+1 |st+1 , at ).
approach than DRQN and ADRQN. While we do not want to
assume prior knowledge of the transition and observation An agent acts according to its policy π(at |o≤t , a<t ) which
functions or the latent state representation, we want to allow returns the probability of taking action at at time t, and
the agent to learn models of them and infer the belief state where o≤t = (o1 , . . . , ot ) and a<t = (a0 , . . . , at−1 ) are the
using this learned model. observation and action histories, respectively. The agent’s
goal is to learn a policy π that maximises the expected future
To this end, we propose DVRL, which implements this ap-
return
proach by providing a helpful inductive bias to the agent. " T #
In particular, we develop an algorithm that can learn an
X
J = Ep(τ ) γ t−1 rt , (1)
internal generative model and use it to perform approximate t=1
inference to update the belief state. Crucially, the generative
model is not only learned based on an ELBO objective, but over trajectories τ = (s0 , a0 , . . . , aT −1 , sT ) induced by its
also by how well it enables maximisation of the expected policy1 , where 0 ≤ γ < 1 is the discount factor. We follow
return. This ensures that, unlike in an unsupervised appli- the convention of setting a0 to no-op (Zhu et al., 2017).
cation of variational autoencoders (VAEs), the latent state In general, a POMDP agent must condition its actions on the
representation and the inference performed on it are suitable entire history (o≤t , a<t ) ∈ Ht . The exponential growth in
for the ultimate control task. Specifically, we develop an t of Ht can be addressed, e.g., with suffix trees (McCallum
approximation to the ELBO based on autoencoding sequen- & Ballard, 1996; Shani et al., 2005; Bellemare et al., 2014;
tial Monte Carlo (AESMC) (Le et al., 2018), allowing joint Bellemare, 2015; Messias & Whiteson, 2017). However,
optimisation with the n-step policy gradient update. Uncer- those approaches suffer from large memory requirements
tainty in the belief state is captured by a particle ensemble. and are only suitable for small discrete observation spaces.
A high-level overview of our approach in comparison to
previous RNN-based methods is shown in Figure 1. Alternatively, it is possible to infer the filtering distribution
p(st |o≤t , a<t ) =: bt , called the belief state. This is a suffi-
We evaluate our approach on Mountain Hike and several cient statistic of the history that can be used as input to an
flickering Atari games. On Mountain Hike, a low dimen- optimal policy π ? (at |bt ). The belief space does not grow
sional, continuous environment, we can show that DVRL is exponentially, but the belief update step requires knowledge
better than an RNN based approach at inferring the required of the model:
information from past observations for optimal action se-
R
lection in a simple setting. Our results on flickering Atari bt U (ot+1 |st+1 , at )F (st+1 |st , at )dst
show that this advantage extends to complex environments bt+1 = R R .
bt U (ot+1 |st+1 , at )F (st+1 |st , at ) dst dst+1
with high dimensional observation spaces. Here, partial (2)
observability is introduced by (1) using only a single frame
as input at each time step and (2) returning a blank screen
instead of the true frame with probability 0.5. 2.2. Variational Autoencoder
We define a family of priors pθ (s) over some latent state
2. Background s and decoders pθ (o|s) over observations o, both param-
eterised by θ. A variational autoencoder (VAE) learns θ
In this section, we formalise POMDPs and provide back- by
ground on recent advances in VAEs that we use. Lastly, we PNmaximising(n) the sum of log marginal likelihood terms
R n=1 log p θ (o ) for a dataset (o(n) )N
n=1 where pθ (o) =
describe the policy gradient loss based on n-step learning pθ (o|s)pθ (s) ds (Rezende et al., 2014; Kingma & Welling,
and A2C. 2014)) . Since evaluating the log marginal likelihood is in-
tractable, the VAE instead maximises a sum of ELBOs where
2.1. Partially Observable Markov Decision Processes each individual ELBO term is a lower bound on the log
A partially observable Markov decision process (POMDP) is marginal likelihood,
a tuple (S, A, O, F, U, R, b0 ), where S is the state space, A
pθ (o|s)pθ (s)
the action space, and O the observation space. We denote ELBO (θ, φ, o) = Eqφ (s|o) log , (3)
as st ∈ S the latent state at time t, and the distribution qφ (s|o)
over initial states s0 as b0 , the initial belief state. When 1
The trajectory length T is stochastic and depends on the time
an action at ∈ A is executed, the state changes accord- at which the agent-environment interaction ends.
Deep Variational Reinforcement Learning
for a family of encoders qφ (s|o) parameterised by φ. This et al., 2017; Wu et al., 2017), the synchronous simplification
objective also forces qφ (s|o) to approximate the posterior of asynchronous advantage actor-critic (A3C) (Mnih et al.,
pθ (s|o) under the learned model. Gradients of (3) are esti- 2016). An actor-critic approach can cope with continuous
mated by Monte Carlo sampling with the reparameterisation actions and avoids the need to draw state-action sequences
trick (Kingma & Welling, 2014; Rezende et al., 2014). from a replay buffer. The method proposed in this paper is
however equally applicable to other deep RL algorithms.
2.3. VAE for Time Series For n-step learning, starting at time t, the current policy
For sequential data, we assume that a series of latent states performs ns consecutive steps in ne parallel environments.
s≤T gives rise to a series of observations o≤T . We consider The gradient update is based on this mini-batch of size
a family of generative models parameterised by θ that con- ne × ns . The target for the value-function Vη (st+i ), i =
sists of the initial distribution pθ (s0 ), transition distribution 0, . . . , ns − 1, parameterised by η, is the appropriately dis-
pθ (st |st−1 ) and observation distribution pθ (ot |st ). Given a counted sum of on-policy rewards up until time t + ns and
family of encoder distributions qφ (st |st−1 , ot ), we can also the off-policy bootstrapped value Vη− (st+ns ). The minus
estimate the gradient of the ELBO term in the same manner sign denotes that no gradients are propagated through this
as in (3), noting that: value. Defining the advantage function as
T nsX−i−1
Y
pθ (s≤T , o≤T ) = pθ (s0 ) pθ (st |st−1 )pθ (ot |st ), (4) At,i
η (st+i , at+i ) :=
γ j rt+i+j
t=1 j=0
(8)
T
Y + γ ns −i Vη− (st+ns ) − Vη (st+i ) ,
qφ (s≤T |o≤T ) = pθ (s0 ) qφ (st |st−1 , ot ), (5)
t=1 the A2C loss for the policy parameters ρ at time t is
where we slightly abuse notation for qφ by ignoring the fact 1
that we sample from the model pθ (s0 ) for t = 0. Le et al. LA
t (ρ) = −
ne ns
(2018), Maddison et al. (2017) and Naesseth et al. (2018) s −1
ne nX
X (9)
introduce a new ELBO objective based on sequential Monte log πρ (at+i |st+i )At,i,−
η (st+i , at+i ).
Carlo (SMC) (Doucet & Johansen, 2009) that allows faster envs i=0
learning in time series:
and the value function loss to learn η can be written as
" T K
!#
X 1 X k ne nXs −1
ELBO SMC (θ, φ, o≤T ) = E log wt , (6) 1 X
K LVt (η) = At,i (st+i , at+i )2 . (10)
t=1 k=1 ne ns envs i=0 η
where K is the number of particles and wtk is the weight
of particle k at time t. Each particle is a tuple containing Lastly, an entropy loss is added to encourage exploration:
a weight wtk and a value skt which is obtained as follows. ne nXs −1
Let sk0 be samples from pθ (s0 ) for k = 1, . . . , K. For 1 X
LH
t (ρ) = − H(πρ (·|st+i )), (11)
t = 1, . . . , T , the weights wtk are obtained by resampling ne ns envs i=0
the particle set (skt−1 )K k=1 proportionally to the previous
weights and computing where H(·) is the entropy of a distribution.
uk
pθ (skt |st−1
t−1
)pθ (ot |skt ) 3. Deep Variational Reinforcement Learning
wtk = uk
, (7)
qφ (skt |st−1
t−1
, ot ) Fundamentally, there are two approaches to aggregating the
history in the presence of partial observability: remembering
where skt corresponds to a value sampled from features of the past or maintaining beliefs.
ukt−1 t−1uk
qφ (·|st−1 , ot ) and st−1 corresponds to the re- In most previous work, including ADRQN (Zhu et al., 2017),
sampled particle with the ancestor index uk0 = k the current history (a≤t , o<t ) is encoded by an RNN, which
PK j
and ukt−1 ∼ Discrete((wt−1 k
/ j=1 wt−1 )K
k=1 ) for leads to the recurrent update equation for the latent state ht :
t = 2, . . . , T .
ht = RNNUpdateφ (ht−1 , at−1 , ot ) . (12)
2.4. A2C
Since this approach is model-free, it is unlikely to approx-
One way to learn the parameters ρ of an agent’s policy imate belief update steps, instead relying on memory or
πρ (at |st ) is to use n-step learning with A2C (Dhariwal simple heuristics.
Deep Variational Reinforcement Learning
Inspired by the premise that a good way to solve many triplet (hkt , ztk , wtk ) (Chung et al., 2015). The value hkt of
POMDP s involves (1) estimating the transition and obser- particle k is the latent state of an RNN; ztk is an additional
vation model of the environment, (2) performing inference stochastic latent state that allows us to learn stochastic tran-
under this model, and (3) choosing an action based on the sition models; and wtk assigns each particle an importance
inferred belief state, we propose deep variational reinforce- weight.
ment learning (DVRL). It extends the RNN-based approach
Our latent state b̂t is thus an approximation of the belief
to explicitly support belief inference. Training everything
state in our learned model
end-to-end shapes the learned model to be useful for the RL
task at hand, and not only for predicting observations. T
Y
We first explain our baseline architecture and training pθ (h≤T , z≤T , o≤T |a<T ) = pθ (h0 ) pθ (zt |ht−1 , at−1 )
t=1
method in Section 3.1. For a fair comparison, we modify
the original architecture of Zhu et al. (2017) in several ways. pθ (ot |ht−1 , zt , at−1 )δψθRNN (ht−1 ,zt ,at−1 ,ot ) (ht ), (14)
We find that our new baseline outperforms their reported
results in the majority of cases. with stochastic transition model pθ (zt |ht−1 , at−1 ), decoder
pθ (ot |ht−1 , zt , at−1 ), and deterministic transition function
In Sections 3.2 and 3.3, we explain our new latent belief ht = ψθRNN (ht−1 , zt , at−1 , ot ) which is denoted using the
state b̂t and the recurrent update function delta-distribution δ and for which we use an RNN. The
model is trained to jointly optimise the ELBO and the ex-
b̂t = BeliefUpdateθ,φ (b̂t−1 , at−1 , ot ) (13)
pected return.
which replaces (12). Lastly, in Section 3.4, we describe our
modified loss function, which allows learning the model 3.3. Recurrent Latent State Update
jointly with the policy. To update the latent state, we proceed as follows:
!
3.1. Improving the Baseline k
wt−1
k
ut−1 ∼ Discrete PK j
(15)
While previous work often used Q-learning to train the j=1 wt−1
policy (Hausknecht & Stone, 2015; Zhu et al., 2017; Foerster uk
et al., 2016; Narasimhan et al., 2015), we use n-step A2C. ztk ∼ qφ (ztk |ht−1
t−1
, at−1 , ot ) (16)
This avoids drawing entire trajectories from a replay buffer uk
hkt = ψθRNN (ht−1 , ztk , at−1 , ot )
t−1
(17)
and allows continuous actions.
uk uk
pθ (ztk |ht−1 , at−1 )pθ (ot |ht−1 , ztk , at−1 )
t−1 t−1
Furthermore, since A2C interleaves unrolled trajectories and wtk = uk
. (18)
performs a parameter update only every ns steps, it makes it qφ (ztk |ht−1
t−1
, at−1 , ot )
feasible to maintain an approximately correct latent state. A
small bias is introduced by not recomputing the latent state First, we resample particles based on their weight by draw-
after each gradient update step. ing ancestor indices ukt−1 . This improves model learning
(Le et al., 2018; Maddison et al., 2017) and allows us to
We also modify the implementation of backpropagation-
train the model jointly with the n-step loss (see Section 3.4).
throught-time (BPTT) for n-step A2C in the case of policies
with latent states. Instead of backpropagating gradients For k = 1 . . . K, new values for ztk are sampled from the
only through the computation graph of the current update uk
encoder qφ (ztk |ht−1
t−1
, at−1 , ot ) which conditions on the re-
involving ns steps, we set the size of the computation graph t−1 uk
independently to involve ng steps. This leads to an aver- sampled ancestor values ht−1 as well as the last actions
age BPTT-length of (ng − 1)/2.2 This decouples the bias- at−1 and current observation ot . Latent variables zt are
variance tradeoff of choosing ns from the bias-runtime trade- sampled using the reparameterisation trick. The values ztk ,
uk
t−1
off of choosing ng . Our experiments show that choosing together with ht−1 , at−1 and ot , are then passed to the
ng > ns greatly improves the agent’s performance. transition function ψθRNN to compute hkt .
The weights wtk measure how likely each new latent state
3.2. Extending the Latent State
value (ztk , hkt ) is under the model and how well it explains
For DVRL, we extend the latent state to be a set of K par- the current observation.
ticles, capturing the uncertainty in the belief state (Thrun,
To condition the policy on the belief b̂t = (ztk , hkt , wtk )K k=1 ,
2000; Silver & Veness, 2010). Each particle consists of the
we need to encode the set of particles into a vector represen-
2
This is implemented in PyTorch using the tation ĥt . We use a second RNN that sequentially takes in
retain graph=True flag in the backward() function. each tuple (ztk , hkt , wtk ) and its last latent state is ĥt .
Deep Variational Reinforcement Learning
Figure 2: Overview of DVRL. We do the following K times to compute our new belief b̂t : Sample an ancestor index ukt−1 based on the
1:K uk
previous weights wt−1 t−1
(Eq. 15). Pick the ancestor particle value ht−1 and use it to sample a new stochastic latent state ztk from the
k k
encoder qφ (Eq. 16). Compute ht (Eq. 17) and wt (Eq. 18). Aggregate all K values into the new belief b̂t and summarise them into a
vector representation ĥt using a second RNN. Actor and critic can now condition on ĥt and b̂t is used as input for the next timestep. Red
arrows denote random sampling, green arrows the aggregation of K values. Black solid arrows denote the passing of a value as argument
to a function and black dashed ones the evaluation of a value under a distribution. Boxes indicate neural networks. Distributions are
normal or Bernoulli distributions whose parameters are outputs of the neural network.
Additional encoders are used for at , ot and zt ; see Appendix conditioned ELBO:
A for details. Figure 2 summarises the entire update step.
1
Ep(τ ) ELBOSMC (o≤T |a<T ) =
T
3.4. Loss Function " T K
! #
1 X 1 X k
To encourage learning a model, we include the term Ep(τ ) E log wt a≤T , (21)
T t=1
K
k=1
ne nXs −1 K
! which is a conditional extension of (6) similar to the exten-
1 X 1 X k sion of VAEs by Sohn et al. (2015). The expectation over
LtELBO (θ, φ) = − log wt+i
ne ns envs i=0 K p(τ ) is approximated by sampling trajectories and the sum
k=1
PT
(19) over the entire trajectory is approximated by a sum
Pt=1
t+ns −1
i=t over only a part of it.
in each gradient update every ns steps. This leads to the The importance of the resampling step (15) in allowing this
overall loss: approximation becomes clear if we compare (21) with the
ELBO for the importance weighted autoencoder ( IWAE) that
does not include resampling (Doucet & Johansen, 2009;
LtDVRL (ρ, η, θ, φ) = LA H H
t (ρ, θ, φ) + λ Lt (ρ, θ, φ)+ Burda et al., 2016):
λV LVt (η, θ, φ) + λE LtELBO (θ, φ) . (20) " K T
! #
1 X Y k
ELBO IWAE (o≤T |a<T ) = E log wt a≤T .
K
Compared to (9), (10) and (11), the losses now also depend k=1 t=1
(22)
on the encoder parameters φ and, for DVRL, model parame-
Because this loss is not additive over time, we cannot ap-
ters θ, since the policy and value function now condition on
proximate it with shorter parts of the trajectory.
the latent states instead of st . By introducing the n-step ap-
proximation LELBO
t , we can learn θ and φ to jointly optimise
the ELBO and the RL loss LA H H
t + λ Lt + λ Lt .
V V 4. Related Work
If we assume that observations and actions are drawn from Most existing POMDP literature focusses on planning algo-
the stationary state distribution induced by the policy πρ , rithms, where the transition and observation functions, as
then LELBO
t is a stochastic approximation to the action- well as a representation of the latent state space, are known
Deep Variational Reinforcement Learning
(Barto et al., 1995; McAllester & Singh, 1999; Pineau et al., with a theory operating on belief states.
2003; Ross et al., 2008; Oliehoek et al., 2008; Roijers et al.,
2015). In most realistic domains however, these are not 5. Experiments
known a priori.
We evaluate DVRL on Mountain Hike and on flickering Atari.
There are several approaches that utilise RNNs in POMDPs
We show that DVRL deals better with noisy or partially oc-
(Bakker, 2002; Wierstra et al., 2007; Zhang et al., 2015;
cluded observations and that this scales to high dimensional
Heess et al., 2015), including multi-agent settings (Foerster
and continuous observation spaces like images and complex
et al., 2016), learning text-based fantasy games (Narasimhan
tasks. We also perform a series of ablation studies, showing
et al., 2015) or, most recently, applied to Atari (Hausknecht
the importance of using many particles, including the ELBO
& Stone, 2015; Zhu et al., 2017). As discussed in Section
training objective in the loss function, and jointly optimising
3, our algorithm extends those approaches by enabling the
the ELBO and RL losses.
policy to explicitly reason about a model and the belief state.
More details about the environments and model architec-
Another more specialised approach called QMDP-Net
tures can be found in Appendix A together with additional
(Karkus et al., 2017) learns a value iteration network (VIN)
results and visualisations. All plots and reported results are
(Tamar et al., 2016) end-to-end and uses it as a transition
smoothed over time and parallel executed environments. We
model for planning. However, a VIN makes strong assump-
average over five random seeds, with shaded areas indicating
tions about the transition function and in QMDP-Net the
the standard deviation.
belief update must be performed analytically.
The idea to learn a particle filter based policy that is trained 5.1. Mountain Hike
using policy gradients was previously proposed by Coquelin
et al. (2009). However, they assume a known model and 10 0.0
rely on finite differences for gradient estimation. −0.6
Instead of optimising an ELBO to learn a maximum- 5 −1.2
likelihood approximation for the latent representation and −1.8
corresponding transition and observation model, previous −2.4
0
work also tried to learn those dynamics using spectral meth-
−3.0
ods (Azizzadenesheli et al., 2016), a Bayesian approach
−5 −3.6
(Ross et al., 2011; Katt et al., 2017), or nonparametrically
DVRL −4.2
(Doshi-Velez et al., 2015). However, these approaches do RNN
not scale to large or continuous state and observation spaces. −4.8
−10
For continuous states, actions, and observations with Gaus- −10 −5 0 5 10
sian noise, a Gaussian process model can be learned (Deisen-
roth & Peters, 2012). An alternative to learning an (approxi- Figure 3: Mountain Hike is a continuous control task with obser-
mate) transition and observation model is to learn a model vation noise σo = 3. Background colour indicates rewards. Red
over trajectories (Willems et al., 1995). However, this is line: trajectory for RNN based encoder. Black line: trajectory for
again only possible for small, discrete observation spaces. DVRL encoder. Dots: received observations. Both runs share the
same noise values i,t . DVRL achieves higher returns (see Fig. 4)
Due to the complexity of the learning in POMDPs, previous by better estimating its current location and remaining on the high
work already found benefits to using auxiliary losses. Un- reward mountain ridge.
like the losses proposed by Lample & Chaplot (2017), we
do not require additional information from the environment. In this task, the agent has to navigate along a mountain
The UNREAL agent (Jaderberg et al., 2016) is, similarly ridge, but only receives noisy measurements of its current
to our work, motivated by the idea to improve the latent location. Specifically, we have S = O = A = R2 where
representation by utilising all the information already ob- st = [xt , yt ]T ∈ S and ot = [x̂t , ŷt ]T ∈ O are true and
tained from the environment. While their work focuses observed coordinates respectively and at = [∆xt , ∆yt ]T ∈
on finding unsupervised auxiliary losses that provide good A is the desired step. Transitions are given by st+1 =
training signals, our goal is to use the auxiliary loss to better st + ãt +s,t with s,t ∼ N (·|0, 0.25·I) and ãt is the vector
align the network computations with the task at hand by at with length capped to kãt k ≤ 0.5. Observations are
incorporating prior knowledge as an inductive bias. noisy with ot = st +o,t with o,t ∼ N (·|0, σo ·I) and σo ∈
{0, 1.5, 3}. The reward at each timestep is Rt = r(xt , yt )−
There is some evidence from recent experiments on the 0.01kat k where r(xt , yt ) is shown in Figure 3. The starting
dopamine system in mice (Babayan et al., 2018) showing position is sampled from s0 ∼ N (·|[−8.5, −8.5]T , I) and
that their response to ambiguous information is consistent each episode ends after 75 steps.
Deep Variational Reinforcement Learning
DVRL used 30 particles and we set ng = 25 for both RNN the actual observation. Furthermore, only one frame is
and DVRL. The latent state h for the RNN-encoder archi- used as the observation. This is in line with previous work
tecture was of dimension 256 and 128 for both z and h for (Hausknecht & Stone, 2015). We use a frameskip of four3
DVRL. Lastly, λE = 1 and ns = 5 were used, together with and for the stochastic Atari environments there is a 0.25
RMSProp with a learning rate of 10−4 for both approaches. chance of repeating the current action for a second time at
each transition.
The main difficulty in Mountain Hike is to correctly esti-
mate the current position. Consequently, the achieved return DVRL used 15 particles and we set ng = 50 for both agents.
reflects the capability of the network to do so. DVRL out- The dimension of h was 256 for both architectures, as was
performs RNN based policies, especially for higher levels the dimension of z. Larger latent states decreased the perfor-
of observation noise σo (Figure 4). In Figure 3 we compare mance for the RNN encoder. Lastly, λE = 0.1 and ns = 5
the different trajectories for RNN and DVRL encoders for was used with a learning rate of 10−4 for RNN and 2 · 10−4
the same noise, i.e. RNN
s,t = s,t
DVRL
and RNN DVRL
o,t = o,t for all for DVRL, selected out of a set of 6 different rates based on
t and σo = 3. DVRL is better able to follow the mountain the results on ChopperCommand.
ridge, indicating that its inference based history aggregation
Table 1 shows the results for the more challenging stochastic,
is superior to a largely memory/heuristics based one.
flickering environments. Results for the deterministic envi-
The example in Figure 3 is representative but selected for ronments, including returns reported for DRQN and ADRQN,
clarity: The shown trajectories have ∆J(σo = 3) = 20.7 can be found in Appendix A. DVRL significantly outper-
¯ o = 3) = 11.43 (see
compared to an average value of ∆J(σ forms the RNN-based policy on five out of ten games and
Figure 4). narrowly underperforms significantly on only one. This
shows that DVRL is viable for high dimensional observation
spaces with complex environmental models.
−100 Table 1: Returns on stochastic and flickering Atari environments,
averaged over 5 random seeds. Bold numbers indicate statistical
significance at the 5% level. Out of ten games, DVRL significantly
Return J¯
DVRL
−200 RNN
5
σo = 0
σo = 1.5 Env DVRL (±std) RNN(±std)
0 1 2 3
−250 σo = 3 σo
Pong 18.17(±2.67) 6.33(±3.03)
0.0 0.5 1.0 1.5 2.0 2.5 Chopper 6602(±449) 5150(±488)
Frames ×107 MsPacman 2221(±199) 2312(±358)
Centipede 4240(±116) 4395(±224)
Figure 4: Returns achieved in Mountain Hike. Solid lines: DVRL. BeamRider 1663(±183) 1801(±65)
Dashed lines: RNN. Colour: Noise levels. Inset: Difference in Frostbite 297(±7.85) 254(±0.45)
performance between RNN and DVRL for same level of noise: Bowling 29.53(±0.23) 30.04(±0.18)
¯ o ) = J(DVRL,
∆J(σ ¯ ¯
σo ) − J(RNN, σo ). DVRL achieves slighly
IceHockey −4.88(±0.17) −7.10(±0.60)
higher returns for the fully observable case and, crucially, its perfor-
mance deteriorates more slowly for increasing observation noise, DDunk −5.95(±1.25) −15.88(±0.34)
showing the advantage of DVRL’s inference computations in en- Asteroids 1539(±73) 1545(±51)
coding the history in the presence of observation noise.
8000
1 Particle DVRL DVRL
6000 3 Particles No ELBO RNN
10 Particles 6000 No joint optim ng = 5
30 Particles
6000 ng = 50
ng = 150
Return
Return
Return
4000 4000
4000
0 2 4 0 2 4 0 2 4
Frames ×107 Frames ×107 Frames ×107
(a) Influence of the particle number on per- (b) Performance of the full DVRL algorithm (c) Influence of the maximum backpropaga-
formance for DVRL. Only using one particle compared to setting λE = 0 (”No ELBO”) tion length ng on performance. Note that
is not sufficient to encode enough informa- or not backpropagating the policy gradients RNN suffers most from very short lengths.
tion in the latent state. through the encoder (”No joint optim”). This is consistent with our conjecture that
RNN relies mostly on memory, not inference.
Bellemare, Marc G. Count-based frequency estimation with Kaelbling, Leslie Pack, Littman, Michael L, and Cassandra,
bounded memory. In Twenty-Fourth International Joint Anthony R. Planning and acting in partially observable
Conference on Artificial Intelligence, 2015. stochastic domains. Artificial intelligence, 101(1), 1998.
Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowl- Karkus, Peter, Hsu, David, and Lee, Wee Sun. Qmdp-net:
ing, Michael. The arcade learning environment: An eval- Deep learning for planning under partial observability. In
uation platform for general agents. Journal of Artificial Advances in Neural Information Processing Systems, pp.
Intelligence Research, 47:253–279, 2013. 4697–4707, 2017.
Katt, Sammie, Oliehoek, Frans A, and Amato, Christopher.
Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan.
Learning in pomdps with monte carlo tree search. In
Importance weighted autoencoders. In ICLR, 2016.
International Conference on Machine Learning, 2017.
Chung, Junyoung, Kastner, Kyle, Dinh, Laurent, Goel,
Kingma, Diederik P and Welling, Max. Auto-encoding
Kratarth, Courville, Aaron C, and Bengio, Yoshua. A
variational Bayes. In ICLR, 2014.
recurrent latent variable model for sequential data. In Ad-
vances in neural information processing systems, 2015. Lample, Guillaume and Chaplot, Devendra Singh. Playing
fps games with deep reinforcement learning. In AAAI, pp.
Coquelin, Pierre-Arnaud, Deguest, Romain, and Munos, 2140–2146, 2017.
Rémi. Particle filter-based policy gradient in pomdps. In
NIPS, 2009. Le, Tuan Anh, Igl, Maximilian, Jin, Tom, Rainforth, Tom,
and Wood, Frank. Auto-encoding sequential Monte Carlo.
Deisenroth, Marc Peter and Peters, Jan. Solving nonlinear In ICLR, 2018.
continuous state-action-observation pomdps for mechani-
cal systems with gaussian noise. 2012. Maddison, Chris J, Lawson, John, Tucker, George, Heess,
Nicolas, Norouzi, Mohammad, Mnih, Andriy, Doucet,
Dhariwal, Prafulla, Hesse, Christopher, Klimov, Oleg, Arnaud, and Teh, Yee. Filtering variational objectives.
Nichol, Alex, Plappert, Matthias, Radford, Alec, Schul- In Advances in Neural Information Processing Systems,
man, John, Sidor, Szymon, and Wu, Yuhuai. Openai 2017.
baselines, 2017.
McAllester, David A and Singh, Satinder. Approximate
Doshi-Velez, Finale, Pfau, David, Wood, Frank, and Roy, planning for factored pomdps using belief state simpli-
Nicholas. Bayesian nonparametric methods for partially- fication. In Proceedings of the Fifteenth conference on
observable reinforcement learning. IEEE transactions Uncertainty in artificial intelligence, 1999.
Deep Variational Reinforcement Learning
McCallum, Andrew Kachites and Ballard, Dana. Reinforce- Ross, Stéphane, Pineau, Joelle, Chaib-draa, Brahim, and
ment learning with selective perception and hidden state. Kreitmann, Pierre. A bayesian approach for learning
PhD thesis, University of Rochester. Dept. of Computer and planning in partially observable markov decision
Science, 1996. processes. Journal of Machine Learning Research, 2011.
McCallum, R Andrew. Overcoming incomplete perception Shani, Guy, Brafman, Ronen I, and Shimony, Solomon E.
with utile distinction memory. In Proceedings of the Model-based online learning of pomdps. In European
Tenth International Conference on Machine Learning, pp. Conference on Machine Learning, pp. 353–364. Springer,
190–196, 1993. 2005.
Messias, João V and Whiteson, Shimon. Dynamic-depth Silver, David and Veness, Joel. Monte-carlo planning in
context tree weighting. In Advances in Neural Informa- large pomdps. In Advances in neural information pro-
tion Processing Systems, pp. 3330–3339, 2017. cessing systems, pp. 2164–2172, 2010.
A. Experiments
A.1. Implementation Details
In our implementation, the transition and proposal distributions pθ (zt |ht−1 , at−1 ) and qθ (zt |ht−1 , at−1 , ot ) are multivariate
normal distributions over zt whose mean and diagonal variance are determined by neural networks. For image data, the
decoder pθ (ot |zt , at−1 ) is a multivariate independent Bernoulli distribution whose parameters are again determined by a
neural network. For real-valued vectors we use a normal distribution.
When several inputs are passed to a neural network, they are concatenated to one vector. ReLUs are used as nonlinearities
between all layers. Hidden layers are, if not otherwise stated, all of the same dimension as h. Batch normalization was used
between layers for experiments on Atari but not on Mountain Hike as they significantly hurt performance. All RNNs are
GRUs.
Encoding functions ϕo , ϕa and ϕz are used to encode single observations, actions and latent states z before they are passed
into other networks.
To encode visual observations, we use the the same convolutional network as proposed by Mnih et al. (2015), but with only
32 instead of 64 channels in the final layer. The transposed convolutional network of the decoder has the reversed structure.
The decoder is preceeded by an additional fully connected layer which outputs the required dimension (1568 for Atari’s
84 × 84 observations).
For observations in R2 we used two fully connected layers of size 64 as encoder. As decoder we used the same structure as
for pθ (z| . . . ) and qφ (z| . . . ) which are all three normal distributions: One joint fully connected layer and two separated
fully connected heads, one for the mean, one for the variance. The output of the variance layer is passed through a softplus
layer to force positivity.
Actions are encoded using one fully connected layer of size 128 for Atari and size 64 for Mountain Hike. Lastly, z is
encoded before being passed into networks by one fully connected layer of the same size as h.
The policy is one fully connected layer whose size is determined by the actions space, i.e. up to 18 outputs with softmax for
Atari and only 2 outputs for the learned mean for Mountain Hike, together with a learned variance. The value function is
one fully connected layer of size 1.
A2C used ne = 16 parallel environments and ns = 5-step learning for a total batch size of 80. Hyperparameters were tuned
on Chopper Command. The learning rate of both DVRL and action-specific deep recurrent A2C network (ADR - A 2 C) was
independently tuned on the set of values {3 × 10−5 , 1 × 10−4 , 2 × 10−4 , 3 × 10−4 , 6 × 10−4 , 9 × 10−4 } with 2 × 10−4
being chosen for DVRL on Atari and 1 × 10−4 for DVRL on MountainHike and RNN on both environments. Without further
tuning, we set λH = 0.01 and λV = 0.5 as is commonly used.
As optimizer we use RMSProp with α = 0.99. We clip gradients at a value of 0.5. The discount factor of the control
problem is set to γ = 0.99 and lastly, we use ’orthogonal’ initialization for the network weights.
The source code will be release in the future.
1800 2000 30
1600 28
1500
Return
Return
Return
1400
26
1200 1000
24
1000 DVRL DVRL DVRL
RNN 500 RNN 22 RNN
800
0 2 4 0 2 4 0 2 4
Frames ×107 Frames ×107 ×107
Frames
0
4500
6000
−5
4000
Return
Return
Return
4000 −10
3500
0 2 4 0 2 4 0 2 4
Frames ×107 Frames ×107 Frames ×107
−4
300
−6 2000
250
Return
Return
Return
−8 1500
200
−10 1000
DVRL DVRL DVRL
RNN RNN RNN
0 2 4 0 2 4 0 2 4
Frames ×107 Frames ×107 Frames ×107
20
10
Return
−10
DVRL
RNN
−20
0 2 4
Frames ×107
(j) Pong
Figure 6: Training curves on the full set of evaluated Atari games, in the case of flickering and deterministic environments.
Deep Variational Reinforcement Learning
2000
30
1600
1500 28
Return
Return
Return
1400
26
1000
1200 24
DVRL 500 DVRL DVRL
RNN RNN 22 RNN
1000
0 2 4 0 2 4 0 2 4
Frames ×107 Frames ×107 ×107
Frames
4500 −5
6000
4000
Return
Return
Return
−10
3500 4000
3000 −15
2000
DVRL DVRL DVRL
2500 RNN RNN RNN
0 2 4 0 2 4 0 2 4
Frames ×107 Frames ×107 Frames ×107
2500
−6
250
2000
Return
Return
Return
−8
200 1500
−10 1000
DVRL DVRL DVRL
150 RNN RNN RNN
0 2 4 0 2 4 0 2 4
Frames ×107 Frames ×107 Frames ×107
20
10
Return
−10
DVRL
RNN
−20
0 2 4
Frames ×107
(j) Pong
Figure 7: Training curves on the full set of evaluated Atari games, in the case of flickering and stochastic environments.
Deep Variational Reinforcement Learning
Table 2: Final results on deterministic and flickering Atari environments, averaged over 5 random seeds. Bold numbers indicate statistical
significance at the 5% level when comparing DVRL and RNN. The values for DRQN and ADRQN are taken from the respective papers.
B. Algorithms
Algorithm 1 details the recurrent (belief) state computation (i.e. history encoder) for DVRL. Algorithm 2 details the recurrent
state computation for RNN. Algorithm 3 describes the overall training algorithm that either uses one or the other to aggregate
the history. Despite looking complicated, it is just a very detailed implementation of n-step A2C with the additional changes:
Inclusion of LELBO and inclusing of the option to not delete the computation graph to allow longer backprop in n-step A2C.
Results for also using the reconstruction loss LENC for the RNN based encoder aren’t shown in the paper as they reliably
performed worse than RNN without reconstruction loss.
Deep Variational Reinforcement Learning
(a) ChopperCommand
(b) Pong
(c) Bowling
(d) Centipede
(e) MsPacman
(f) BeamRider
Figure 8: Reconstructions and predictions using the learned generative model for several Atari games. First column: Current obseration
(potentially blank). Second column: Encoded and decoded reconstruction of the current observation. Columns 3 to 7: Predicted
observations using the learned generative model for timesteps dt ∈ {0, 1, 2, 3, 10, 30} into the future.
Deep Variational Reinforcement Learning