
Policy Gradient in Partially Observable Environments:

Approximation and Convergence


Kamyar Azizzadenesheli, Yisong Yue, Anima Anandkumar
Department of Computing and Mathematical Sciences

California Institute of Technology, Pasadena


{kazizzad,yyue,anima}@caltech.edu

Abstract
Policy gradient is a generic and flexible reinforcement learning approach that generally enjoys
simplicity in analysis, implementation, and deployment. In the last few decades, this approach
has been extensively advanced for fully observable environments. In this paper, we generalize
a variety of these advances to partially observable settings, and similar to the fully observable
case, we keep our focus on the class of Markovian policies. We propose a series of technical
tools, including a novel notion of advantage function, to develop policy gradient algorithms and
study their convergence properties in such environments. Deploying these tools, we generalize
a variety of existing theoretical guarantees, such as policy gradient and convergence theorems,
to partially observable domains; these results can also be carried over to further settings of interest.
This study also sheds light on the understanding of policy gradient approaches in real-world
applications which tend to be partially observable.

1 Introduction
Reinforcement learning (RL) is the study of sequential decision making under uncertainty, with
vast applications in real-world problems such as robotics, ad-allocation, recommendation, and
autonomous vehicle systems. In RL, a decision-making agent interacts with its surrounding
environment, collects experiences throughout the interaction, and develops its understanding of the
environment. The agent exploits this information to improve its behavior and strategies for further
exploration.1 One of the main problems in RL is the design of efficient algorithms that improve the
agent’s performance in high dimensional environments and maximize a notion of reward.

1 Note that the described setting is one of the primary settings of study in RL.
Recent advances in RL, particularly in deep RL (DRL), have shown major progress in high-
dimensional RL problems. Among the most popularized ones, Abbasi-Yadkori and Szepesvári
[2011] addresses control in linear systems, Mnih et al. [2015] proposes Deep-Q networks and
tackles arcade games [Bellemare et al., 2013], Silver et al. [2016] uses upper confidence bound tree
search [Kocsis and Szepesvári, 2006] and tackles board games, and Schulman et al. [2015] employs
policy gradient (PG) [Aleksandrov et al., 1968] and addresses continuous control problems,
especially in MuJoCo [Todorov et al., 2012] environments.
This paper is concerned with PG methods, which are among the prominent methods in high-
dimensional RL. PG methods are gradient optimization-based, and mainly approach RL as a
stochastic optimization problem. These methods usually deploy Monte Carlo sampling to estimate the
gradient updates, resulting in high variance gradient estimations [Rubinstein, 1969]. Recent works
study Markov decision processes (MDP) and exploit the MDP structure to mitigate this high vari-
ance issue [Sutton et al., 2000, Kakade and Langford, 2002, Kakade, 2002, Schulman et al., 2015,
Lillicrap et al., 2015]. These works employ value-based methods, and provide low variance PG
methods, mainly in infinite horizon discounted reward MDPs, and pave the way to guaranteeing
monotonic improvements in the performance [Kakade and Langford, 2002]. However, they do not
offer optimality guarantees.
Understanding the convergence characteristics of PG methods, along with their global optimality
properties and relationship with problem-specific dependencies, is essential for effective algorithm
design in real-world applications. Recently, a study by Fazel et al. [2018] advances the understanding
of the optimization landscape of infinite horizon linear quadratic problems and proposes a set
of algorithms which are guaranteed to converge to the optimal control under a coverage assumption.
Further encouraging studies provide convergence analyses of PG related methods in MDPs, mainly
using fixed-point convergence arguments [Liu et al., 2015, Bhandari and Russo, 2019, Agarwal et al.,
2019, Wang et al., 2019, Abbasi-Yadkori et al., 2019]. These analyses advance the understanding of
PG-related methods in infinite-horizon MDPs.
These recent developments of PG methods, both theoretical and empirical, are confined to
the fully observable setting, i.e., MDPs. However, in many real-world RL applications, e.g., robotics, drone
delivery, and navigation, the underlying states of the environments are hidden, i.e., they are more
appropriately modeled as partially observable MDPs (POMDPs). In POMDPs, the observation process need
not be Markovian, and only incomplete information about the underlying state of the environment is
accessible to the decision-makers.
In this work, we theoretically study PG methods in episodic POMDPs, and analyze their
convergence properties and optimality guarantees. We establish these results for both discounted
and undiscounted reward scenarios. We also develop a series of PG algorithms, agnostic to the
underlying environment model, whether fully observable (MDPs) or partially observable (POMDPs).2 To design
algorithms that are agnostic to the choice of the underlying model, we adopt the class of Markovian
policies [Puterman, 2014], which, under certain regularity conditions, are optimal for MDPs. We
propose a series of technical tools, including a novel notion of advantage function along with value
and Q functions for POMDPs, that we employ to analyze PG methods.
We analyze and generalize a range of MDP methods to the POMDP setting, such as trust
region policy optimization (TRPO) [Schulman et al., 2015], and proximal policy optimization (PPO)
[Schulman et al., 2017]. The resulting generalized TRPO (GTRPO) and generalized PPO (GPPO)
methods are among the few for POMDPs that are computationally feasible. We conclude our study
by showing how the tools developed in this work make it feasible to generalize a variety of existing
advanced MDP-based algorithms to POMDPs. This study sheds light on the understanding of PG
approaches in real-world applications that are inherently partially observable.

Extended motivation, problem setting, and contribution.

In the following, we provide a detailed introduction of the setting of study in this paper, explain
hardness results, and conclude with a detailed explanation of our contributions. Table 1 categorizes
the majority of RL paradigms with respect to their observability, policy class, discounting, and
horizon.3

Table 1: While the majority of prior works study MDPs and Markovian policies under discounted
reward and infinite horizon considerations, we focus our study on episodic problems, whether MDP
or POMDP, Markovian or non-Markovian, and discounted or undiscounted.

    Observability | Policy Class  | Discounting  | Horizon
    MDP           | Markovian     | Discounted   | Infinite
    POMDP         | non-Markovian | Undiscounted | Episodic

2 Note that, since the class of MDPs is a subset of the class of POMDPs, under a proper choice of policy class,
designing algorithms for POMDPs suffices to obtain algorithms agnostic to the underlying model.
3 Note that this categorization is a coarse summary and does not capture all RL paradigms.
Episodic and infinite horizon: Many contemporary applications of RL are cast as episodic and/or
finite horizon problems. Episodic RL refers to settings where learning happens iteratively over
episodes, with each episode comprising running the current policy from an (often fixed) distribution
of initial states, collecting data, and updating the policy. Episodic RL is thus appropriate for
any sequential decision making task that is run in a repeated fashion. Infinite horizon refers to
casting the problem as running each episode for infinitely many time steps. Episodic, finite-horizon
test-bed environments, such as MuJoCo, have played a major role in the recent empirical successes
of algorithm design in RL, despite many of those algorithms being designed for the infinite horizon
setting. While the study of infinite horizon environments is an essential topic of research, we dedicate
this paper to episodic environments.
Discounted and undiscounted reward: Another distinction is whether to use discounted or undis-
counted cumulative rewards. In the discounted cumulative reward setting, rewards that are received
earlier in time are preferred to those received later in time. In this setting, discounted cumu-
lative rewards are accumulations of rewards that are (typically exponentially) discounted through
time. In contrast, in the undiscounted cumulative reward settings, RL agents are indifferent to
when a unit of reward is received, as long as it is received. In repetitive related tasks with a clear
completion goal (e.g., cooking), we might be interested in the episodic undiscounted reward with
fixed horizons. In such tasks, we might be concerned with task accomplishment in a fixed period of
time. In contrast, in long-horizon tasks with accumulated goals, such as vacuuming, we might rather
have a given physical area cleaned earlier than later, without specifically specifying a task
horizon. Discounted rewards also favor completing tasks earlier rather than later, owing to putting
lower weight on rewards in future time steps. While previous theoretical developments of PG
methods [Schulman et al., 2015, Lillicrap et al., 2015] are mainly dedicated to the discounted re-
ward setting (also infinite horizon), we extend our study to both the discounted and undiscounted
reward settings.
Fully and partially observable: While PG can in principle be applied very generically, as men-
tioned above, previous research on advanced PG methods has largely focused on (fully observable)
MDPs. Many real-world learning and decision-making tasks, however, are inherently partially ob-
servable, where only an incomplete representation of the state of the environment is accessible. Owing to
the inability to directly observe the (latent) underlying states, learning in POMDP environments
poses significant challenges and also adds computational complexity to the resulting policy optimiza-
tion problem. Moreover, it is known that applying MDP based methods on POMDPs can result
in policies with arbitrarily bad expected returns [Williams and Singh, 1999, Azizzadenesheli et al.,
2017]. In this work, we study environment agnostic PG methods and provide convergence as well
as monotonic improvement guarantees for both MDPs and POMDPs.
Markovian and non-Markovian policies: Under a few conditions [Puterman, 2014], for any given
MDP, there exists an optimal policy which is deterministic and Markovian (that maps the agent’s
immediate or current observation of the environment to actions). In contrast, when dealing with
POMDPs, we might be interested in the class of non-Markovian and history-dependent policies
(that map the state-action trajectory history to distributions over actions). However, tackling
non-Markovian and history-dependent policies can be computationally undecidable [Madani et al.,
1999] for the infinite horizon, or PSPACE-Complete [Papadimitriou and Tsitsiklis, 1987] in the finite-
horizon POMDPs. To avoid undesirable computational burdens, we focus on Markovian policies.
Indeed, many prior works study Markovian policies for POMDPs [Baxter and Bartlett, 2001,
Littman, 1994, Baxter and Bartlett, 2000, Azizzadenesheli et al., 2016] where, in general, the opti-
mal policies are stochastic [Littman, 1994, Singh et al., 1994, Montufar et al., 2015]. Acknowledging
the computational complexity of Markovian policies [Vlassis et al., 2012], this line of work highlights
the broad interest, importance, and applicability of Markovian policies.
We are further interested in algorithms that are agnostic to the underlying model, as restricting
to such algorithm designs can also avoid certain undesirable computational burdens. We do so
by choosing Markovian policies, which are generally optimal for MDPs. This choice of the policy
class allows us to use that same class of functions to represent policies in POMDPs. We can thus
carry over PG results from MDPs directly to their POMDP counterparts. An additional practical
advantage is that it is straightforward to re-purpose well-developed software implementations of
MDP-based PG algorithms for the POMDP setting.
One could alternatively consider history-dependent policies, which is a richer function class than
Markovian policies. However, utilizing history-dependent policies requires turning a given POMDP
problem into a potentially non-stationary MDP with states as concatenations of the historical data.
As such, a thorough theoretical treatment of history-dependent policies for POMDPs would likely
require leveraging theoretical analyses for non-stationary MDPs.
Equivalent policy classes and the role of the observation boundary: The expressiveness of general
policy classes is mainly entangled with the definition of the observation. Under some regularity
conditions, for any given class of history-dependent policies on a POMDP, there exists a class of
Markovian policies on a new POMDP such that: (1) the observations of the new POMDP are the
(typically) discrete concatenations of the historical data in the former POMDP; and (2) the two
policy classes are equivalent, i.e., the respective policies result in the same behavior (e.g., action
sequence). In other words, instead of making the policies non-Markovian and history-dependent,
we can keep them Markovian and instead enrich the observation space. Similarly, for any limited-
memory policy class (depending on a fixed window of history instead of the whole history, e.g., the
policies in MDPs of order more than one), there is an enrichment of the observation that results in
an equivalent class of Markovian policies. This observation-enrichment viewpoint is known as the
emergentism approach.4 In essence, these distinctions boil down to: “What a Markovian policy is
Markovian with respect to”.
The above reasoning implies that, by considering the class of Markovian policies in episodic
POMDPs, we often do not restrict the generality of the results in this paper. Therefore, despite
focusing on Markovian policies, our results also hold for both the limited-memory and the
history-dependent policy classes, through representing histories as observations.

4 In contrast, the reductionism approach argues for the opposite viewpoint, and views the class of Markovian
policies as a subset of the class of non-Markovian and history-dependent policies, simply by ignoring the historical
data and considering just the immediate/current observation for decision making. While both viewpoints are valid,
they differ in exercising the definition of observations.

Detailed Summary of Contributions. In this paper, we study Markovian policies in episodic
POMDPs and MDPs, with both discounted and undiscounted rewards.
(i) We state the general PG theorems in the POMDP framework. These theorems are based on
well-known results that make no specific modeling assumption on the underlying environments.
We extend the value-based PG theorems for MDPs, such as the deterministic PG [Silver et al., 2014], to
POMDPs. While the extensions are straightforward, we provide them for clarity and completeness.
(ii) We define a novel notion of advantage function for POMDPs and advance a series of MDP-
based theoretical results to POMDPs. In MDPs, the advantage functions are well defined, and
they are functions of one event (i.e., one state-action pair). However, the states in POMDPs
are partially observed, and the classical definition of advantage functions does not directly carry
to partially observable cases. For POMDPs, our advantage function depends on three consecutive
events instead of just one. Utilizing this definition, we generalize the policy improvement guarantees
in MDPs [Kakade and Langford, 2002, Schulman et al., 2015] to POMDPs. Note that conditioning
on three consecutive events for learning in POMDPs matches the results in Azizzadenesheli et al.
[2016], which indicate that the statistics of three consecutive events are sufficient to learn POMDP
dynamics and guarantee order-optimal sample complexity properties of Markovian policies.
(iii) We study how to extend natural policy gradient methods for MDPs to POMDPs. This
derivation leverages our novel notion of advantage function.
(iv) We propose generalized trust region policy optimization (GTRPO) PG methods, as a gen-
eralization of Kakade and Langford [2002], Schulman et al. [2015] to the general classes of episodic
MDPs and POMDPs under both discounted and undiscounted reward settings. GTRPO utilizes the
properties of the newly developed advantage function, and computes a low variance estimation of the PG
updates. We then construct a trust region for the policy search step in GTRPO and show that the
policy updates of GTRPO are guaranteed to monotonically improve the expected return. GTRPO
is among the few methods for POMDPs that are computationally feasible.
(v) We propose generalized proximal policy optimization (GPPO), a generalization of the PPO
algorithm [Schulman et al., 2017] to the general classes of episodic MDPs and POMDPs under both
discounted and undiscounted reward settings. GPPO is an extension of GTRPO, as PPO is an
extension of TRPO, and is known to be computationally more efficient than its predecessor.
We conclude the paper by stating that the machinery developed in this paper can be used to
extend a variety of existing MDP-based methods directly to their general POMDP counterparts.

2 Preliminaries
An episodic stochastic POMDP, M, is a tuple ⟨X, A, Y, P1, T, R, O, γ, x_Terminal⟩ with a set of latent
states X, observations Y, actions A, and their elements denoted as x ∈ X, y ∈ Y, and a ∈ A,
accompanied with a discount factor 0 ≤ γ ≤ 1. Under reasonable assumptions,5 let P1(·) : X → [0, ∞]
represent the probability distribution of the initial latent states, T(·|·, ·) : X × X × A → [0, ∞]
denote the transition kernel on the latent states, O(·|·) : Y × X → [0, ∞] the emission kernel, and
R(·|·, ·) : R × Y × A → [0, ∞] the reward kernel.6 Let the kernel map π(·|·) : A × Y → [0, ∞]
denote a Markovian policy. For simplicity in the notation, we encode the time step h into the state,
observation, and action representations, x, y, a. Consider the following event in the stochastic
process ruled by M and a policy π, in short, a trajectory:

        τ := {(x_1, y_1, a_1, r_1), (x_2, y_2, a_2, r_2), . . . , (x_{|τ|}, y_{|τ|}, a_{|τ|}, r_{|τ|}), x_{|τ|+1}, y_{|τ|+1}},

where |τ| is the random length of the trajectory. In the episodic setting, a policy π interacts with
the environment in episodes as follows (extended definitions follow the description of the interaction
protocol, and a minimal code sketch of the protocol is given after the list):

1. The policy starts at an initial state sampled from the initial state distribution, x1 ∼ P1 .

2. For each time step h ≥ 1, the policy receives observations yh ∼ O(·|xh ).

3. If xh = xT erminal , then |τ | is h − 1, and end the episode.

4. The policy then takes action ah ∼ π(·|yh ), and receives reward rh ∼ R(·|yh , ah ).

5. The environment transitions to a new state xh+1 ∼ T (·|xh , ah ).

6. h ← h + 1, and repeat from Step 2.
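
For concreteness, the following is a minimal Python sketch of this interaction protocol. The sampler
callables P1, O, pi, Rwd, T and the terminal-state test are illustrative stand-ins for the kernels
defined above, not part of the formal model.

def rollout(P1, O, pi, Rwd, T, x_terminal):
    """Simulate one episode of the protocol above. Returns the trajectory tau as a
    list of (x_h, y_h, a_h, r_h) tuples, plus the terminal (x, y) pair."""
    tau = []
    x = P1()                      # Step 1: initial latent state x_1 ~ P1
    while True:
        y = O(x)                  # Step 2: emit observation y_h ~ O(.|x_h)
        if x == x_terminal:       # Step 3: episode ends, |tau| = h - 1
            return tau, (x, y)
        a = pi(y)                 # Step 4: Markovian policy acts on the observation
        r = Rwd(y, a)             #         and a reward r_h ~ R(.|y_h, a_h) is received
        tau.append((x, y, a, r))
        x = T(x, a)               # Step 5: latent transition x_{h+1} ~ T(.|x_h, a_h)
                                  # Step 6: repeat from Step 2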

Since M is an episodic environment, P(|τ| < ∞) = 1. As the definition of episodic environments
indicates, x_Terminal is reachable in finite random time |τ| + 1, i.e., |τ| + 1 < ∞ almost surely.
Correspondingly, define y_Terminal := y_{|τ|+1} as the terminal observation. For any given pair
(y, a) ∈ Y × A, R(y, a) := E[R(y, a)] denotes a version of the conditional expectation of the reward,
where the expectation is with respect to the randomness of the reward given y and a. We assume
R(·, ·) is a finite function. Fig. 1 depicts a Markovian policy acting on a POMDP.

Figure 1: POMDP under a Markovian policy (graphical model over latent states x_h, rewards r_h,
observations y_h, and actions a_h).

5 We assume that the quantities of interest are measurable and Lebesgue-Stieltjes integrable with respect to
their corresponding kernels on their respective Borel measurable spaces, among other regularity assumptions, such
as compactness of the action sets, completeness and separability of the X, Y spaces, as well as continuity of value
related functions. For simplicity, wherever possible, we also omit the proper measure theoretic definitions of the
stochastic processes as well as the generality of function classes, unless necessary.
6 In an alternative definition, the reward kernel follows the latent states, i.e., R(·|·, ·) : R × X × A → [0, ∞].
7 For simplicity in the derivations, we assume that the value related functions in this paper are continuous in Θ.
We consider a set P of parameterized Markovian policies πθ, where θ ∈ Θ, with Θ an appropriate
compact Euclidean space, i.e., a compact subset of a finite Cartesian product of copies of the reals.7 Each
policy πθ in P is also assumed to be continuously differentiable with respect to the parameter θ ∈ Θ,
with finite gradients (the space W^{1,∞}). For any given parameter θ, the construction of M results
in a set of trajectories, such that τ ∈ Υ and (Υ, F, Pθ ) is a complete measure space with F, the
corresponding σ-algebra. For simplicity, let ∀τ ∈ Υ, f (τ ; θ) denote the probability distribution of
trajectories under policy πθ , i.e.,
        f(τ; θ)dτ := dPθ = P1(x_1) ∏_{h=1}^{|τ|} O(y_h | x_h) πθ(a_h | y_h) R(r_h | x_h, a_h) T(x_{h+1} | x_h, a_h) dτ.

For a trajectory τ, the random variable R(τ) represents the cumulative γ-discounted reward in τ,
i.e., R(τ) = ∑_{h=1}^{|τ|} γ^{h−1} r_h. Therefore, the unnormalized expected cumulative return of πθ is:

        η(θ) := E_{τ|θ}[R(τ)] = ∫_τ f(τ; θ) R(τ) dτ   −→   θ* ∈ arg max_{θ∈Θ} η(θ).        (1)

We use E_{τ|θ}, E_{τ∼f(·;θ)}, and E_{πθ} interchangeably for the expectation with respect to dPθ. Let θ* denote
a parameter that maximizes η(θ), i.e., θ* ∈ arg max_{θ∈Θ} η(θ), and π* = π(θ*) an optimal policy. The
optimization problem of finding an optimal policy in the policy class is, in general, non-convex, and
hill climbing methods such as gradient ascent may not necessarily converge to the optimal behavior.

3 Policy Gradient
PG is a generic RL approach which directly optimizes for policies without requiring explicit con-
struction of the environment model or the value functions. Owing to its simplicity, PG is used in
a variety of practical and theoretical settings. Theoretical analyses of this approach, known as PG
theorems, indicate that there is no need to have an explicit knowledge of the environment model, M ,
to compute the gradient of η(θ), Eq. 1, with respect to θ as long as we can sample returns from the
environment. These theorems elaborate that a Monte Carlo sampling of returns suffices for the gra-
dient computation [Rubinstein, 1969, Williams, 1992, Baxter and Bartlett, 2001].8 In this section,
we re-derive these theorems, but this time explicitly for POMDPs under Markovian policies, using
two approaches: (i) through construction of the score functions, and (ii) through importance sampling,
which is based on the change of measures and score functions. We emphasize the importance
sampling approach since it is more relevant to the later parts of this paper.

8 In this paper we employ the measure-theoretic definition of the gradient, and consider a large enough parameter
space to avoid boundary related discussions.
In Subsections 3.1, 3.2, and 3.3, we extend the theoretical analyses of standard PG theorems
to the partially observed settings. These extensions are straightforward, and we present them for
completeness to set up our main theoretical contributions in later sections.

3.1 Score Function


As mentioned before, the gradient of the expected return can be estimated without explicit knowl-
edge of the model dynamics and is agnostic to the realization of the underlying model. We re-
establish this statement for POMDPs using score functions. Employing the dominated convergence
theorem and the definition of η(θ) in Eq. 1, we have,
        ∇θ η(θ) = ∇θ ∫_τ f(τ; θ) R(τ) dτ
                = ∫_τ ∇θ f(τ; θ) R(τ) dτ
                = ∫_τ f(τ; θ) ∇θ log(f(τ; θ)) R(τ) dτ,

where for a trajectory τ, the score function is the gradient of log(f(τ; θ)) with respect to θ, i.e.,

        ∇θ log(f(τ; θ)) = ∇θ log( P1(x_1) ∏_{h=1}^{|τ|} O(y_h|x_h) R(r_h|x_h, a_h) T(x_{h+1}|x_h, a_h) ) + ∇θ log( ∏_{h=1}^{|τ|} πθ(a_h|y_h) ).

Since the first part of the right hand side of the above equation is independent of θ, its derivative
with respect to θ is zero. Therefore, we have,

        ∇θ η(θ) = ∫_{τ∈Υ} f(τ; θ) ∇θ log( ∏_{h=1}^{|τ|} πθ(a_h|y_h) ) R(τ) dτ.

Following πθ, generate m trajectories {τ^1, . . . , τ^m} with elements (x_h^t, y_h^t, a_h^t, r_h^t) and terminals,
∀h ∈ {1, . . . , |τ^t|} and ∀t ∈ {1, . . . , m}. Deploying the Monte Carlo sampling approximation of the
integration, the empirical mean of the return is η̂(θ) = (1/m) ∑_{t=1}^{m} R(τ^t), and the estimation of the
gradient is as follows,
        ∇θ η̂(θ) = (1/m) ∑_{t=1}^{m} ∇θ log( ∏_{h=1}^{|τ^t|} πθ(a_h^t | y_h^t) ) R(τ^t).        (2)

This estimation is an unbiased and consistent estimate of the gradient, and it does not require
knowledge of the underlying dynamics except through samples of the cumulative reward R(τ).
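
To make the estimator in Eq. 2 concrete, the following minimal Python sketch computes the Monte
Carlo gradient estimate from m sampled trajectories. It assumes a tabular softmax policy over
discrete observations and actions; the helper names are illustrative, not part of the paper.

import numpy as np

def softmax_policy(theta, y):
    """pi_theta(.|y) for a tabular softmax parameterization theta[y, a]."""
    logits = theta[y]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, y, a):
    """Gradient of log pi_theta(a|y) with respect to theta (same shape as theta)."""
    g = np.zeros_like(theta)
    g[y] = -softmax_policy(theta, y)
    g[y, a] += 1.0
    return g

def score_function_gradient(theta, trajectories, gamma=1.0):
    """Eq. 2: (1/m) sum_t grad_theta log(prod_h pi(a_h|y_h)) * R(tau^t),
    where each trajectory is a list of (y_h, a_h, r_h) tuples."""
    grad = np.zeros_like(theta)
    for traj in trajectories:
        ret = sum(gamma ** h * r for h, (_, _, r) in enumerate(traj))   # R(tau)
        score = sum(grad_log_pi(theta, y, a) for (y, a, _) in traj)     # score function
        grad += score * ret
    return grad / len(trajectories)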

3.2 Importance Weighting


We derive the policy gradient theorem using techniques following the change of measures. We esti-
mate η(θ′), the expected return of policy πθ′, using trajectories induced by πθ, assuming the absolute
continuity of the probability measure imposed by θ′ with respect to that of θ. Again, using the
definition of η(θ) in Eq. 1, we have:
        η(θ′) = E_{τ|θ′}[R(τ)] = ∫_{τ∈Υ} f(τ; θ′) R(τ) dτ
              = ∫_{τ∈Υ} f(τ; θ) [ f(τ; θ′)/f(τ; θ) ] R(τ) dτ = E_{τ|θ}[ (f(τ; θ′)/f(τ; θ)) R(τ) ].        (3)
Again, using the dominated convergence theorem, we compute the gradient of η(θ′) with respect to θ′:

        ∇θ′ η(θ′) = E_{τ|θ}[ (∇θ′ f(τ; θ′)/f(τ; θ)) R(τ) ] = E_{τ|θ}[ (f(τ; θ′)/f(τ; θ)) ∇θ′ log(f(τ; θ′)) R(τ) ].
If we take the limit θ′ → θ, i.e., evaluate this gradient at θ′ = θ, we have:

        ∇θ′ η(θ′)|_{θ′=θ} = E_{τ|θ}[ ∇θ log(f(τ; θ)) R(τ) ].        (4)

Following a similar argument as in Subsection 3.1, for a trajectory τ we have ∇θ log(f(τ; θ)) =
∇θ log( ∏_{h=1}^{|τ|} πθ(a_h|y_h) ). Given m trajectories {τ^1, . . . , τ^m}, with elements (x_h^t, y_h^t, a_h^t, r_h^t) and
terminals, ∀h ∈ {1, . . . , |τ^t|} and ∀t ∈ {1, . . . , m}, generated following πθ, we estimate the gradient
∇θ η(θ) as follows, which is the same expression as we derived using the score function,
        ∇θ η̂(θ) = (1/m) ∑_{t=1}^{m} ∇θ log( ∏_{h=1}^{|τ^t|} πθ(a_h^t | y_h^t) ) R(τ^t).        (5)

The gradient estimation using plain Monte Carlo samples incorporates samples of R(τ^t), which is a long
sum of random rewards. Consequently, this estimation may result in a high variance estimate and,
in practice, might not be an approach of interest. As is evident, the PG theorem is remarkably general
and, in expectation, is almost independent of the underlying stochastic process. In the next subsection,
we describe how one can exploit the structure in the underlying POMDP environment to develop a
PG method that mitigates the high variance shortcoming of direct Monte Carlo sampling.
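
As a small illustration of the change of measures in Eq. 3, the sketch below estimates η(θ′) from
trajectories collected under πθ. It assumes callables returning log policy probabilities; the
environment terms P1, O, R, T cancel in the ratio f(τ; θ′)/f(τ; θ), so only the policy terms are needed.

import numpy as np

def importance_weighted_return(log_pi_new, log_pi_old, trajectories, gamma=1.0):
    """Estimate eta(theta') from trajectories sampled under pi_theta (Eq. 3).
    log_pi_new(y, a) and log_pi_old(y, a) return log policy probabilities."""
    total = 0.0
    for traj in trajectories:                       # traj: list of (y_h, a_h, r_h)
        log_ratio = sum(log_pi_new(y, a) - log_pi_old(y, a) for (y, a, _) in traj)
        ret = sum(gamma ** h * r for h, (_, _, r) in enumerate(traj))      # R(tau)
        total += np.exp(log_ratio) * ret
    return total / len(trajectories)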

3.3 Value-based Policy Gradient Theorem


To mitigate the effect of high variance gradient estimation, one might be interested in exploiting the
environment structure and deploying value-based methods. Let τ_{h′..h} denote the events in τ from
the time step h′ up to the time step h, including the event at h. At each time step h, for any pair
(x_h, a_h) ∈ X × A which is also feasible under M, define Q^θ(a_h, x_h) and V^θ(x_h) as the standard
Q and value functions following the policy induced by θ, i.e.,
        V^θ(x_h) := E_{πθ}[ ∑_{h′=h}^{|τ|} γ^{h′−h} r_{h′} | x_h ],        Q^θ(a_h, x_h) := E_{πθ}[ ∑_{h′=h}^{|τ|} γ^{h′−h} r_{h′} | x_h, a_h ].

To ease the reading, we drop the necessary conditioning on the event h ≤ |τ | in our notation,
while keeping its importance in mind. For a given policy π ∈ P induced by θ, define µθ as its
representative policy or action,9 also known as a randomized decision rule [Puterman, 2014], on the
hidden states, i.e., for any (x_h, a_h) pair and all h,
        µθ(a_h | x_h) := ∫_{y_h} O(y_h | x_h) πθ(a_h | y_h) dy_h.

Note that applying the above bounded linear operator to πθ results in µθ(x_h), which is the action
probability distribution at state x_h. For any time step h and x_h ∈ X, consider the space of all action
probability distributions generated by µθ(x_h) when sweeping θ over Θ, i.e., the range of the mentioned
linear operator when applied to P. Let a_h denote a point in this space. For any time step h, we
define Q̃^θ, the Q value of committing a_h at state x_h and following πθ afterward. More formally, at
a time step h, state x_h, and θ′ ∈ Θ, consider a_h = µθ′(x_h),
        Q̃^θ(a_h, x_h) := ∫_{a_h} Q^θ(a_h, x_h) µθ′(a_h | x_h) da_h = ∫_{y_h, a_h} Q^θ(a_h, x_h) O(y_h | x_h) πθ′(a_h | y_h) dy_h da_h.
9 In this paper, we assume that representative policies live in Borel spaces on finite dimensional Euclidean spaces.

If we choose a_h = µθ(x_h), then V^θ(x_h) := Q̃^θ(a_h, x_h)|_{a_h=µθ(x_h)}. Similarly, we also define the
corresponding expected reward and the transition kernel under a_h = µθ′(x_h) as follows,10

        r(a_h, x_h) := ∫_{y_h, a_h} R(a_h, y_h) πθ′(a_h | y_h) O(y_h | x_h) dy_h da_h,

        T(x_{h+1} | a_h, x_h) := ∫_{a_h} T(x_{h+1} | a_h, x_h) µθ′(a_h | x_h) da_h.

Therefore, for any a_h, the dynamic program on the underlying latent states is,

        Q̃^θ(a_h, x_h) = r(a_h, x_h) + γ ∫_{x_{h+1}} T(x_{h+1} | a_h, x_h) V^θ(x_{h+1}) dx_{h+1}.

Now we extend the main PG theorems in Sutton et al. [2000] and Silver et al. [2014] to episodic
POMDPs, derived by simple modifications of the proofs of their predecessors.

Theorem 3.1 (Policy Gradient). For a given policy πθ on a POMDP M;

        ∇θ η(θ) = ∫_{τ∈Υ} ∑_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ) ∇θ πθ(a_h | y_h) Q^θ(a_h, x_h) dτ,

and:

        ∇θ η(θ) = ∫_{τ∈Υ} ∑_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}; θ) ∇θ log(πθ(a_h | y_h)) Q^θ(a_h, x_h) dτ.

Proof of the Thm 3.1 in Subsection A.1.

Theorem 3.2 (Deterministic Policy Gradient). For a given policy µθ on a POMDP M;

        ∇θ η(θ) = ∫_{τ∈Υ} ∑_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h; θ) ∇θ µθ(x_h)^⊤ ∇_{a_h} Q̃^θ(a_h, x_h)|_{a_h=µθ(x_h)} dτ.

Proof of the Thm. 3.2 in Subsection A.2.

Remark 3.1. By setting the emission kernel O to the identity map, the Thms. 3.1 and 3.2 reduce
to their MDP cases.

These results show how knowledge of the underlying state can help to compute the gradient
∇θ η(θ). Of course, in practice, under partial observability we do not have access
to the latent state, unless we consider the whole history as our current observation. We provided
the above mentioned results for the sake of completeness, and later we utilize their statements,
proofs, and derivations.
10 For the MDP versions of such definitions, please refer to Puterman [2014], e.g., Eq. 2.3.2.

3.4 Advantage-based Policy Gradient Theorem
In the following, we denote by πθ the current policy, which is the policy that we run to collect trajec-
tories, and by πθ′ a new policy whose performance we are interested in evaluating. Generally, when
a class of Markovian policies is considered for POMDPs, neither Q nor V functions on observation-
action pairs are well-defined as they are for MDPs through the Bellman equation. In the following,
we define two new quantities, similar to the Q and V of MDPs, for POMDPs. We adopt the same
notation as Q and V for these two new quantities.
Definition 3.1 (V and Q values in POMDPs). For a given POMDP M, the V and Q functions
of a Markovian policy π are as follows:

        V_π(y_h, a_{h−1}, y_{h−1}) := E_π[ ∑_{h′=h}^{|τ|} γ^{h′−h} r_{h′} | y_h, y_{h−1}, a_{h−1} ],

        Q_π(y_{h+1}, a_h, y_h) := E_π[ ∑_{h′=h}^{|τ|} γ^{h′−h} r_{h′} | y_{h+1}, y_h, a_h ].        (6)

For h = 1 we relax the notation on y_{h−1} in V_π and simply denote it as V_π(y_1, −, −), where the symbol
− is a member of neither the set Y nor the set A.
Now consider the special case where the observation is equal to the latent state, which reduces the
problem to an MDP. In this case we denote the value and Q functions as V_π(x_h) = V^π(x_h), and
Q_π(a_h, x_h) = Q^π(a_h, x_h). For a given MDP, the advantage function [Baird III, 1993] of a policy π
is defined as follows, for all h, x_h, and a_h,

        A_π(x_h, a_h) := Q_π(a_h, x_h) − V_π(x_h).

Similarly, such an advantage function is not well-defined on observations in POMDPs. In the following,
we define a new quantity for POMDPs and again call it the advantage function. We also adopt the A
notation for the advantage function of POMDPs.
Definition 3.2 (Advantage function in POMDPs). For a given POMDP, the advantage function
of a Markovian policy π on any tuple {y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}} is defined as follows,

        A_π(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) = Q_π(y_{h+1}, a_h, y_h) − V_π(y_h, a_{h−1}, y_{h−1}).

Similar to the definition of the value function V, for h = 1 we relax the notation on y_{h−1} in A_π and
simply denote it as A_π(y_2, a_1, y_1, −, −).
The choice of these value, Q, and A functions becomes clear in the following.
Lemma 3.1. The difference between the expected return of a policy πθ′ and that of a policy πθ, denoted
as the improvement in the expected return, is as follows,

        η(πθ′) − η(πθ) = E_{τ∼πθ′} ∑_{h=1}^{|τ|} γ^{h−1} A_{πθ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}).

Proof of the Lem. 3.1 in Subsection A.3.

Remark 3.2. Lem. 3.1 is the POMDP generalization of Lemma 6.1 in Kakade and Langford
[2002], which is for MDPs. This lemma also recovers Lemma 6.1 in Kakade and Langford [2002]
if the environment is in fact an MDP.

For any time step h, state x_h, and θ′, θ ∈ Θ, with a_h = µθ′(x_h), we further define,

        Ã_{π(θ)}(x_h, a_h)|_{a_h=µθ′(x_h)} := V^π(x_h) − Q̃^π(a_h, x_h)|_{a_h=µθ′(x_h)}.

Having Ã defined, we have,

Corollary 3.1. Given a POMDP M, two policy parameters θ′, θ ∈ Θ, and the action representation
policy of θ′, i.e., a_h = µθ′(x_h), we have

        η(πθ′) − η(πθ) = E_{τ∼πθ′} ∑_{h=1}^{|τ|} γ^{h−1} Ã_{π(θ)}(x_h, a_h)|_{a_h=µθ′(x_h)}.

The equality derived in Lem. 3.1 suggests that if we have the advantage function of the current
policy πθ and sampled trajectories from πθ′, we could compute the improvement in the expected
return η(πθ′) − η(πθ). Then we could maximize this quantity over θ′, or potentially, directly maximize
the expected return η(πθ′) without incorporating πθ, to find an optimal policy. But in practice, we
mainly do not have sampled trajectories for all new policies πθ′ to accomplish the maximization
task. Instead, we have sampled trajectories from the current policy, πθ, and are bound to make the best use
of them. To address this gap, we define the following surrogate objective function,
        L_{πθ}(πθ′) := η(πθ) + E_{τ∼πθ} ∑_{h=1}^{|τ|} γ^{h−1} E_{a′_h∼πθ′(·|y_h)} A_{πθ}(y_{h+1}, a′_h, y_h, a_{h−1}, y_{h−1}).        (7)

One can compute this surrogate objective using sampled trajectories generated by following the
current policy πθ, the advantage function induced by the same policy πθ, and evaluating the advantage
function by choosing actions a′_h according to πθ′. Therefore, we can maximize the surrogate objective
function in Eq. 7 over θ′ just using sampled trajectories and the advantage function of the current policy
πθ. The question is how maximizing or increasing L_{πθ}(πθ′) is related to our quantity of interest
η(πθ′). In the following, we study this relationship. The gradient of L_{πθ}(πθ′) with respect to θ′,
evaluated at θ, is expressed as follows,

        ∇θ′ L_{πθ}(πθ′)|_{θ′=θ} = ∇θ′ ( η(πθ) + E_{τ∼πθ} ∑_{h=1}^{|τ|} γ^{h−1} E_{a′_h∼πθ′(·|y_h)} A_{πθ}(y_{h+1}, a′_h, y_h, a_{h−1}, y_{h−1}) )|_{θ′=θ}
                              = ∫_{τ∈Υ} ∑_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) ( ∫_{a′_h} ∇θ′ πθ′(a′_h | y_h)|_{θ′=θ} A_{πθ}(y_{h+1}, a′_h, y_h, a_{h−1}, y_{h−1}) da′_h ) dτ.        (8)
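
The surrogate objective of Eq. 7 can be estimated from data collected under the current policy
alone. The Python sketch below assumes a finite action set, an advantage estimate adv_fn for
A_{πθ}, and a callable pi_new returning the new policy's action distribution; these names are ours,
and the last step of each trajectory is skipped for simplicity since it lacks y_{h+1} in this data layout.

def surrogate_objective(adv_fn, pi_new, trajectories, eta_current, gamma=1.0):
    """Monte Carlo estimate of L_{pi_theta}(pi_theta') in Eq. 7."""
    total = 0.0
    for traj in trajectories:                      # traj: list of (y_h, a_h, r_h)
        y_prev, a_prev = None, None                # the "-" placeholder at h = 1
        for h in range(len(traj) - 1):
            y, a, _ = traj[h]
            y_next = traj[h + 1][0]
            probs = pi_new(y)                      # E_{a' ~ pi_theta'(.|y_h)}[...]
            inner = sum(p * adv_fn(y_next, a_prime, y, a_prev, y_prev)
                        for a_prime, p in enumerate(probs))
            total += gamma ** h * inner
            y_prev, a_prev = y, a
    return eta_current + total / len(trajectories)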

We later use Eq. 8 for the convergence analysis of PG methods. For any pair of policies, πθ
and πθ′, any time step h, and observation y_h, consider ǎ_h = πθ′(y_h) as the representative action
of πθ′ at observation y_h; we have

        A_{πθ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1}) = ∫_{a_h} A_{πθ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) πθ′(a_h | y_h) da_h.        (9)

Note that we are using the same notation A for both primitive actions ah and representative
actions ǎh .
Theorem 3.3 (Surrogate Deterministic Policy Gradient). Given a POMDP M, and the advantage
function induced by policy πθ, the gradient of the surrogate objective of πθ′, ∇θ′ L_{πθ}(πθ′), with respect
to θ′, evaluated at θ, is as follows:

        ∇θ′ L_{πθ}(πθ′)|_{θ′=θ} = ∫_{τ∈Υ} ∑_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) ∇θ′ πθ′(y_h)|^⊤_{θ′=θ} ∇_{ǎ_h} A_{πθ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})|_{ǎ_h=πθ(y_h)} dτ.

Proof of the Thm. 3.3 in Subsection A.4.


Note that when the problem is fully observable, i.e., an MDP, marginalizing over x_{h+1} and y_{h+1}
at each time step h in the theorem statement reduces the gradient of the surrogate function to its MDP
case. In the MDP setting, we have ∇θ′ L_{πθ}(πθ′)|_{θ′=θ} = ∇θ′ η(πθ′)|_{θ′=θ}, which is one of the key components
in analyses of PG methods in MDPs [Kakade and Langford, 2002, Schulman et al., 2015].

3.5 Gradient Dominance and Surrogate Function


In the following we study the convergence properties of PG methods on the surrogate function L_{πθ}(πθ′).
Consider the following operator, as an alternative to the Bellman max operator:
Definition 3.3. Given a POMDP M and an advantage function A_{πθ}, define the optimizing operator
T as follows:

        T A_{πθ} = max_{θ′} ∫_τ ∑_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ)
                  ( A_{πθ}(y_{h+1}, πθ′(y_h), y_h, a_{h−1}, y_{h−1}) − A_{πθ}(y_{h+1}, πθ(y_h), y_h, a_{h−1}, y_{h−1}) ) dτ,

with θ⋆(θ) a maximizer and πθ⋆(θ) its corresponding policy.
T A_{πθ} represents the maximum improvement one can make on L_{πθ}(πθ′) by changing θ′. Consider
the following set of assumptions.
Assumption 3.1. For every time step h, and every tuple (y_{h+1}, y_h, a_{h−1}, y_{h−1}), A_{πθ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})
is a concave function of ǎ_h.
Assumption 3.2. For any θ, there exists a vector ϖ(θ) in the space of observation representative
policies, such that for any time step h and observation y_h, ∂_α π_{θ+αϖ(θ)}(y_h)|_{α=0} = π_{θ⋆(θ)}(y_h) − πθ(y_h).
Theorem 3.4 (Convergence of PG on Surrogate Objective). Given a POMDP M, under
Asms. 3.1 and 3.2, for a policy πθ and a corresponding function ϖ(θ), we have,

        T A_{πθ} ≤ ∇_α L_{πθ}(π_{θ+αϖ(θ)})|_{α=0}.

Proof of the Thm. 3.4 in Subsection A.5.

Corollary 3.2 (Optimality and the Surrogate Objective). Since T A_{πθ} is always non-negative, when the
gradient ∇θ′ L_{πθ}(πθ′)|_{θ′=θ} is zero, T A_{πθ} is equal to zero.

Thm. 3.4 is the POMDP generalization of the main theorem in Bhandari and Russo [2019],
which is developed for MDPs. In the fully observable setting, Thm. 3.4 reduces to its MDP
counterpart, which can be seen by marginalizing x_{h+1} and y_{h+1} for each time step h in the derivation.
The assumptions deployed in the statement of Thm. 3.4 are limiting, but the resulting statement
is strong. In the following we relax these assumptions.

3.6 Convergence Guarantee and Distribution Mismatch Coefficient


In the following we study the convergence guarantees of PG methods without the assumptions in the
previous subsection. We show that convergence takes place in the presence of coefficients related to the
richness of the policy class, and how far the underlying POMDP is from being an MDP.

Definition 3.4 (Distribution mismatch coefficient). Given any pair of parameters θ and θ′, the
infinity norm of the ratio of the measures f(τ; θ′)dτ and f(τ; θ)dτ,11 i.e., ‖f(τ; θ′)/f(τ; θ)‖_∞ with
respect to the measure space (Υ, F, Pθ), denotes the distribution mismatch coefficient between πθ and πθ′.

The distribution mismatch coefficient for (θ, θ′) expresses how representative the trajectories gen-
erated by following π(θ) are for the trajectories generated by following π(θ′). In the following we
use this notion for (θ, θ*), i.e., ‖f(τ; θ*)/f(τ; θ)‖_∞, to capture how representative samples of π(θ) are for π(θ*).
The definition of the distribution mismatch coefficient, up to a slight modification, is a POMDP gener-
alization of its MDP counterpart [Antos et al., 2008, Agarwal et al., 2019]. For a given parameter θ
and an optimal parameter θ*, let θ+ ∈ Θ denote a maximizer of the following optimization,12
        θ+ ∈ arg max_{θ′′∈Θ} ∫_τ ∑_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ*) πθ′′(a_h | y_h)
                           T(x_{h+1} | x_h, a_h) O(y_{h+1} | x_{h+1}) A_{πθ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ.        (10)

Note that the objective of this optimization is positive, since setting πθ′′ = πθ* reduces the right
hand side to η(θ*) − η(θ); therefore, if θ = θ*, we have η(θ*) = η(θ+). Considering the
actions of policy πθ+ on trajectories generated using τ ∼ f(·; θ), we define an indicator function
e(x_h, a_{h−1}, y_{h−1}), such that for each h and each (x_h, a_{h−1}, y_{h−1}), e(x_h, a_{h−1}, y_{h−1}) is equal to 1 if,
        ∫_{y_h, a_h, x_{h+1}, y_{h+1}} O(y_h | x_h) πθ+(a_h | y_h) T(x_{h+1} | x_h, a_h) O(y_{h+1} | x_{h+1}) A_{πθ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dy_h da_h dx_{h+1} dy_{h+1} ≥ 0,

and zero otherwise. In tabular MDPs with tabular policies, since one can choose actions that make
the evaluated advantage function at each state non-negative, e(x_h, a_{h−1}, y_{h−1}) = e(x_h, a_{h−1}, x_{h−1})
is always equal to one, which follows from the definition of θ+. For each tuple (a_h, y_h, x_h, a_{h−1}, y_{h−1}),
we define π+_{θ+,θ}, which keeps the parts of trajectories that contribute positive increments in Eq. 10:

        π+_{θ+,θ}(a_h, y_h, x_h, a_{h−1}, y_{h−1}) := e(x_h, a_{h−1}, y_{h−1}) πθ+(a_h | y_h),        (11)

11 In general, it need not be with respect to the Lebesgue measure.
12 One should use the θ+(θ, θ*) notation, but we adopt θ+ for simplicity since we are defining it for a given θ, θ*.

where π+_{θ+,θ}, while being a kernel from the measure theoretic point of view, might not necessarily
be a probability kernel. We note that π+_{θ+,θ} reduces to π(θ+) in tabular MDPs. Therefore the
difference between π+_{θ+,θ} and π(θ+) accounts for both the partial observability of the system and
the expressiveness of the function class used for policies. In the following we consider the performance
of π(θ) and how much π+_{θ+,θ} improves this performance:

        ∆η+(θ+, θ) := ∫_τ ∑_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ) π+_{θ+,θ}(a_h, y_h, x_h, a_{h−1}, y_{h−1})
                      T(x_{h+1} | x_h, a_h) O(y_{h+1} | x_{h+1}) A_{πθ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ

                    − ∫_τ ∑_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) A_{πθ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ

                    = ∫_τ ∑_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ) ( π+_{θ+,θ}(a_h, y_h, x_h, a_{h−1}, y_{h−1}) − πθ(a_h | y_h) )
                      T(x_{h+1} | x_h, a_h) O(y_{h+1} | x_{h+1}) A_{πθ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ.        (12)

Using these pieces, finally, we define the Bellman policy error ǫ+_θ and its corresponding compatibil-
ity vector ω+, two problem dependent quantities, in Def. 3.5. These parameters summarize the
expressivity of the policy class and the closeness of the POMDP to MDP behavior.
Definition 3.5 (Bellman policy error). The Bellman policy error ǫ+_θ is the error of the best linear ap-
proximation, under mean absolute error, to ∆η+(θ+, θ) under the feature vector ∇θ πθ(a_h|y_h) ∈ R^d of
observation-action pairs,

        ǫ+_θ := min_{ω∈R^d} | ∆η+(θ+, θ) − ω^⊤ E_{τ∼π(θ)} ∑_{h=1}^{|τ|} γ^{h−1} ∫_{a′_h} ∇θ πθ(a′_h | y_h) A_{πθ}(y_{h+1}, a′_h, y_h, a_{h−1}, y_{h−1}) da′_h |,

with ω+_θ, a compatibility vector of the Bellman policy error, as a minimizer.


Since there might be multiple πθ+ as a result of Eq. 10 maximization, one can consider a πθ+
which minimizes ǫ+ +
θ . Using the approximation error ǫθ , and the distribution mismatch coefficient,
we have the following theorem.
Theorem 3.5 (Gradient dominance and distribution mismatch coefficient). Given a POMDP M,
a parameter set Θ, and the corresponding measure space (Υ, F, Pθ), for any θ we have,

        η(θ*) − η(θ) ≤ ǫ+_θ ‖f(τ; θ*)/f(τ; θ)‖_∞ + ‖f(τ; θ*)/f(τ; θ)‖_∞ ω+^⊤ ∇θ′ L_{πθ}(πθ′)|_{θ′=θ}.

Proof of the Thm. 3.5 in Subsection A.6.
Corollary 3.3 (Optimality and Bellman Policy Error). Thm. 3.5 indicates that if we follow policy
gradient on the surrogate objective L_{πθ}(πθ′) to a point where either the gradient ∇θ′ L_{πθ}(πθ′)|_{θ′=θ}
has a small norm, the policies are compatible (i.e., ω+ has a small norm), or their inner product is
small, and, importantly, the approximation error ǫ+_θ is small, then, as long as the distribution
mismatch coefficient is finite, the performance of π(θ) is close to that of π(θ*).

The results in Thm. 3.5, as well as the definitions of the Bellman policy error and the compatibility
vector, are extensions of the main results in Section 6 of the recent paper by Agarwal et al.
[2019]. For the discussion of a similar theorem in the MDP case, please refer to the mentioned paper.

4 Trust Region Policy Optimization


We now study how to formally extend trust region-based policy optimization methods to POMDPs.
Generally, the notion of gradient depends on the parameter metric space. Given a pre-specified
Riemannian metric, a gradient direction is defined. When the metric is Euclidean, the notion of
gradient reduces to the standard gradient [Lee, 2006]. The Riemannian generalization of the notion
of gradient adjusts the standard gradient direction based on the local curvature induced by the
Riemannian manifold of interest. Valuable knowledge of the curvature assists in finding an ascent
direction, which might yield a large ascent in the objective function. This approach is also
interpreted as a trust region method, where we are interested in assuring that the ascent steps do
not change the objective beyond a safe region in which the local curvature might not stay valid. In
general, a valuable manifold might not be given, and we need to adopt one. Fortunately, when the
objective function is an expectation over a parameterized distribution, Amari [2016] recommends
employing the Riemannian metric induced by the local Fisher information. This choice of metric
results in a well-known notion of gradient, known as the natural gradient. For the objective function
in Eq. 1, the Fisher information matrix is defined as follows,
        F(θ) := ∫_{τ∈Υ} f(τ; θ) [ ∇θ log(f(τ; θ)) ∇θ log(f(τ; θ))^⊤ ] dτ.        (13)

The deployment of natural gradients in PG methods goes back at least to Kakade [2002] in MDPs.
The direction of the gradient with respect to the F-induced metric is F(θ)^{−1}∇θ η(θ). One
can compute the inverse of this matrix to come up with the direction of the natural gradient.
Since neither storing the Fisher matrix is always feasible, nor is computing its inverse practically
desirable, direct utilization of F(θ)^{−1}∇θ η(θ) might not be a feasible option.
A classical approach to estimating F^{−1}∇θ η is based on compatible function approximation meth-
ods. Kakade [2002] studies this approach in the context of MDPs. In the following, we develop this
approach for POMDPs. Consider a feature map φ(τ) in a desired ambient space. We approximate
the return R(τ) via a linear function of the feature representation φ(τ), i.e.,
        min_ω ǫ(ω) := ∫_τ f(τ; θ) ( φ(τ)^⊤ ω − R(τ) )^2 dτ,

which is a convex program with a minimizer ω*. To find the optimal ω, we take the gradient of ǫ(ω)
with respect to ω and set it to the zero vector, i.e.,
        0 = ∇ω ǫ(ω)|_{ω=ω*} = ∫_τ 2 f(τ; θ) φ(τ) [ φ(τ)^⊤ ω* − R(τ) ] dτ.

At optimality,

        ∫_τ f(τ; θ) φ(τ) φ(τ)^⊤ ω* dτ = ∫_τ f(τ; θ) φ(τ) R(τ) dτ.

If we consider φ(τ) = ∇θ log( ∏_{h=1}^{|τ|} πθ(a_h | y_h) ), the LHS of this equality is F(θ)ω*. Therefore,

        F(θ) ω* = ∇θ η(θ)   =⇒   ω* = F(θ)^{−1} ∇θ η(θ).
This is one way to compute F(θ)^{−1}∇θ η(θ). Another approach to estimating the natural gradient relies on
the computation of the KL divergence. This approach, which has also been utilized in TRPO, suggests
first deploying the D_KL divergence substitution technique and then using a conjugate gradient procedure
to tackle the computation and storage bottlenecks of the direct computation of the natural gradient. In the
following, we analyze this approach in the episodic setting.
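
Before turning to the KL-based route, the following minimal numpy sketch illustrates the compatible
function approximation estimate derived above. It assumes the per-trajectory scores φ(τ^t) and
returns R(τ^t) are precomputed; the ridge term is our own numerical safeguard, not part of the
derivation.

import numpy as np

def natural_gradient_estimate(scores, returns, ridge=1e-6):
    """scores: (m, d) array of phi(tau^t) = grad_theta log prod_h pi(a_h|y_h);
    returns: (m,) array of R(tau^t). The least squares solution omega* satisfies
    F_hat omega* = g_hat, i.e., it estimates the natural gradient direction
    F(theta)^{-1} grad_theta eta(theta), up to the ridge term and sampling error."""
    m, d = scores.shape
    F_hat = scores.T @ scores / m + ridge * np.eye(d)   # empirical Fisher matrix
    g_hat = scores.T @ returns / m                      # empirical policy gradient
    return np.linalg.solve(F_hat, g_hat)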

Lemma 4.1. Under a set of mild regularity conditions,

        ∇²_{θ′} D_KL(θ, θ′)|_{θ′=θ} = F(θ),        (14)

with D_KL(θ, θ′) := − ∫_{τ∈Υ} f(τ; θ) log( f(τ; θ′)/f(τ; θ) ) dτ, and ∇²_{θ′} the Hessian with respect to θ′.

Lem. 4.1 is a known lemma in the literature, and we provide its proof in Subsection A.7.
In practice, it might not be feasible to compute the expectation in either the Fisher information
matrix or D_KL; instead, we use their empirical estimates. Given m trajectories:

        ∇²_{θ′} D̂_KL(θ, θ′)|_{θ′=θ} = − (1/m) ∑_{t=1}^{m} ∑_{h=1}^{|τ^t|} ∇²_{θ′} log( πθ′(a_h^t | y_h^t) / πθ(a_h^t | y_h^t) ).

This empirical estimate is the same for both MDPs and POMDPs. The analyses of most of the
celebrated PG methods, e.g., TRPO and PPO, are dedicated to infinite horizon MDPs, while almost
all the experimental studies in this line of work are in episodic settings. Therefore the estimator
used in these methods,

        ∇²_{θ′} D̂_KL^{TRPO}(θ, θ′)|_{θ′=θ} = − (1/∑_{t=1}^{m} |τ^t|) ∑_{t=1}^{m} ∑_{h=1}^{|τ^t|} ∇²_{θ′} log( πθ′(a_h^t | y_h^t) / πθ(a_h^t | y_h^t) ),

is a biased estimator of D_KL in episodic settings, which is the D_KL required in the mentioned
analyses. This bias, for example, can result in a dramatic failure of TRPO's construction of the trust
region and prevent it from fulfilling its monotonic improvement promise. While this bias might
result in degradation of empirical performance, there are, of course, scenarios where this bias in fact
helps to improve the performance. In this paper, we are concerned with properties that are aligned
with theoretical guarantees.
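
To make the source of this bias concrete, the sketch below contrasts the two normalizations behind
the above estimators: averaging the per-trajectory KL (episodic) versus pooling all time steps
(TRPO-style). It assumes the per-step log-ratios log πθ(a_h|y_h) − log πθ′(a_h|y_h) have been recorded
for each trajectory sampled from πθ; these helpers are illustrative only.

import numpy as np

def episodic_kl_estimate(log_ratios_per_traj):
    """(1/m) sum_t sum_h log(pi_theta/pi_theta'): each trajectory contributes its
    full KL, so longer trajectories weigh more and shrink the trust region."""
    return float(np.mean([np.sum(lr) for lr in log_ratios_per_traj]))

def trpo_style_kl_estimate(log_ratios_per_traj):
    """Pools all steps and divides by sum_t |tau^t|: indifferent to trajectory
    length, as if samples came from a stationary infinite horizon distribution."""
    pooled = np.concatenate([np.asarray(lr) for lr in log_ratios_per_traj])
    return float(np.mean(pooled))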
One can skip the following subsection with no serious harm to the subjects afterward. In it, we
provide a discussion of ∇²_{θ′} D_KL vs. ∇²_{θ′} D_KL^{TRPO}.

4.1 D_KL vs. D_KL^TRPO

The use of D_KL instead of D_KL^TRPO is motivated by theory and is also intuitively recommended.
A small change in the policy at the beginning of a short episode does not make a drastic shift in
the distribution of the trajectory, but might cause radical shifts when the trajectory length is long.
Therefore, for longer horizons, the trust region needs to shrink.

For the sake of simplicity, view the construction of the trust region as D_KL ≤ δ for some desired
δ. Consider two trajectories, one long and one short. The constraint D_KL ≤ δ induces a region which allows
greater changes in the policy for the short trajectory while limiting changes for the long trajectory. In
contrast, D_KL^TRPO ≤ δ induces a region which is indifferent to the length of trajectories and treats
each sample as if it were experienced in a stationary distribution of an infinite horizon MDP. In other
words, it allows the same amount of change in the policy for short episodes as it gives to long ones.
Consider a toy RL problem where at the beginning of learning, when the policy is not good,
the agent dies at the early stages of the episodes (termination). In this case, the trust region under
D_KL is vast and allows for substantial changes in the policy parameters, while again, D_KL^TRPO does
not consider the length of the episode. On the other hand, toward the end of learning, when the
agent has learned a good policy, the length of the horizon grows, and small changes in the policy
might cause drastic changes in the trajectory distribution. Therefore the trust region under D_KL
shrinks, and only small changes in the policy parameters are allowed, which is again captured
by D_KL but not at all by D_KL^TRPO.
It is worth restating that one can cook up an example where the D_KL^TRPO construction is helpful, but
that is not the point of this section, since it is not what the TRPO guarantees and analysis promise.
Generally, there are more issues with treating episodic RL problems as infinite horizon problems, and
we refer the reader to Thomas [2014] for more on this topic.

4.2 Construction of the Trust Region


In practice, depending on the problem at hand, either of the discussed approaches for computing
the natural gradient can be applicable. For the construction of the trust region, one can exploit the
close relationship between D_KL and the Fisher information matrix (Lem. 4.1), and also the fact that the
Fisher matrix appears in the second order Taylor expansion of D_KL. Instead of considering the region
‖(θ − θ′)^⊤ F (θ − θ′)‖_2 ≤ δ, or ‖(θ − θ′)^⊤ ∇²_{θ′} D_KL(θ, θ′)|_{θ′=θ} (θ − θ′)‖_2 ≤ δ, for the construction of
the trust region, we can approximately consider D_KL(θ, θ′) ≤ δ/2. The relationships between
these three approaches to constructing the trust region are used throughout this paper.
To complete the study of KL divergences, we propose a discount-factor-dependent divergence and
provide the monotonic improvement guarantee with respect to D_KL as well as this new divergence.
The D_KL divergence and the Fisher information matrix in Eqs. 14 and 13 do not convey the effect
of the discount factor. Consider a setting with a small discount factor γ. In this setting, we do
not mind drastic distribution changes in the latter parts of episodes, since the reward achieved there is
strongly discounted anyway. Therefore, we desire to have an even wider trust region and allow
bigger changes for the latter parts of the trajectories. This is a valid intuition, and in the following, we
derive a KL-type divergence that also incorporates γ. We rewrite η(θ) as follows,
        η(θ) = ∫_{τ∈Υ} f(τ; θ) R(τ) dτ = ∫_{τ∈Υ} ∑_{h=1}^{|τ|} f(τ_{1..h}; θ) γ^{h−1} r_h(τ) dτ.

Following the Amari [2016] reasoning for Fisher information of each component of the sum, we
derive a γ-dependent divergence Dγ (θ, θ ′ ),
X r
1 2
Dγ (θ, θ ′ ) := γ h−1 DKL τ1..h ∼ f (·; θ ′ ), τ1..h ∼ f (·; θ) . (15)
2
h≥1
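
A possible empirical counterpart of Eq. 15 is sketched below. It assumes the per-step log-ratios
log πθ(a_h|y_h) − log πθ′(a_h|y_h) are recorded along trajectories sampled from πθ (the environment
terms cancel between the two prefix measures), and it averages each prefix only over trajectories
long enough to contain it, a simplification of our own.

import numpy as np

def empirical_d_gamma(step_log_ratios, gamma):
    """step_log_ratios[t][h] = log pi_theta(a_h^t|y_h^t) - log pi_theta'(a_h^t|y_h^t).
    The prefix KL for length h is estimated by averaging cumulative sums over
    trajectories; negative Monte Carlo estimates are clipped at zero."""
    H = max(len(lr) for lr in step_log_ratios)
    d_gamma_root = 0.0
    for h in range(H):
        vals = [np.sum(lr[: h + 1]) for lr in step_log_ratios if len(lr) > h]
        prefix_kl = max(float(np.mean(vals)), 0.0)
        d_gamma_root += gamma ** h * np.sqrt(0.5 * prefix_kl)
    return d_gamma_root ** 2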

Algorithm 1 GTRPO
1: Initialize πθ, ǫ′, δ′
2: Choose divergence D: D_KL or Dγ
3: for epoch = 1 until convergence do
4:   Deploy policy πθ and collect experiences
5:   Estimate the advantage function Â_{πθ}
6:   Construct empirical estimates of the surrogate objective L̂_{πθ}(πθ′) and D̂
7:   Update θ using
         θ ∈ arg max_{θ′} L̂_{πθ}(πθ′)   s.t.   (1/2) ‖(θ′ − θ)^⊤ ∇²_{θ′′} D̂(θ, θ′′)|_{θ′′=θ} (θ′ − θ)‖_2 ≤ δ′
8: end for

This divergence penalizes the distribution mismatch less in the latter parts of trajectories through
discounting them. Similarly, taking into account the relationship between the KL divergence and the
Fisher information, we also define the γ-dependent Fisher information Fγ(θ),

        Fγ(θ) := ∫_{τ∈Υ} ∑_{h=1}^{|τ|} γ^h f(τ_{1..h}; θ) [ ∇θ log(f(τ_{1..h}; θ)) ∇θ log(f(τ_{1..h}; θ))^⊤ ] dτ.

We develop our trust region study upon both divergences, D_KL and Dγ.

4.3 Generalized Trust Region Policy Optimization


We propose Generalized Trust Region Policy Optimization (GTRPO), a generalization of the MDP-
based trust region methods in Kakade and Langford [2002], Schulman et al. [2015] to episodic en-
vironments, agnostic to whether the environment is an MDP or a POMDP. We utilize the proposed
notion of advantage function, prove the monotonic improvement properties of GTRPO, and show how
the KL divergences, Dγ and D_KL, play their roles in the construction of trust regions.
As illustrated in Alg. 1, GTRPO employs its current policy πθ to collect trajectories and estimates the
advantage function using the collected samples. GTRPO deploys the estimated advantage function to
derive the surrogate objective. In order to come up with a new policy, GTRPO maximizes the surro-
gate objective over policy parameters in the vicinity of the current policy, defined through the trust
region. The underlying procedures in GTRPO are similar to its predecessor TRPO, except that, instead of
maximizing the surrogate objective defined over the unobserved hidden-state-dependent advantage func-
tion, A_{πθ}(a_h, x_h), it maximizes the surrogate objective defined using A_{πθ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}).
Note that if the underlying environment is an MDP, A_{πθ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) is equivalent to
A_{πθ}(x_{h+1}, a_h, x_h), where after marginalizing out x_{h+1} in the expectation we end up with A_{πθ}(a_h, x_h)
and recover TRPO. In practice, one can estimate the advantage function A_{πθ}(y_{h+1}, a, y_h, a_{h−1}, y_{h−1})
by approximating Q_{πθ}(y_{h+1}, a, y_h) and V_{πθ}(y_h, a_{h−1}, y_{h−1}) using data collected by following πθ and
function classes of interest.
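
As one concrete, hypothetical instantiation of this estimation step, the sketch below forms tabular
Monte Carlo estimates of the Q and V functions of Def. 3.1 by averaging discounted returns-to-go
over the corresponding triples of consecutive events, and returns the advantage of Def. 3.2 as their
difference. In practice the dictionaries would be replaced by parametric function classes.

from collections import defaultdict
import numpy as np

def fit_pomdp_advantage(trajectories, gamma=1.0):
    """trajectories: lists of (y_h, a_h, r_h). Returns a lookup for
    A(y_{h+1}, a_h, y_h, a_{h-1}, y_{h-1}) = Q(y_{h+1}, a_h, y_h) - V(y_h, a_{h-1}, y_{h-1})."""
    q_sums, v_sums = defaultdict(list), defaultdict(list)
    for traj in trajectories:
        # discounted return-to-go G_h = sum_{h' >= h} gamma^{h'-h} r_{h'}
        G, rtg = 0.0, [0.0] * len(traj)
        for h in reversed(range(len(traj))):
            G = traj[h][2] + gamma * G
            rtg[h] = G
        y_prev, a_prev = None, None                 # the "-" placeholder at h = 1
        for h, (y, a, _) in enumerate(traj):
            v_sums[(y, a_prev, y_prev)].append(rtg[h])
            if h + 1 < len(traj):
                q_sums[(traj[h + 1][0], a, y)].append(rtg[h])
            y_prev, a_prev = y, a
    q_hat = {k: float(np.mean(v)) for k, v in q_sums.items()}
    v_hat = {k: float(np.mean(v)) for k, v in v_sums.items()}

    def advantage(y_next, a, y, a_prev, y_prev):
        return q_hat.get((y_next, a, y), 0.0) - v_hat.get((y, a_prev, y_prev), 0.0)
    return advantage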
In the following we show that maximizing L_{πθ}(πθ′) over θ′ results in a lower bound on the
improvement η(πθ′) − η(πθ) when πθ and πθ′ are close under D_KL or Dγ. Consider the maximum
spans of the advantage function, ǫ and ǫ′, such that for all θ ∈ Θ, the following inequalities hold,

        ǫ  ≥ max_{τ∈Υ} ∑_h γ^{h−1} A_{πθ}(y_{h+1}, πθ(y_h), µθ(y_h), y_h, y_{h−1}, a_{h−1}),
        ǫ′ ≥ max_{τ∈Υ} A_{πθ}(y_{h+1}, πθ(y_h), y_h, y_{h−1}, a_{h−1}).

Using the quantities ǫ and ǫ′, we have the following monotonic improvement guarantees,
Theorem 4.1 (Monotonic Improvement Guarantee). For two policies πθ and πθ′, construct L_{πθ}(πθ′); then

  1)  η(πθ′) ≥ L_{πθ}(πθ′) − ǫ TV(τ ∼ f(·; πθ′), τ ∼ f(·; πθ)) − ǫ
             ≥ L_{πθ}(πθ′) − ǫ √( (1/2) D_KL(πθ′, πθ) ) − ǫ,

  2)  η(πθ′) ≥ L_{πθ}(πθ′) − ǫ′ ∑_{h≥1} γ^{h−1} √( (1/2) D_KL(τ_{1..h} ∼ f(·; πθ′), τ_{1..h} ∼ f(·; πθ)) ) − ǫ
             ≥ L_{πθ}(πθ′) − ǫ′ √( Dγ(πθ, πθ′) ) − ǫ,

where the trailing ǫ is the advantage gap of Eq. 30 and is zero in the case of MDPs.


Proof of the Thm. 4.1 in Subsection A.8. The Thm. 4.1 recommends optimizing Lπθ (πθ′ ) over
πθ′ in the vicinity of πθ , defined by DKL or Dγ . More formally, Thm. 4.1 results suggest to maximize
the shifted lower bound on η(πθ′ ), i.e., either,
r q
1
Lπθ (πθ′ ) − ǫ DKL (πθ′ , πθ ), or Lπθ (πθ′ ) − ǫ′ Dγ (πθ , πθ ),
2
and accept the new parameter if either of these objective values is above the lowered expected
return of the current parameter, η(θ)−ǫ, resulting in monotonic improvement. Using the relationship
between the KL divergence and Fisher information, as well as the applications of interest in practice,
we propose the following alternative optimization procedures. In practice, given the current policy
πθ , one might be interested in either of the following optimization,
p q

max L (π
πθ θ ′ ) − C DKL (π ,
θ θπ ′ ), or max L (π
πθ θ ′ ) − C Dγ (πθ , πθ′ )),
′θ ′ θ

where C and C ′ are the problem and application dependent constants. We also can view the C and
C ′ as the knobs to restrict the trust region denoted by δ, δ′ and construct the following constraint
optimization problems,

max Lπθ (πθ′ ) s.t. DKL (πθ , πθ′ ) ≤ δ, or max Lπθ (πθ′ ) s.t. Dγ (πθ , πθ′ )) ≤ δ′ .
′θ ′θ

Furthermore, we can approximate these constraints up to their second-order Taylor expansion and
come up with:
1 ⊤  1 ⊤ 
max Lπθ(πθ′) s.t. k θ ′ − θ F θ ′ − θ k2 ≤ δ, or max Lπθ(πθ′ ) s.t. k θ ′ − θ Fγ θ ′ − θ k2 ≤ δ′ ,
θ
′ 2 θ ′ 2

which results in Alg. 1. We chose to present the above derivation in the form used in Alg. 1 since the constraints in the optimization can be imposed using the aforementioned conjugate gradient techniques. Alg. 1 is an extension of the TRPO algorithm to the present setting.
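For concreteness, the constrained update of Alg. 1 is typically carried out TRPO-style: solve F s ≈ ∇_{θ′} L̂ with conjugate gradient, using only Fisher-vector products, and rescale the step so that the quadratic constraint holds for the chosen δ. The sketch below is one such implementation under these assumptions; the helper names are illustrative rather than taken from any existing codebase.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x = 0 initially)
    p = g.copy()
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def natural_gradient_step(theta, grad_L, fvp, delta):
    """One trust-region update: move along F^{-1} grad_L, rescaled so that
    (1/2) s^T F s <= delta up to the conjugate-gradient approximation."""
    s = conjugate_gradient(fvp, grad_L)
    step_size = np.sqrt(2.0 * delta / max(s @ fvp(s), 1e-12))
    return theta + step_size * s
```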
We hope that these analyses shed light on how to extend MDP-based methods to the general class of POMDPs and on which components are crucial to consider. As mentioned in the introduction, the tools developed in this paper can be used to extend a variety of existing advanced MDP-based methods to POMDPs. For example, Thm. 3.3 and Thm. 4.1 are directly applicable to generalize the constrained policy optimization approaches [Achiam et al., 2017] to general episodic environments.

4.4 Generalized PPO


The celebrated PG algorithm PPO is an extension of the TRPO algorithm with some favorable properties compared to its predecessor. Broadly speaking, the PPO framework promises more desirable computational and statistical advantages. Following the recipe of extending TRPO to PPO, in the following we derive Generalized PPO (GPPO), a practically more convenient extension of GTRPO.
Usually, in high-dimensional but low-sample settings, constructing the trust region is hard due to high estimation errors. Many samples are required for a meaningful construction of the trust region. It is even harder when the region depends on the inverse of the estimated Fisher matrix or when the optimization is constrained with an upper threshold on a KL divergence. Therefore, trusting the estimated trust region is questionable. While TRPO requires a concrete construction of the trust region in the parameter space, its final goal is to keep the new policy close to the current policy, i.e., small D_KL(π_θ, π_θ′) or D_γ(π_θ, π_θ′). Proximal Policy Optimization (PPO) is instead proposed to impose the structure of the trust region directly onto the policy space. This method approximately translates the constraints developed in TRPO on the parameter space directly to the policy space, in other words the action space. The PPO optimization subroutine suppresses the gradient of the objective function, setting it to zero, once the policy starts to operate beyond the region of trust. PPO optimizes the following objective function,13

    E[ min{ (π_θ′(a|x)/π_θ(a|x)) A_{π_θ}(a, x),  clip(π_θ′(a|x)/π_θ(a|x); 1 − δ_L, 1 + δ_U) A_{π_θ}(a, x) } ].

If the advantage function is positive and the importance weight is above 1 + δ_U, this objective function saturates. When the advantage function is negative and the importance weight is below 1 − δ_L, this objective function saturates again. In either case, when the objective function saturates, its gradient is zero; therefore, further movement in that direction is obstructed. This approach, despite its simplicity, approximates the trust region effectively and substantially reduces the computational cost of TRPO.
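As a concrete illustration of this saturation behavior, the snippet below evaluates the clipped surrogate for a few importance weights; it is a minimal sketch of the standard clipped objective (with the symmetric choice δ_L = δ_U taken only for illustration), not of any particular codebase.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, delta_l=0.2, delta_u=0.2):
    """Per-sample clipped objective: min(ratio * A, clip(ratio; 1 - delta_l, 1 + delta_u) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - delta_l, 1.0 + delta_u) * advantage
    return np.minimum(unclipped, clipped)

# Positive advantage: once ratio > 1 + delta_u, the objective is flat in ratio (zero gradient).
print(clipped_surrogate(np.array([1.0, 1.2, 1.5]), advantage=2.0))   # [2.0, 2.4, 2.4]
# Negative advantage: once ratio < 1 - delta_l, the minimum picks the constant clipped branch.
print(clipped_surrogate(np.array([1.0, 0.8, 0.5]), advantage=-2.0))  # [-2.0, -1.6, -1.6]
```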
Following TRPO, the clipping trick ensures that the importance weight, derived from the estimation of D_KL, does not go beyond a certain limit, |log(π_θ′(a|y)/π_θ(a|y))| ≤ ν, when doing so would be favored by the objective, i.e.,

    1 − δ_L := exp(−ν) ≤ π_θ′(a|y)/π_θ(a|y) ≤ 1 + δ_U := exp(ν),        (16)
13 In the original PPO paper, δ_U = δ_L.

Algorithm 2 GPPO
1: Initialize π_θ, k, λ
2: Choose (δ_l^h, δ_u^h): either (δ_L, δ_U) or (δ_L^h, δ_U^h)
3: for epoch = 1 until convergence do
4:    Deploy policy π_θ and collect experiences
5:    Estimate the advantage function Â_{π_θ}
6:    Construct the empirical estimate L̂_{π_θ}(π_θ′, δ_l^h, δ_u^h) of the following objective function:
          L_{π_θ}(π_θ′, δ_l^h, δ_u^h) := E[ min{ (π_θ′(a_h|y_h)/π_θ(a_h|y_h)) A_{π_θ}(y_{h+1}, a_h, y_h, y_{h−1}, a_{h−1}),
                                                 clip(π_θ′(a_h|y_h)/π_θ(a_h|y_h); 1 − δ_l^h, 1 + δ_u^h) A_{π_θ}(y_{h+1}, a_h, y_h, y_{h−1}, a_{h−1}) } ]
7:    Initialize θ′ = θ
8:    for k steps do
9:       Update θ′ ← θ′ + λ ∇_{θ′} L̂_{π_θ}(π_θ′) (or any other ascent algorithm)
10:   end for
11:   Set θ = θ′
12: end for

depending on the sign of the advantage function. As discussed in Subsection 4.1, we propose a principled change in the clipping such that it matches the definition of the KL divergence in Lem. 4.1 and conveys the information about the length of episodes, |log(π_θ′(a|y)/π_θ(a|y))| ≤ ν/|τ|; therefore, for α := exp(ν),

    1 − δ_L := α^(−1/|τ|) ≤ π_θ′(a|y)/π_θ(a|y) ≤ 1 + δ_U := α^(1/|τ|).        (17)

This change enforces tighter clipping for longer trajectories and wider clipping for shorter ones. Moreover, as suggested by Thm. 4.1 and the definition of D_γ(π_θ, π_θ′) in Eq. 15, we propose an extension of the clipping based on information about the discount factor. Following the prescription D_γ(π_θ, π_θ′) ≤ δ′, for a sample at time step h of an episode we have |log(π_θ′(a|y)/π_θ(a|y))| ≤ ν/(|τ|γ^h). Therefore:

    1 − δ_L^h := exp(−ν/(|τ|γ^h)) ≤ π_θ′(a|y)/π_θ(a|y) ≤ 1 + δ_U^h := exp(ν/(|τ|γ^h)),
    i.e.,   α^(−1/(|τ|γ^h)) ≤ π_θ′(a|y)/π_θ(a|y) ≤ α^(1/(|τ|γ^h)).        (18)

As interpreted, for deeper parts of an episode we make the clipping softer and allow for larger changes in policy space. This means we are more restricted at the beginning of trajectories than toward the end of trajectories. The choices of γ and α are critical here. In practical implementations of RL algorithms, as also theoretically suggested by Jiang et al. [2015], Lipton et al. [2016], we usually choose discount factors smaller than the one specified by the problem, especially when we deploy function approximation. Therefore, instead of keeping γ^h in Eq. 18, since the true γ in practice is unknown and can be arbitrarily close to 1, we cap the clipping range with a maximum value β:

    1 − δ_L^h := max{α^(−1/(|τ|γ^h)), 1 − β} ≤ π_θ′(a|y)/π_θ(a|y) ≤ 1 + δ_U^h := min{α^(1/(|τ|γ^h)), 1 + β}.        (19)
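For concreteness, the per-timestep clipping range of Eq. 19 can be computed as in the sketch below; the function name and the particular values of α, γ, and β are illustrative assumptions rather than recommendations.

```python
import numpy as np

def clip_range(h, traj_len, alpha=1.2, gamma=0.9, beta=0.3):
    """Per-timestep clipping bounds (1 - delta_L^h, 1 + delta_U^h) from Eq. 19.

    h: time step within the episode (1-indexed), traj_len: |tau|,
    alpha = exp(nu), beta: cap on how wide the clipping range may become.
    """
    scale = alpha ** (1.0 / (traj_len * gamma ** h))
    lower = max(1.0 / scale, 1.0 - beta)   # max{alpha^{-1/(|tau| gamma^h)}, 1 - beta}
    upper = min(scale, 1.0 + beta)         # min{alpha^{ 1/(|tau| gamma^h)}, 1 + beta}
    return lower, upper

# Later time steps get a wider (softer) clipping range, until the beta cap binds.
for h in (1, 20, 50):
    print(h, clip_range(h, traj_len=50))
```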
The modifications proposed in the series of equations Eq. 16, Eq. 17, Eq. 18, and Eq. 19 provide insight into the use of trust regions in both MDP-based and POMDP-based PPO. The GPPO objective for any choice of δ_U^h and δ_L^h in MDPs is

    E[ min{ (π_θ′(a_h|x_h)/π_θ(a_h|x_h)) A_{π_θ}(a_h, x_h),  clip(π_θ′(a_h|x_h)/π_θ(a_h|x_h); 1 − δ_L^h, 1 + δ_U^h) A_{π_θ}(a_h, x_h) } ],        (20)
while for POMDPs, GPPO optimizes the following,

    L_{π_θ}(π_θ′) := E[ min{ (π_θ′(a_h|y_h)/π_θ(a_h|y_h)) A_{π_θ}(y_{h+1}, a_h, y_h, y_{h−1}, a_{h−1}),
                             clip(π_θ′(a_h|y_h)/π_θ(a_h|y_h); 1 − δ_L^h, 1 + δ_U^h) A_{π_θ}(y_{h+1}, a_h, y_h, y_{h−1}, a_{h−1}) } ],        (21)
resulting in Alg. 2. In order to make an existing MDP-based PPO implementation suitable for POMDPs, one can simply substitute A_{π_θ}(a_h, x_h) with A_{π_θ}(y_{h+1}, a_h, y_h, y_{h−1}, a_{h−1}) in the codebase. Generally, however, extra care is required when turning MDP-based implementations into POMDP-based ones, since for POMDPs ∇_{θ′} L_{π_θ}(π_θ′)|_{θ′=θ} might not be aligned with ∇_{θ′} η(θ′)|_{θ′=θ}, as is the case for MDPs.
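To make the substitution explicit, the following sketch shows a single GPPO sample term; it is identical to the per-sample PPO term except that the advantage argument is the observation tuple and the clipping bounds come from the per-timestep range of Eq. 19. The interface is hypothetical and framework-agnostic.

```python
def gppo_term(ratio, adv_tuple, lo, hi):
    """One GPPO sample term.  ratio = pi'(a_h|y_h) / pi(a_h|y_h);
    adv_tuple = A(y_{h+1}, a_h, y_h, y_{h-1}, a_{h-1});
    (lo, hi) = (1 - delta_L^h, 1 + delta_U^h), e.g., from the per-timestep range of Eq. 19."""
    clipped_ratio = min(max(ratio, lo), hi)
    return min(ratio * adv_tuple, clipped_ratio * adv_tuple)
```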

Note: In this section, we explained how the technical tools developed in this paper can be carried over to generalize a variety of MDP-based methods and analyses. For that purpose, we explained how to generalize TRPO and PPO to the POMDP setting. These two extensions are examples that illustrate the road map of extending MDP-based methods to POMDPs.
It is worth noting that the goal of this paper is to provide theoretical tools to study POMDPs, rather than to develop new algorithms for partially observable environments. While we show how these tools can be used to generalize a few MDP-based methods, the empirical examination of these generalized methods is beyond the scope of this paper. We also leave the generalization of the vast literature on MDP-based methods to later studies.

5 Conclusion
In this paper, we studied the algorithmic and convergence properties of PG methods in the general class of fully and partially observable environments, under both discounted and undiscounted rewards. We proposed novel notions of value, Q, and advantage functions for POMDPs. We generalized the MDP-based PG theorems to POMDPs. We studied the behavior of policies on the underlying states of POMDPs. We provided convergence analyses for optimizing the surrogate functions using the novel notion of advantage function. We proposed a new notion of trust region for episodic environments and showed that the current practice of considering the infinite horizon formulation of the trust region is, theoretically, not aligned with the premises of the prior works in episodic environments. To mitigate this shortcoming, we also proposed the construction of a discount-factor-dependent trust region.

We argue that the technical tools developed in this paper to analyze POMDPs are generic and can be carried over to generalize and extend a variety of existing MDP-based analyses and methods to POMDPs. For that purpose, we showed how to extend the TRPO and PPO methods to POMDPs. We proposed GTRPO, the first trust-region policy-optimization-based algorithm that enjoys monotonic improvement guarantees on POMDPs. We further extended our study and proposed GPPO, i.e., an analog of PPO for POMDPs. These extensions are evidence of the generality of the tools developed in this paper.

Acknowledgements
K. Azizzadenesheli gratefully acknowledges the financial support of Raytheon, NSF Career Award
CCF-1254106 and AFOSR YIP FA9550-15-1-0221. A. Anandkumar is supported in part by Bren
endowed chair, DARPA PAIHR00111890035 and LwLL grants, Raytheon, Microsoft, Google, and
Adobe faculty fellowships.

A Appendix
A.1 Proof of the Thm. 3.1
For a given policy π_θ, let us restate the value function at a given state x_1:

    V_θ(x_1) = ∫_{a_1, y_1} π_θ(a_1|y_1) O(y_1|x_1) Q_θ(a_1, x_1) dy_1 da_1.

For this value function, we compute the gradient with respect to the parameters θ, i.e., ∇_θ V_θ(x_1):

    ∇_θ V_θ(x_1) = ∫_{a_1, y_1} ∇_θ π_θ(a_1|y_1) O(y_1|x_1) Q_θ(a_1, x_1) dy_1 da_1
                 + ∫_{a_1, y_1} π_θ(a_1|y_1) O(y_1|x_1) ∇_θ Q_θ(a_1, x_1) dy_1 da_1.

To further expand this gradient, consider the definition of Q_θ(a_1, x_1) using the Bellman equation,

    Q_θ(a_1, x_1) = E[r_1 | x_1, a_1] + γ ∫_{x_2} V_θ(x_2) T(x_2|x_1, a_1) dx_2,

where E[r_1 | x_1, a_1] = ∫_{y_1} O(y_1|x_1) R(y_1, a_1) dy_1, resulting in

    ∇_θ Q_θ(a_1, x_1) = γ ∫_{x_2} ∇_θ V_θ(x_2) T(x_2|x_1, a_1) dx_2,

since the first term in the definition of Q_θ(a_1, x_1) does not depend on the parameters θ. Following these steps, we derive ∇_θ V_θ(x_2),

    ∇_θ V_θ(x_2) = ∫_{a_2, y_2} ∇_θ π_θ(a_2|y_2) O(y_2|x_2) Q_θ(a_2, x_2) dy_2 da_2
                 + ∫_{a_2, y_2} π_θ(a_2|y_2) O(y_2|x_2) ∇_θ Q_θ(a_2, x_2) dy_2 da_2.

We recursively compute the gradient of the value functions for later time steps and conclude that

    ∇_θ V_θ(x_h) = ∫_{a_h, y_h} ∇_θ π_θ(a_h|y_h) O(y_h|x_h) Q_θ(a_h, x_h) dy_h da_h
                 + γ ∫_{a_h, y_h} π_θ(a_h|y_h) O(y_h|x_h) ( ∫_{x_{h+1}} ∇_θ V_θ(x_{h+1}) T(x_{h+1}|x_h, a_h) dx_{h+1} ) dy_h da_h,

and repeating this decomposition results in

    ∇_θ η(θ) = ∫_{x_1} P_1(x_1) ∇_θ V_θ(x_1) dx_1
             = ∫_{a_1, y_1, x_1} P_1(x_1) ∇_θ π_θ(a_1|y_1) O(y_1|x_1) Q_θ(a_1, x_1) dy_1 da_1 dx_1
               + γ ∫_{a_1, y_1, x_1} P_1(x_1) π_θ(a_1|y_1) O(y_1|x_1) ( ∫_{x_2} ∇_θ V_θ(x_2) T(x_2|x_1, a_1) dx_2 ) dy_1 da_1 dx_1
             = ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ) ∇_θ π_θ(a_h|y_h) Q_θ(a_h, x_h) dτ,

which is the first statement of the theorem.
For the second statement we have:

    ∇_θ η(θ) = ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ) ∇_θ π_θ(a_h|y_h) Q_θ(a_h, x_h) dτ
             = ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ) π_θ(a_h|y_h) ∇_θ log(π_θ(a_h|y_h)) Q_θ(a_h, x_h) dτ
             = ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}; θ) ∇_θ log(π_θ(a_h|y_h)) Q_θ(a_h, x_h) dτ.
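The second statement is the score-function form that is typically used in practice: sampling trajectories with π_θ and weighting the per-step scores ∇_θ log π_θ(a_h|y_h) by an estimate of Q_θ(a_h, x_h), e.g., the empirical discounted return-to-go (x_h itself is unobserved), yields a REINFORCE-style Monte Carlo estimate of ∇_θ η(θ). A minimal sketch under these assumptions, with illustrative names, is given below.

```python
import numpy as np

def pg_estimate(trajectories, gamma):
    """REINFORCE-style Monte Carlo estimate of grad eta(theta) based on the second
    statement of Thm. 3.1.  Each trajectory is a list of (score, reward) pairs, where
    score = grad_theta log pi_theta(a_h | y_h) as a 1-D numpy array.  The unobservable
    Q_theta(a_h, x_h) is replaced by the empirical discounted return-to-go."""
    grad = None
    for traj in trajectories:
        rewards = np.array([r for _, r in traj])
        for h, (score, _) in enumerate(traj):
            ret_to_go = np.sum(rewards[h:] * gamma ** np.arange(len(rewards) - h))
            term = gamma**h * ret_to_go * score      # gamma^{h-1}, since h starts at 0 here
            grad = term if grad is None else grad + term
    return grad / len(trajectories)
```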

A.2 Proof of the Thm. 3.2


Proof. For any time step h, let us restate Q̃, the value of committing to a_h at state x_h and then following the policy induced by θ afterward. At a time step h, state x_h, and θ′ ∈ Θ, consider a_h = µ_θ′(x_h),

    Q̃_θ(a_h, x_h) = ∫_{a_h} Q_θ(a_h, x_h) µ_θ′(a_h|x_h) da_h = ∫_{y_h, a_h} Q_θ(a_h, x_h) O(y_h|x_h) π_θ′(a_h|y_h) dy_h da_h.

Given this definition, we have that the value function is V_θ(x_1) = Q̃_θ(µ_θ(x_1), x_1), and using the Bellman equation we have

    Q̃_θ(a_h, x_h) = r(a_h, x_h) + γ ∫_{x_{h+1}} T(x_{h+1}|a_h, x_h) V_θ(x_{h+1}) dx_{h+1}

for a_h in the set covered by µ_θ(x_h) over all θ.

Using this definition, we compute the gradient of Q̃_θ(µ_θ(x_h), x_h) with respect to the parameters θ,

    ∇_θ Q̃_θ(µ_θ(x_h), x_h) = ∇_θ µ_θ(x_h)^⊤ ∇_{a_h} r(a_h, x_h)|_{a_h=µ_θ(x_h)}
        + γ ∫_{x_{h+1}} ∇_θ µ_θ(x_h)^⊤ ∇_{a_h} T(x_{h+1}|a_h, x_h)|_{a_h=µ_θ(x_h)} Q̃_θ(µ_θ(x_{h+1}), x_{h+1}) dx_{h+1}
        + γ ∫_{x_{h+1}} T(x_{h+1}|µ_θ(x_h), x_h) ∇_θ Q̃_θ(µ_θ(x_{h+1}), x_{h+1}) dx_{h+1}
    = ∇_θ µ_θ(x_h)^⊤ ∇_{a_h} ( r(a_h, x_h) + γ ∫_{x_{h+1}} T(x_{h+1}|a_h, x_h) Q̃_θ(µ_θ(x_{h+1}), x_{h+1}) dx_{h+1} )|_{a_h=µ_θ(x_h)}
        + γ ∫_{x_{h+1}} T(x_{h+1}|µ_θ(x_h), x_h) ∇_θ Q̃_θ(µ_θ(x_{h+1}), x_{h+1}) dx_{h+1}
    = ∇_θ µ_θ(x_h)^⊤ ∇_{a_h} Q̃_θ(a_h, x_h)|_{a_h=µ_θ(x_h)}
        + γ ∫_{x_{h+1}} T(x_{h+1}|µ_θ(x_h), x_h) ∇_θ Q̃_θ(µ_θ(x_{h+1}), x_{h+1}) dx_{h+1};

therefore, similar to the proof of Thm. 3.1, we have

    ∇_θ η(θ) = ∫_{x_1} P_1(x_1) ∇_θ V_θ(x_1) dx_1
             = ∫_{x_1} P_1(x_1) ∇_θ µ_θ(x_1)^⊤ ∇_{a_1} Q̃_θ(a_1, x_1)|_{a_1=µ_θ(x_1)} dx_1
               + γ ∫_{x_1, x_2} P_1(x_1) T(x_2|µ_θ(x_1), x_1) ∇_θ Q̃_θ(µ_θ(x_2), x_2) dx_1 dx_2
             = ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h; θ) ∇_θ µ_θ(x_h)^⊤ ∇_{a_h} Q̃_θ(a_h, x_h)|_{a_h=µ_θ(x_h)} dτ,

which is the statement of the theorem.

A.3 Proof of the Lem. 3.1


In the following we use the fact that, for every time step and every tuple of observations and action, E_π[r_h | y_{h+1}, a_h, y_h] = E[r_h | a_h, y_h]. We utilize this equality and restate the relationship between V and Q in a POMDP as follows:

    Q_π(y_{h+1}, a_h, y_h) := E_π[r_h | y_{h+1}, a_h, y_h] + γ V_π(y_{h+1}, a_h, y_h).        (22)

Given the definition of the advantage function, we derive

    E_{τ∼π_θ′} Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1})
    = E_{τ∼π_θ′} Σ_{h=1}^{|τ|} γ^{h−1} [ Q_{π_θ}(y_{h+1}, a_h, y_h) − V_{π_θ}(y_h, a_{h−1}, y_{h−1}) ]
    = E_{τ∼π_θ′} Σ_{h=1}^{|τ|} γ^{h−1} [ E_{π_θ}[r_h | y_{h+1}, a_h, y_h] + γ V_{π_θ}(y_{h+1}, a_h, y_h) − V_{π_θ}(y_h, a_{h−1}, y_{h−1}) ]
    = E_{τ∼π_θ′} Σ_{h=1}^{|τ|} γ^{h−1} E[r_h | a_h, y_h] + E_{τ∼π_θ′} Σ_{h=1}^{|τ|} γ^{h−1} [ γ V_{π_θ}(y_{h+1}, a_h, y_h) − V_{π_θ}(y_h, a_{h−1}, y_{h−1}) ]
    = E_{τ∼π_θ′} Σ_{h=1}^{|τ|} γ^{h−1} E[r_h | a_h, y_h] − E[ V_{π_θ}(y_0, −, −) ]
    = E_{τ∼π_θ′} Σ_{h=1}^{|τ|} γ^{h−1} E[r_h | a_h, y_h] − η(π_θ)
    = η(π_θ′) − η(π_θ),

which is the statement of the Lemma.

A.4 Proof of the Thm. 3.3
As stated in Eq. 9, for all time steps h, observation tuples y_{h−1}, y_h, y_{h+1}, a_{h−1}, and all θ, θ′, we have

    A_{π_θ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})|_{ǎ_h=π_θ′(y_h)} = ∫_{a_h} A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) π_θ′(a_h|y_h) da_h.

Using the definition of Ǎ_{π_θ} and the surrogate objective in Eq. 7, we state the following:

    L_{π_θ}(π_θ′) = η(π_θ) + E_{τ∼π_θ} Σ_{h=1}^{|τ|} γ^{h−1} E_{a′_h∼π_θ′(a_h|y_h)} A_{π_θ}(y_{h+1}, a′_h, y_h, a_{h−1}, y_{h−1})
                  = η(π_θ) + E_{τ∼π_θ} Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})|_{ǎ_h=π_θ′(y_h)}
                  = η(π_θ) + ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) A_{π_θ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})|_{ǎ_h=π_θ′(y_h)} dτ.

By taking the gradient of the surrogate objective with respect to θ′, we have

    ∇_{θ′} L_{π_θ}(π_θ′) = ∇_{θ′} ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) A_{π_θ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})|_{ǎ_h=π_θ′(y_h)} dτ
                         = ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_h, y_h; θ) ∇_{θ′} π_θ′(y_h)^⊤ ∇_{ǎ_h} A_{π_θ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})|_{ǎ_h=π_θ′(y_h)} dτ,

which results in the statement of the theorem by evaluating ∇_{θ′} L_{π_θ}(π_θ′) at θ′ = θ.

A.5 Proof of the Thm. 3.4


For a parameter θ, its corresponding ̟(θ), and an α, consider the surrogate objective:

    L_{π_θ}(π_{θ+α̟(θ)}) = η(π_θ) + ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) A_{π_θ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})|_{ǎ_h=π_{θ+α̟(θ)}(y_h)} dτ.

Therefore, for the directional gradient we have:

    ∇_α L_{π_θ}(π_{θ+α̟(θ)}) = ∇_α ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) A_{π_θ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})|_{ǎ_h=π_{θ+α̟(θ)}(y_h)} dτ
                             = ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) ∇_α π_{θ+α̟(θ)}(y_h)^⊤ ∇_{ǎ_h} A_{π_θ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})|_{ǎ_h=π_{θ+α̟(θ)}(y_h)} dτ.

Now, evaluating it at α = 0, we have

    ∇_α L_{π_θ}(π_{θ+α̟(θ)})|_{α=0}
      = ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) ( ∇_α π_{θ+α̟(θ)}(y_h)|_{α=0} )^⊤ ∇_{ǎ_h} A_{π_θ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})|_{ǎ_h=π_θ(y_h)} dτ
    (1)= ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) ( π_{θ⋆(θ)}(y_h) − π_θ(y_h) )^⊤ ∇_{ǎ_h} A_{π_θ}(y_{h+1}, ǎ_h, y_h, a_{h−1}, y_{h−1})|_{ǎ_h=π_θ(y_h)} dτ
    (2)≥ ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) ( A_{π_θ}(y_{h+1}, π_{θ⋆(θ)}(y_h), y_h, a_{h−1}, y_{h−1}) − A_{π_θ}(y_{h+1}, π_θ(y_h), y_h, a_{h−1}, y_{h−1}) ) dτ
      = T A_{π_θ},

where the equality (1) follows from Asm. 3.2 and the inequality (2) follows from the concavity Asm. 3.1. As a consequence we have

    T A_{π_θ} ≤ ∇_α L_{π_θ}(π_{θ+αu})|_{α=0}.

A.6 Proof of the Thm. 3.5


The result in Lem. 3.1 indicates that

    η(π_θ′) − η(π_θ) = E_{τ∼π_θ′} Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1})
                     = ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ′) A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ.

In the following we use the definition of the distribution mismatch coefficient to derive the statement of the theorem. For any optimal parameter θ*, applying Lem. 3.1 we have

    η(π_{θ*}) − η(π_θ)
    = ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ*) π_{θ*}(a_h|y_h) T(x_{h+1}|x_h, a_h) O(y_{h+1}|x_{h+1}) A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ
    ≤ max_{θ′′∈Θ} ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ*) π_{θ′′}(a_h|y_h) T(x_{h+1}|x_h, a_h) O(y_{h+1}|x_{h+1}) A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ.        (23)

Let θ+, as defined in Eq. 10, denote a parameter in the set of parameters that achieve the maximum in Eq. 23, i.e.,

    θ+ ∈ arg max_{θ′′∈Θ} ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ*) π_{θ′′}(a_h|y_h) T(x_{h+1}|x_h, a_h) O(y_{h+1}|x_{h+1}) A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ.        (24)

Given the definition of π_{θ+,θ} in Eq. 11, we have

    η(π_{θ*}) − η(π_θ)
    ≤ ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ*) π_{θ+}(a_h|y_h) T(x_{h+1}|x_h, a_h) O(y_{h+1}|x_{h+1}) A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ
    ≤ ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} ( f(τ_{1..h−1}, x_h; θ*) / f(τ_{1..h−1}, x_h; θ) ) f(τ_{1..h−1}, x_h, y_h; θ) π_{θ+,θ}(a_h, y_h, x_h, a_{h−1}, y_{h−1})
          T(x_{h+1}|x_h, a_h) O(y_{h+1}|x_{h+1}) A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ
    ≤ ‖ f(τ; θ*)/f(τ; θ) ‖_∞ ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ) π_{θ+,θ}(a_h, y_h, x_h, a_{h−1}, y_{h−1})
          T(x_{h+1}|x_h, a_h) O(y_{h+1}|x_{h+1}) A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ,        (25)

where the last inequality step results from the fact that the integrand, i.e., the random variable on the right-hand side of Eq. 25, is positive. Using Lem. 3.1 again, we restate that

    ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ) π_θ(a_h|y_h) T(x_{h+1}|x_h, a_h) O(y_{h+1}|x_{h+1}) A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ
    = ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h}, x_{h+1}, y_{h+1}; θ) A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ
    = η(θ) − η(θ) = 0;

therefore, we can subtract this term from the right-hand side of Eq. 25 without changing the direction of the inequality. As a result of this subtraction we have

    η(π_{θ*}) − η(π_θ) ≤ ‖ f(τ; θ*)/f(τ; θ) ‖_∞ ∆η+(θ+, θ),        (26)

    where ∆η+(θ+, θ) := ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ) ( π_{θ+,θ}(a_h, y_h, x_h, a_{h−1}, y_{h−1}) − π_θ(a_h|y_h) )
          T(x_{h+1}|x_h, a_h) O(y_{h+1}|x_{h+1}) A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ,

as noted in Eq. 12. Let us restate that in the case of an MDP, when the policy class is rich enough (or the MDP is simple enough, e.g., a fairly small tabular MDP), ∆η+(θ+, θ) vanishes.

Using the definition of the Bellman policy error compatibility vector in Def. 3.5, we have:

    η(π_{θ*}) − η(π_θ)
    ≤ ‖ f(τ; θ*)/f(τ; θ) ‖_∞ ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_h, y_h; θ) ( π_{θ+,θ}(a_h, y_h, x_h, a_{h−1}, y_{h−1}) − π_θ(a_h|y_h) )
          T(x_{h+1}|x_h, a_h) O(y_{h+1}|x_{h+1}) A_{π_θ}(y_{h+1}, a_h, y_h, a_{h−1}, y_{h−1}) dτ
      − ‖ f(τ; θ*)/f(τ; θ) ‖_∞ ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_{h+1}, y_{h+1}; θ) ∫_{a′_h} (ω+)^⊤ ∇_θ π_θ(a′_h|y_h) A_{π_θ}(y_{h+1}, a′_h, y_h, a_{h−1}, y_{h−1}) da′_h dτ
      + ‖ f(τ; θ*)/f(τ; θ) ‖_∞ ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_{h+1}, y_{h+1}; θ) ∫_{a′_h} (ω+)^⊤ ∇_θ π_θ(a′_h|y_h) A_{π_θ}(y_{h+1}, a′_h, y_h, a_{h−1}, y_{h−1}) da′_h dτ
    ≤ ǫ^+_θ ‖ f(τ; θ*)/f(τ; θ) ‖_∞ + ‖ f(τ; θ*)/f(τ; θ) ‖_∞ ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_{h+1}, y_{h+1}; θ) ∫_{a′_h} (ω+)^⊤ ∇_θ π_θ(a′_h|y_h) A_{π_θ}(y_{h+1}, a′_h, y_h, a_{h−1}, y_{h−1}) da′_h dτ
    ≤ ǫ^+_θ ‖ f(τ; θ*)/f(τ; θ) ‖_∞ + ‖ f(τ; θ*)/f(τ; θ) ‖_∞ (ω+)^⊤ ∫_τ Σ_{h=1}^{|τ|} γ^{h−1} f(τ_{1..h−1}, x_{h+1}, y_{h+1}; θ) ∫_{a′_h} ∇_θ π_θ(a′_h|y_h) A_{π_θ}(y_{h+1}, a′_h, y_h, a_{h−1}, y_{h−1}) da′_h dτ,        (27)

where using the result in Eq. 8 concludes the proof.

A.7 Proof of Lemma 4.1


Proof. The proof is based on the following steps. Under the considerations of Lebesgue's dominated convergence theorem,

    ∇²_{θ′} KL(θ, θ′)|_{θ′=θ} := −∇²_{θ′} ∫_τ f(τ; θ) ( log(f(τ; θ′)) − log(f(τ; θ)) ) dτ |_{θ′=θ}
    = −∫_τ f(τ; θ) ∇²_{θ′} log(f(τ; θ′)) dτ |_{θ′=θ}
    = −∫_τ f(τ; θ) ∇_{θ′} ( (1/f(τ; θ′)) ∇_{θ′} f(τ; θ′) ) dτ |_{θ′=θ}
    = ∫_τ f(τ; θ) (1/f(τ; θ′)²) ∇_{θ′} f(τ; θ′) ∇_{θ′} f(τ; θ′)^⊤ dτ |_{θ′=θ} − ∫_τ f(τ; θ) (1/f(τ; θ′)) ∇²_{θ′} f(τ; θ′) dτ |_{θ′=θ}
    = ∫_τ f(τ; θ) (1/f(τ; θ′)²) ∇_{θ′} f(τ; θ′) ∇_{θ′} f(τ; θ′)^⊤ dτ |_{θ′=θ} − ∫_τ ∇²_{θ′} f(τ; θ′) dτ |_{θ′=θ}
    = F(θ),        (28)

which concludes the proof.
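As a quick sanity check of this identity, one can compare a finite-difference Hessian of the KL divergence with the closed-form Fisher information on a toy softmax (categorical) model; the snippet below is only an illustrative verification and is not part of the paper's algorithms.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def kl(theta, theta_prime):
    p, q = softmax(theta), softmax(theta_prime)
    return np.sum(p * (np.log(p) - np.log(q)))

def fisher(theta):
    # For a categorical softmax model, F(theta) = diag(p) - p p^T.
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

def kl_hessian_fd(theta, eps=1e-4):
    # Finite-difference Hessian of KL(theta, theta') in theta', evaluated at theta' = theta.
    d = len(theta)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (kl(theta, theta + e_i + e_j) - kl(theta, theta + e_i)
                       - kl(theta, theta + e_j) + kl(theta, theta)) / eps**2
    return H

theta = np.array([0.3, -1.0, 0.5])
# Should be close to zero, up to finite-difference error.
print(np.max(np.abs(kl_hessian_fd(theta) - fisher(theta))))
```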

A.8 Proof of the Thm. 4.1


Proof. Following the result in Lem. 3.1 we have

    η(π_θ′) = η(π_θ) + E_{τ∼π_θ′} Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, a_h, y_h, y_{h−1}, a_{h−1});

therefore, using the definition of the surrogate function, we have

    η(π_θ′) − L_{π_θ}(π_θ′) = ∫_τ f(τ; π_θ′) Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, a_h, y_h, y_{h−1}, a_{h−1}) dτ
                            − ∫_τ f(τ; π_θ) Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, π_θ′(y_h), y_h, y_{h−1}, a_{h−1}) dτ
    = ∫_τ f(τ; π_θ′) Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, a_h, y_h, y_{h−1}, a_{h−1}) dτ
      − ∫_τ f(τ; π_θ′) Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, π_θ′(y_h), y_h, y_{h−1}, a_{h−1}) dτ
      + ∫_τ f(τ; π_θ′) Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, π_θ′(y_h), y_h, y_{h−1}, a_{h−1}) dτ
      − ∫_τ f(τ; π_θ) Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, π_θ′(y_h), y_h, y_{h−1}, a_{h−1}) dτ.

Define the advantage gap ǫ, a notion of the gap between MDPs and POMDPs incorporating the effect of future observations in the advantage functions,

    ǫ := max_{(θ,θ′)∈Θ×Θ} | ∫_τ f(τ; π_θ′) Σ_{h=1}^{|τ|} γ^{h−1} ( A_{π_θ}(y_{h+1}, a_h, y_h, y_{h−1}, a_{h−1}) − A_{π_θ}(y_{h+1}, π_θ′(y_h), y_h, y_{h−1}, a_{h−1}) ) dτ |.        (29)

Note that the advantage gap ǫ is equal to zero in MDPs. Using the definition of ǫ, we have

    η(π_θ′) − L_{π_θ}(π_θ′) ≤ ǫ + | ∫_τ f(τ; π_θ′) Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, π_θ′(y_h), y_h, y_{h−1}, a_{h−1}) dτ
                                  − ∫_τ f(τ; π_θ) Σ_{h=1}^{|τ|} γ^{h−1} A_{π_θ}(y_{h+1}, π_θ′(y_h), y_h, y_{h−1}, a_{h−1}) dτ |.        (30)

Note that, when we adopt f(τ; θ) dτ := dP_θ and P_θ(Γ) = ∫_Γ f(τ; θ) dτ, we encode the absolute continuity of the measure P_θ, for all θ ∈ Θ, with respect to a positive measure, e.g., the Lebesgue measure. Using this consideration, we have

    η(π_θ′) − L_{π_θ}(π_θ′) ≤ ǫ + ǫ ∫_τ | f(τ; π_θ′) − f(τ; π_θ) | dτ
                            = ǫ + ǫ TV(τ ∼ f(·; π_θ′), τ ∼ f(·; π_θ)),

and deploying Pinsker's inequality we have the first statement of the theorem,

    |η(π_θ′) − L_{π_θ}(π_θ′)| ≤ ǫ + ǫ TV(τ ∼ f(·; π_θ′), τ ∼ f(·; π_θ))
                              ≤ ǫ + ǫ √((1/2) D_KL(τ ∼ f(·; π_θ′), τ ∼ f(·; π_θ))).

For the second statement of the theorem, let us restate the decomposition in Eq. 30 under the consideration of Fubini's theorem,

    η(π_θ′) − L_{π_θ}(π_θ′) ≤ ǫ + Σ_{h≥1} γ^{h−1} ∫_τ f(τ; π_θ′) A_{π_θ}(y_{h+1}, π_θ′(y_h), y_h, y_{h−1}, a_{h−1}) dτ
                                 − Σ_{h≥1} γ^{h−1} ∫_τ f(τ; π_θ) A_{π_θ}(y_{h+1}, π_θ′(y_h), y_h, y_{h−1}, a_{h−1}) dτ
    = ǫ + Σ_{h≥1} γ^{h−1} ∫_τ ( f(τ; π_θ′) − f(τ; π_θ) ) A_{π_θ}(y_{h+1}, π_θ′(y_h), y_h, y_{h−1}, a_{h−1}) dτ
    ≤ ǫ + Σ_{h≥1} γ^{h−1} ∫_τ | f(τ; π_θ′) − f(τ; π_θ) | | A_{π_θ}(y_{h+1}, π_θ′(y_h), y_h, y_{h−1}, a_{h−1}) | dτ.

Applying an argument similar to the one used in proving the first statement, we have

    η(π_θ′) − L_{π_θ}(π_θ′) ≤ ǫ + ǫ′ Σ_{h≥1} γ^{h−1} ∫_τ | f(τ; π_θ′) − f(τ; π_θ) | dτ
                            ≤ ǫ + ǫ′ Σ_{h≥1} γ^{h−1} TV(τ_{1..h} ∼ f(·; π_θ′), τ_{1..h} ∼ f(·; π_θ))
                            ≤ ǫ + ǫ′ Σ_{h≥1} γ^{h−1} √((1/2) D_KL(τ_{1..h} ∼ f(·; π_θ′), τ_{1..h} ∼ f(·; π_θ))).

Deploying the definition of D_γ(π_θ, π_θ′),

    η(π_θ′) − L_{π_θ}(π_θ′) ≤ ǫ + ǫ′ Σ_{h≥1} γ^{h−1} √((1/2) D_KL(τ_{1..h} ∼ f(·; π_θ′), τ_{1..h} ∼ f(·; π_θ))) = ǫ + ǫ′ √(D_γ(π_θ, π_θ′)),

and the second part of the theorem goes through.

References
Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear
quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages
1–26, 2011.

Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért
Weisz. Politex: Regret bounds for policy iteration using expert prediction. In International
Conference on Machine Learning, pages 3692–3702, 2019.

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 22–31.
JMLR. org, 2017.

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation
with policy gradient methods in markov decision processes. arXiv preprint arXiv:1908.00261,
2019.

V. M. Aleksandrov, V. I. Sysoyev, and V. V. Shemeneva. Stochastic optimization. Engineering Cybernetics, 5(11-16):229–256, 1968.

Shun-ichi Amari. Information geometry and its applications. Springer, 2016.

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-
residual minimization based fitted policy iteration and a single sample path. Machine Learning,
71(1):89–129, 2008.

Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Reinforcement learning of pomdps using spectral methods. arXiv preprint arXiv:1602.07764, 2016.

Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Experimental results: Reinforcement learning of pomdps using spectral methods. arXiv preprint arXiv:1705.02553, 2017.

Leemon C Baird III. Advantage updating. Technical report, WRIGHT LAB WRIGHT-
PATTERSON AFB OH, 1993.

Jonathan Baxter and Peter L. Bartlett. Reinforcement learning in pomdp’s via direct gradient
ascent. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML
’00, pages 41–48, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1-
55860-707-2. URL http://dl.acm.org/citation.cfm?id=645529.757773.

Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environ-
ment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:
253–279, 2013.

Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. arXiv
preprint arXiv:1906.01786, 2019.
Maryam Fazel, Rong Ge, Sham M Kakade, and Mehran Mesbahi. Global convergence of policy
gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039, 2018.
Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning
horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous
Agents and Multiagent Systems, pages 1181–1189, 2015.
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In
ICML, volume 2, pages 267–274, 2002.
Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems,
pages 1531–1538, 2002.
Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference
on machine learning, pages 282–293. Springer, 2006.
John M Lee. Riemannian manifolds: an introduction to curvature, volume 176. Springer Science &
Business Media, 2006.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,
David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971, 2015.
Zachary C Lipton, Kamyar Azizzadenesheli, Abhishek Kumar, Lihong Li, Jianfeng Gao, and
Li Deng. Combating reinforcement learning’s sisyphean curse with intrinsic fear. arXiv preprint
arXiv:1611.01211, 2016.
Michael L Littman. Memoryless policies: Theoretical limitations and practical results. In From
Animals to Animats 3: Proceedings of the third international conference on simulation of adaptive
behavior, volume 3, page 238. MIT Press, 1994.
Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar Mahadevan, and Marek Petrik. Finite-sample
analysis of proximal gradient td algorithms. In UAI, pages 504–513. Citeseer, 2015.
Omid Madani, Steve Hanks, and Anne Condon. On the undecidability of probabilistic planning and
infinite-horizon partially observable markov decision problems. In AAAI/IAAI, pages 541–548,
1999.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,
Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control
through deep reinforcement learning. Nature, 2015.
Guido Montufar, Keyan Ghazi-Zahedi, and Nihat Ay. Geometry and determinism of optimal station-
ary control in partially observable markov decision processes. arXiv preprint arXiv:1503.07206,
2015.
Christos H Papadimitriou and John N Tsitsiklis. The complexity of markov decision processes.
Mathematics of operations research, 12(3):441–450, 1987.

Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John
Wiley & Sons, 2014.

R. Y. Rubinstein. Some problems in monte carlo optimization. Ph.D. thesis, 1969.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region
policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller.
Deterministic policy gradient algorithms. 2014.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering
the game of go with deep neural networks and tree search. nature, 2016.

Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Learning without state-estimation in
partially observable markovian decision processes. In Machine Learning Proceedings 1994, pages
284–292. Elsevier, 1994.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient
methods for reinforcement learning with function approximation. In Advances in neural informa-
tion processing systems, pages 1057–1063, 2000.

Philip Thomas. Bias in natural actor-critic algorithms. In International conference on machine learning, pages 441–448, 2014.

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control.
In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages
5026–5033. IEEE, 2012.

Nikos Vlassis, Michael L Littman, and David Barber. On the computational complexity of stochastic
controller optimization in pomdps. ACM Transactions on Computation Theory (TOCT), 4(4):12,
2012.

Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global
optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.

John K Williams and Satinder P Singh. Experimental results on learning stochastic memoryless
policies for partially observable markov decision processes. In Advances in Neural Information
Processing Systems, pages 1073–1080, 1999.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

