Abstract
In offline reinforcement learning (RL), the absence of active exploration calls attention to model
robustness to tackle the sim-to-real gap, where the discrepancy between the simulated and deployed
environments can significantly undermine the performance of the learned policy. To endow the learned
policy with robustness in a sample-efficient manner in the presence of high-dimensional state-action spaces,
this paper studies the sample complexity of distributionally robust linear Markov decision processes
(MDPs) with an uncertainty set characterized by the total variation distance, using offline data. We
develop a pessimistic model-based algorithm and establish its sample complexity bound under minimal
data coverage assumptions, which outperforms prior art by at least $\widetilde{O}(d)$, where $d$ is the feature dimension.
We further improve the performance guarantee of the proposed algorithm by incorporating a carefully-
designed variance estimator.
Contents
1 Introduction
  1.1 Main contribution
  1.2 Related works
2 Problem Setup
5 Conclusion
A Technical Lemmas
  A.1 Proof of Lemma 1
  A.2 Preliminary facts
∗ Department of Electrical and Computer Engineering, Carnegie Mellon University, PA 15213, USA.
† Department of Computing and Mathematical Sciences, California Institute of Technology, CA 91125, USA.
B Analysis for DROP: Algorithm 1
  B.1 Proof of equation (10)
  B.2 Two-fold subsampling method
  B.3 Proof of Theorem 1
  B.4 Proof of Corollary 1
  B.5 Proof of Corollary 2
1 Introduction
In reinforcement learning (RL), agents aim to learn an optimal policy that maximizes the expected total
rewards, by actively interacting with an unknown environment. However, online data collection may be pro-
hibitively expensive or potentially risky in many real-world applications, e.g., autonomous driving (Gu et al.,
2022), healthcare (Yu et al., 2021), and wireless security (Uprety and Rawat, 2020). This motivates the
study of offline RL, which leverages existing historical data (aka batch data) collected in the past to improve
policy learning, and has attracted growing attention (Levine et al., 2020). Nonetheless, the performance
of the learned policy invoking standard offline RL techniques could drop dramatically when the deployed
environment deviates, even slightly, from the one that generated the historical data, necessitating the
development of robust RL algorithms that are resilient against environmental uncertainty.
In response, recent years have witnessed a surge of interest in distributionally robust offline RL (Zhou et al.,
2021b; Yang et al., 2022; Shi and Chi, 2022; Blanchet et al., 2023). In particular, given only historical data
from a nominal environment, distributionally robust offline RL aims to learn a policy that optimizes the
worst-case performance when the environment falls into some prescribed uncertainty set around the nominal
one. Such a framework ensures that the performance of the learned policy does not fail drastically, provided
that the distribution shift between the nominal and deployment environments is not excessively large.
Nevertheless, most existing provable algorithms in distributionally robust offline RL only focus on the
tabular setting with finite state and action spaces (Zhou et al., 2021b; Yang et al., 2022; Shi and Chi, 2022),
where the sample complexity scales linearly with the size of the state-action space, which is prohibitive when
the problem is high-dimensional. To expand the reach of distributionally robust offline RL, a few latest
works (Ma et al., 2022; Blanchet et al., 2023) attempt to develop sample-efficient solutions by leveraging
linear function approximation (Bertsekas, 2012), which is widely used in both theoretical (Jin et al., 2020, 2021;
Xiong et al., 2023) and practical (Prashanth and Bhatnagar, 2010; Bellemare et al., 2019) developments of
standard RL. However, existing sample complexity guarantees are still far from satisfactory and notably larger
than their counterparts in standard offline RL (Jin et al., 2021; Xiong et al., 2023) with linear function
approximation, especially in terms of the dependency on the feature dimension $d$. Therefore, it is natural to
ask:
ask:
Can we design a provably sample-efficient algorithm for distributionally robust offline RL with
linear representations?
• We propose a distributionally robust variant of pessimistic least-squares value iteration, referred to as
DROP, which incorporates linear representations of the MDP model and devises a data-driven penalty
function to account for data scarcity in the offline setting. We also establish the sub-optimality bound
for DROP under the minimal offline data assumption (cf. Theorem 1).
• We introduce a clipped single-policy concentrability coefficient $C^{\star}_{\mathrm{rob}} \ge 1$ tailored to characterize the
partial feature coverage of the offline data in Lin-RMDPs, and demonstrate that DROP attains $\epsilon$-accuracy
(for learning the robust optimal policy) as soon as the number of sample trajectories exceeds $\widetilde{O}(C^{\star}_{\mathrm{rob}} d^2 H^4/\epsilon^2)$ (cf.
Corollary 1). Compared with the prior art (Blanchet et al., 2023), DROP improves the sample complexity
by at least $\widetilde{O}(d)$.
• We further develop a variance-weighted variant of DROP by integrating a carefully designed variance
estimator, dubbed by DROP-V. Due to tighter control of the variance, DROP-V exhibits an improved
sub-optimality gap under the full feature coverage assumption (see Section 4).
See Table 1 for a summary of our results in terms of the sub-optimality gap.
Table 1: Our results and comparisons with prior art (up to log factors) in terms of sub-optimality gap,
for learning robust linear MDPs with an uncertainty set measured by the TV distance. Here, $d$ is the
feature dimension, $H$ is the horizon length, $K$ is the number of trajectories, and $C_1^{\star}, C_{\mathrm{rob}}^{\star} \in [1,\infty)$ are
concentrability coefficients (defined in Section 3.3.1) satisfying $C_{\mathrm{rob}}^{\star} \le d\, C_1^{\star}$. In addition, $\mathcal{P}^{\rho}(P^0)$ denotes
the uncertainty set around the nominal kernel $P^0$, $\phi_i(s,a)$ is the $i$-th coordinate of the feature vector given
any state-action pair $(s,a) \in \mathcal{S}\times\mathcal{A}$, and $\Lambda_h$ and $\Sigma_h^{\star}$ are the cumulative sample covariance matrix and the
variance-weighted cumulative sample covariance matrix, satisfying $H^2 \Lambda_h^{-1} \succeq (\Sigma_h^{\star})^{-1}$.
Finite-sample guarantees for linear MDPs. Considering linear function approximation in RL, a
significant body of works study linear MDPs with linear transitions and rewards. Focusing on offline
settings, Jin et al. (2021) proposed Pessimistic Value Iteration (PEVI) for offline RL with finite-horizon
linear MDPs, which incorporated linear function approximation together with the principle of pessimism
(Rashidinejad et al., 2021; Li et al., 2024; Shi et al., 2022). However, the temporal statistical dependency
arising from backward updates leads to an $O(\sqrt{d})$ amplification in the sub-optimality gap compared
with the lower bound (Zanette et al., 2021). Subsequently, Min et al. (2021) and Yin et al. (2022) attempted
to address this gap. The near-optimal sample complexity in Xiong et al. (2023) is achieved by applying
variance estimation techniques, motivating us to explore variance information for robust offline RL. Beyond
these works, there are many offline RL works that investigate model-free algorithms or the infinite-horizon
setting that deviates from our focus (Zanette et al., 2021; Uehara and Sun, 2021; Xie et al., 2021). In addi-
tion to the offline setting, linear MDPs are also widely explored in other settings such as generative model
or online RL (Yang and Wang, 2020; Jin et al., 2020; Agarwal et al., 2023; He et al., 2023).
Distributionally robust RL with tabular MDPs. To promote robustness in the face of environmental
uncertainty, a line of research known as distributionally robust RL incorporates distributionally robust
optimization tools with RL to ensure robustness in a worst-case manner (Iyengar, 2005; Xu and Mannor,
2012; Wolff et al., 2012; Kaufman and Schaefer, 2013; Ho et al., 2018; Smirnova et al., 2019; Ho et al., 2021;
Goyal and Grand-Clement, 2022; Derman and Mannor, 2020; Tamar et al., 2014; Badrinath and Kalathil,
2021; Ding et al., 2024). Recently, a surge of works focus on understanding/improving sample complexity
and computation complexity of RL algorithms in the tabular setting (Dong et al., 2022; Wang and Zou,
2021; Yang et al., 2022; Panaganti and Kalathil, 2022; Zhou et al., 2021b; Xu et al., 2023; Wang et al., 2023a;
Blanchet et al., 2023; Liu et al., 2022; Wang et al., 2023c; Liang et al., 2023; Wang et al., 2023b; Li et al.,
2022b; Kumar et al., 2023; Li and Lan, 2023; Yang et al., 2023; Zhang et al., 2023). Among them, Zhou et al.
(2021b); Yang et al. (2022); Shi and Chi (2022); Blanchet et al. (2023) consider distributionally robust RL
in the offline setting, which are most related to ours. In contrast to the sample complexities achieved in the
tabular setting, which largely depend on the size of the state and action spaces, this work advances beyond
the tabular setting and develops a sample complexity guarantee that depends only on the feature dimension,
based on the linear MDP model assumption.
Distributionally robust RL with linear MDPs. The prior art (Blanchet et al., 2023) studies the
same robust linear MDP (Lin-RMDP) problem as this work, but its sample complexity dependency on
the feature dimension $d$ is still worse than that of standard linear MDPs in the offline setting (Jin et al., 2021;
Xiong et al., 2023), highlighting a gap for potential improvement. Besides the TV distance considered herein,
Ma et al. (2022); Blanchet et al. (2023) consider the Kullback-Leibler (KL) divergence for the uncertainty
set. We note that a concurrent work (Liu and Xu, 2024) explores Lin-RMDPs with an additional structure
assumption in the online setting, which diverges from our focus on the offline context. In addition to linear
representations for robust MDPs, Badrinath and Kalathil (2021); Ramesh et al. (2023) consider other
function classes for model approximation.
Notation. Throughout this paper, we define $\Delta(S)$ as the probability simplex over a set $S$, and $[H] :=
\{1,\ldots,H\}$, $[d] := \{1,\ldots,d\}$ for positive integers $H, d > 0$. We denote by $1_i$ the vector (of appropriate
dimension) whose $i$-th coordinate is 1 and whose other coordinates are 0, and by $I_d$ the identity matrix of
size $d$. For any vector $x \in \mathbb{R}^d$, we use $\|x\|_2$ and $\|x\|_1$ to represent its $\ell_2$ and $\ell_1$ norms, respectively. In
addition, given any vector $x \in \mathbb{R}^d$ and any positive semi-definite matrix $A \in \mathbb{R}^{d\times d}$, we denote $\sqrt{x^\top A x}$
by $\|x\|_A$. For any set $\mathcal{D}$, we use $|\mathcal{D}|$ to represent the cardinality (i.e., the size) of $\mathcal{D}$. Additionally, we use
$\min\{a,b\}^+$ to denote the minimum of $a$ and $b$ when $a, b > 0$, and 0 otherwise. We also let $\lambda_{\min}(A)$ denote
the smallest eigenvalue of any matrix $A$.
2 Problem Setup
In this section, we introduce the formulation of finite-horizon distributionally robust linear MDPs (Lin-
RMDPs), together with the batch data assumptions and the learning objective.
Standard linear MDPs. Consider a finite-horizon standard linear MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, H, P = \{P_h\}_{h=1}^H, r = \{r_h\}_{h=1}^H)$,
where $\mathcal{S}$ and $\mathcal{A}$ denote the state space and action space respectively, and $H$ is the horizon length.
At each step $h \in [H]$, we denote by $P_h : \mathcal{S}\times\mathcal{A} \to \Delta(\mathcal{S})$ the transition kernel and by $r_h : \mathcal{S}\times\mathcal{A} \to [0,1]$
the deterministic reward function, which satisfy the following assumption used in Yang and Wang (2019);
Jin et al. (2020).
Assumption 1 (Linear MDPs). A finite-horizon MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, H, P, r)$ is called a linear MDP if, given
a known feature map $\phi : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$, there exist $d$ unknown measures $\mu_h^P = (\mu_{h,1}^P, \cdots, \mu_{h,d}^P)$ over the state
space $\mathcal{S}$ and an unknown vector $\theta_h \in \mathbb{R}^d$ at each step $h$ such that
$$P_h(s' \mid s, a) = \phi(s,a)^\top \mu_h^P(s'), \qquad r_h(s,a) = \phi(s,a)^\top \theta_h, \qquad \forall (h,s,a,s') \in [H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{S}. \tag{1}$$
Without loss of generality, we assume $\|\phi(s,a)\|_2 \le 1$ and $\phi_i(s,a) \ge 0$ for any $(s,a,i) \in \mathcal{S}\times\mathcal{A}\times[d]$, and
$\max\big\{\int_{\mathcal{S}} \|\mu_h^P(s)\|_2\, ds,\ \|\theta_h\|_2\big\} \le \sqrt{d}$ for all $h \in [H]$.
In addition, we denote by $\pi = \{\pi_h\}_{h=1}^H$ the policy of the agent, where $\pi_h : \mathcal{S} \to \Delta(\mathcal{A})$ is the action
selection probability over the action space $\mathcal{A}$ at time step $h$. Given the policy $\pi$ and the transition kernel $P$,
the value function $V_h^{\pi,P}$ and the Q-function $Q_h^{\pi,P}$ at the $h$-th step are defined as: for any $(s,a) \in \mathcal{S}\times\mathcal{A}$,
$$V_h^{\pi,P}(s) = \mathbb{E}_{\pi,P}\Big[\sum_{t=h}^{H} r_t(s_t,a_t) \,\Big|\, s_h = s\Big], \qquad Q_h^{\pi,P}(s,a) = r_h(s,a) + \mathbb{E}_{\pi,P}\Big[\sum_{t=h+1}^{H} r_t(s_t,a_t) \,\Big|\, s_h = s, a_h = a\Big],$$
where the expectation is taken over the randomness of the trajectory induced by the policy $\pi$ and the
transition kernel $P$.
Lin-RMDPs: distributionally robust linear MDPs. In this work, we consider distributionally robust
linear MDPs (Lin-RMDPs), where the transition kernel can be an arbitrary one within an uncertainty set
(an ensemble of probability transition kernels) around the nominal kernel, rather than a fixed transition
kernel as in standard linear MDPs (Jin et al., 2021; Yin et al., 2022; Xiong et al., 2023). Formally, we denote it by
$\mathcal{M}_{\mathrm{rob}} = (\mathcal{S}, \mathcal{A}, H, \mathcal{P}^{\rho}(P^0), r)$, where $P^0$ represents a nominal transition kernel and $\mathcal{P}^{\rho}(P^0)$ represents
the uncertainty set (a ball) around the nominal $P^0$ with some uncertainty level $\rho \ge 0$. For notational
simplicity, we denote $\mu_h^0 := \mu_h^{P^0}$, and according to (1), we let
$$\forall (h,s,a,s') \in [H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{S}: \quad P^0_{h,s,a}(s') := P_h^0(s' \mid s, a) = \phi(s,a)^\top \mu_h^0(s').$$
In this work, we consider the total variation (TV) distance (Tsybakov, 2008) as the divergence metric for the
uncertainty set $\mathcal{P}^{\rho}$, and we assume that Lin-RMDPs satisfy the following $d$-rectangularity assumption (Ma et al.,
2022; Blanchet et al., 2023).
Assumption 2 (Lin-RMDPs). In Lin-RMDPs, the uncertainty set $\mathcal{P}^{\rho}(P^0)$ is $d$-rectangular, i.e., $\mu^0_{h,i} \in \Delta(\mathcal{S})$
for any $(h,i) \in [H]\times[d]$, and
$$\mathcal{P}^{\rho}(P^0) := \otimes_{[H],\mathcal{S},\mathcal{A}}\, \mathcal{P}^{\rho}(P^0_{h,s,a}), \quad \text{with } \mathcal{P}^{\rho}(P^0_{h,s,a}) := \big\{\phi(s,a)^\top \mu_h(\cdot) : \mu_h \in \mathcal{U}^{\rho}(\mu_h^0)\big\},$$
where $\mathcal{U}^{\rho}(\mu^0_h) := \otimes_{[d]}\, \mathcal{U}^{\rho}(\mu^0_{h,i})$, with $\mathcal{U}^{\rho}(\mu^0_{h,i}) := \big\{\mu_{h,i} : D_{\mathrm{TV}}\big(\mu_{h,i} \,\|\, \mu^0_{h,i}\big) \le \rho \text{ and } \mu_{h,i} \in \Delta(\mathcal{S})\big\}$.
Here, $D_{\mathrm{TV}}(\mu_{h,i} \,\|\, \mu^0_{h,i})$ represents the TV distance between two measures over the state space, i.e.,
$$D_{\mathrm{TV}}\big(\mu_{h,i} \,\|\, \mu^0_{h,i}\big) = \frac{1}{2}\big\|\mu_{h,i} - \mu^0_{h,i}\big\|_1,$$
and $\otimes_{[d]}$ (resp. $\otimes_{[H],\mathcal{S},\mathcal{A}}$) denotes the Cartesian product over $[d]$ (resp. $[H]$, $\mathcal{S}$, and $\mathcal{A}$).
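As a concrete illustration of Assumption 2, the sketch below checks whether a candidate factor measure $\mu_{h,i}$ belongs to $\mathcal{U}^{\rho}(\mu^0_{h,i})$; the discrete (vector) representation of the measures is an assumption made purely for illustration:

```python
import numpy as np

def tv_distance(mu, mu0):
    """D_TV(mu || mu0) = (1/2) * ||mu - mu0||_1 for discrete measures."""
    return 0.5 * np.abs(np.asarray(mu, float) - np.asarray(mu0, float)).sum()

def in_uncertainty_set(mu, mu0, rho, tol=1e-12):
    """Membership test for U^rho(mu0): mu must be a valid probability
    vector lying within TV radius rho of the nominal factor mu0."""
    mu = np.asarray(mu, dtype=float)
    on_simplex = bool(np.all(mu >= -tol) and abs(mu.sum() - 1.0) <= tol)
    return on_simplex and tv_distance(mu, mu0) <= rho + tol
```

Under $d$-rectangularity, an adversarial kernel is assembled by choosing one such $\mu_{h,i}$ per feature coordinate independently.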
Assumption 2 indicates that the uncertainty set can be decoupled along each feature dimension $i \in [d]$
independently, hence the name $d$-rectangularity. Note that by letting $d = SA$ and $\phi_i(s,a) = 1_i$ for any $(i,s,a) \in
[d]\times\mathcal{S}\times\mathcal{A}$, the Lin-RMDP is instantiated to the tabular RMDP, and $d$-rectangularity becomes the $(s,a)$-
rectangularity commonly used in the prior literature (Yang et al., 2022; Shi et al., 2023). Moreover, when the
uncertainty level $\rho = 0$, Lin-RMDPs reduce to a subclass of standard linear MDPs satisfying Assumption 1.
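To make the tabular reduction concrete, the following sketch (with hypothetical integer encodings of states and actions, our own assumption) builds the one-hot feature map $\phi_i(s,a) = 1_i$ with $d = SA$:

```python
import numpy as np

def one_hot_phi(S, A):
    """One-hot feature map with d = S*A: coordinate i = s*A + a indexes the
    state-action pair (s, a), so the d-rectangular uncertainty set perturbs
    each P(. | s, a) independently, recovering (s,a)-rectangularity."""
    def phi(s, a):
        v = np.zeros(S * A)
        v[s * A + a] = 1.0
        return v
    return phi
```

Each feature vector has nonnegative coordinates and unit norm, consistent with the normalization in Assumption 1.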
Robust value function and robust Bellman equations. Considering Lin-RMDPs with any given
policy $\pi$, we define the robust value function (resp. robust Q-function) to characterize the worst-case performance
induced by all possible transition kernels over the uncertainty set, denoted as
$$V_h^{\pi,\rho}(s) := \inf_{P \in \mathcal{P}^{\rho}(P^0)} V_h^{\pi,P}(s), \qquad Q_h^{\pi,\rho}(s,a) := \inf_{P \in \mathcal{P}^{\rho}(P^0)} Q_h^{\pi,P}(s,a).$$
They satisfy the following robust Bellman consistency equations:
and the robust linear Bellman operator and transition operator for any function $f : \mathcal{S} \to \mathbb{R}$ are defined by
Note that under Assumption 2, the robust Bellman operator inherits the linearity of the Bellman operator
in standard linear MDPs (Jin et al., 2020, Proposition 2.3), as shown in the following lemma. The proof is
postponed to Appendix A.1.
Lemma 1 (Linearity of robust Bellman operators). Suppose that the finite-horizon Lin-RMDP satisfies
Assumptions 1 and 2. There exist weights $w^{\rho} = \{w_h^{\rho}\}_{h=1}^H$, where $w_h^{\rho} := \theta_h + \inf_{\mu_h \in \mathcal{U}^{\rho}(\mu_h^0)} \int_{\mathcal{S}} \mu_h(s') f(s')\, ds'$ for
any $h \in [H]$, such that $\mathcal{B}_h^{\rho} f(s,a)$ is linear with respect to the feature map $\phi$, i.e., $\mathcal{B}_h^{\rho} f(s,a) = \phi(s,a)^\top w_h^{\rho}$.
In addition, conditioned on some initial state distribution $\zeta$, we define the induced occupancy distribution
w.r.t. any policy $\pi$ and transition kernel $P$ by
$$\forall (h,s,a) \in [H]\times\mathcal{S}\times\mathcal{A}: \quad d_h^{\pi,P}(s;\zeta) := \mathbb{P}(s_h = s \mid s_1 \sim \zeta; \pi, P), \quad d_h^{\pi,P}(s,a;\zeta) := d_h^{\pi,P}(s;\zeta)\,\pi_h(a \mid s). \tag{5}$$
To continue, we denote by $\pi^{\star} = \{\pi_h^{\star}\}_{h=1}^H$ a deterministic optimal robust policy (Iyengar, 2005). The
resulting optimal robust value function and optimal robust Q-function are defined as:
$$V_h^{\star,\rho}(s) := V_h^{\pi^{\star},\rho}(s) = \max_{\pi} V_h^{\pi,\rho}(s), \quad \forall (h,s) \in [H]\times\mathcal{S},$$
$$Q_h^{\star,\rho}(s,a) := Q_h^{\pi^{\star},\rho}(s,a) = \max_{\pi} Q_h^{\pi,\rho}(s,a), \quad \forall (h,s,a) \in [H]\times\mathcal{S}\times\mathcal{A},$$
which satisfy the robust Bellman optimality equation
$$Q_h^{\star,\rho}(s,a) = [\mathcal{B}_h^{\rho} V_{h+1}^{\star,\rho}](s,a), \quad \forall (h,s,a) \in [H]\times\mathcal{S}\times\mathcal{A}. \tag{6}$$
Similar to (5), we also denote the occupancy distribution associated with the optimal robust policy $\pi^{\star}$ and
some transition kernel $P$ by
$$\forall (h,s,a) \in [H]\times\mathcal{S}\times\mathcal{A}: \quad d_h^{\star,P}(s) := d_h^{\pi^{\star},P}(s;\zeta), \qquad d_h^{\star,P}(s,a) := d_h^{\pi^{\star},P}(s)\,\pi_h^{\star}(a \mid s). \tag{7}$$
Offline data. Suppose that we can access a batch dataset $\mathcal{D} = \{(s_h^{\tau}, a_h^{\tau}, r_h^{\tau}, s_{h+1}^{\tau})\}_{h\in[H],\tau\in[K]}$, which consists
of $K$ i.i.d. trajectories generated by executing some (mixed) behavior policy $\pi^b = \{\pi_h^b\}_{h=1}^H$ over
some nominal linear MDP $\mathcal{M}^0 = (\mathcal{S}, \mathcal{A}, H, P^0, r)$. Note that $\mathcal{D}$ contains $KH$ transition-reward sample tuples
in total, where each sample tuple $(s_h^{\tau}, a_h^{\tau}, r_h^{\tau}, s_{h+1}^{\tau})$ indicates that the agent took the action $a_h^{\tau}$ at the state
$s_h^{\tau}$, received the reward $r_h^{\tau} = r_h(s_h^{\tau}, a_h^{\tau})$, and then observed the next state $s_{h+1}^{\tau} \sim P_h^0(\cdot \mid s_h = s_h^{\tau}, a_h = a_h^{\tau})$.
To proceed, we define
$$\mathcal{D}_h^0 = \{(s_h^{\tau}, a_h^{\tau}, r_h^{\tau}, s_{h+1}^{\tau})\}_{\tau\in[K]},$$
which contains all samples at the $h$-th step in $\mathcal{D}$. For simplicity, we abuse the notation $\tau \in \mathcal{D}_h^0$ to denote
$(s_h^{\tau}, a_h^{\tau}, r_h^{\tau}, s_{h+1}^{\tau}) \in \mathcal{D}_h^0$. In addition, we define the induced occupancy distribution w.r.t. the behavior policy
$\pi^b$ and the nominal transition kernel $P^0$ at each step $h$ by
$$\forall (h,s,a) \in [H]\times\mathcal{S}\times\mathcal{A}: \quad d_h^b(s) := d_h^{\pi^b,P^0}(s;\zeta) \quad \text{and} \quad d_h^b(s,a) := d_h^{\pi^b,P^0}(s,a;\zeta). \tag{8}$$
Learning goal. Given the batch dataset $\mathcal{D}$, the goal of solving the Lin-RMDP $\mathcal{M}_{\mathrm{rob}}$ with a given initial
state distribution $\zeta$ is to learn an $\epsilon$-optimal robust policy $\widehat{\pi}$ such that the sub-optimality gap satisfies
$$\mathrm{SubOpt}(\widehat{\pi}; \zeta, \mathcal{P}^{\rho}) := V_1^{\star,\rho}(\zeta) - V_1^{\widehat{\pi},\rho}(\zeta) \le \epsilon, \tag{9}$$
using as few samples as possible, where $\epsilon$ is the targeted accuracy level and
$$V_1^{\star,\rho}(\zeta) = \mathbb{E}_{s_1\sim\zeta}\big[V_1^{\star,\rho}(s_1)\big], \qquad V_1^{\widehat{\pi},\rho}(\zeta) = \mathbb{E}_{s_1\sim\zeta}\big[V_1^{\widehat{\pi},\rho}(s_1)\big].$$
where $\lambda_0 > 0$ is the regularization coefficient and $\bar{\nu}_{h,i}^V(\alpha)$ is the $i$-th coordinate of $\bar{\nu}_h^V(\alpha)$, defined by
$$\bar{\nu}_h^V(\alpha) = \arg\min_{\nu\in\mathbb{R}^d} \sum_{\tau\in\mathcal{D}_h^0} \big(\phi(s_h^{\tau}, a_h^{\tau})^\top \nu - [V]_{\alpha}(s_{h+1}^{\tau})\big)^2 + \lambda_0 \|\nu\|_2^2 = \Lambda_h^{-1} \sum_{\tau\in\mathcal{D}_h^0} \phi(s_h^{\tau}, a_h^{\tau})\, [V]_{\alpha}(s_{h+1}^{\tau}). \tag{14}$$
Then, following the linearity of the robust Bellman operator shown in Lemma 1, we construct the empirical
robust Bellman operator $\widehat{\mathcal{B}}_h^{\rho}$ to approximate $\mathcal{B}_h^{\rho}$, using the estimators obtained from (12) and (13): for any
value function $V : \mathcal{S} \to [0,H]$,
$$(\widehat{\mathcal{B}}_h^{\rho} V)(s,a) = \phi(s,a)^\top \big(\widehat{\theta}_h + \widehat{\nu}_h^{\rho,V}\big), \quad \forall (s,a,h) \in \mathcal{S}\times\mathcal{A}\times[H]. \tag{16}$$
Algorithm 1 Distributionally Robust Pessimistic Least-Squares Value Iteration (DROP)
Input: dataset $\mathcal{D}$; feature map $\phi(s,a)$ for $(s,a) \in \mathcal{S}\times\mathcal{A}$; parameters $\lambda_0, \gamma_0 > 0$.
1: Construct a temporally independent dataset $\mathcal{D}^0 = \text{Two-fold-subsampling}(\mathcal{D})$ (Algorithm 2).
Initialization: set $\widehat{Q}_{H+1}(\cdot,\cdot) = 0$ and $\widehat{V}_{H+1}(\cdot) = 0$.
2: for step $h = H, H-1, \cdots, 1$ do
3: $\quad \Lambda_h = \sum_{\tau\in\mathcal{D}_h^0} \phi(s_h^{\tau}, a_h^{\tau})\phi(s_h^{\tau}, a_h^{\tau})^\top + \lambda_0 I_d$.
4: $\quad \widehat{\theta}_h = \Lambda_h^{-1} \sum_{\tau\in\mathcal{D}_h^0} \phi(s_h^{\tau}, a_h^{\tau})\, r_h^{\tau}$.
5: $\quad$ for feature $i = 1, \cdots, d$ do
6: $\qquad$ Update $\widehat{\nu}_{h,i}^{\rho,\widehat{V}}$ via (13).
7: $\quad$ end for
8: $\quad \widehat{w}_h^{\rho,\widehat{V}} = \widehat{\theta}_h + \widehat{\nu}_h^{\rho,\widehat{V}}$.
9: $\quad \bar{Q}_h(\cdot,\cdot) = \phi(\cdot,\cdot)^\top \widehat{w}_h^{\rho,\widehat{V}} - \gamma_0 \sum_{i=1}^d \|\phi_i(\cdot,\cdot)\, 1_i\|_{\Lambda_h^{-1}}$.
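To make the backward loop concrete, here is a minimal numpy sketch of Algorithm 1 for the special case $\rho = 0$, where (as noted in Section 2) the Lin-RMDP reduces to a standard linear MDP and the robust update of lines 5-7 degenerates to plain ridge regression on the next-step value targets. The data layout, the finite action set, and the zero default for unseen states are illustrative assumptions, not part of the paper:

```python
import numpy as np

def drop_backward_pass(data, phi, actions, d, H, lam0=1.0, gamma0=1.0):
    """Sketch of the backward loop of Algorithm 1 for rho = 0, where the
    robust estimate of lines 5-7 reduces to ridge regression on V_{h+1}.

    data[h]  : list of (s, a, r, s_next) transitions collected at step h
    phi(s,a) : feature map into R^d with phi_i >= 0 and ||phi||_2 <= 1
    actions  : finite action set used for the greedy (pessimistic) policy
    """
    V_next = {}                                  # \hat V_{h+1}; empty => 0
    Q_bars = [None] * H
    for h in range(H - 1, -1, -1):
        Phi = np.array([phi(s, a) for (s, a, _, _) in data[h]])    # K x d
        rew = np.array([r for (_, _, r, _) in data[h]])
        y = np.array([V_next.get(s2, 0.0) for (_, _, _, s2) in data[h]])
        Lam = Phi.T @ Phi + lam0 * np.eye(d)                       # line 3
        Lam_inv = np.linalg.inv(Lam)
        theta = Lam_inv @ Phi.T @ rew                              # line 4
        nu = Lam_inv @ Phi.T @ y             # placeholder for lines 5-7
        w = theta + nu                                             # line 8

        def Q_bar(s, a, w=w, Lam_inv=Lam_inv):
            f = phi(s, a)
            # line 9: penalty gamma0 * sum_i ||phi_i(s,a) 1_i||_{Lam_h^{-1}}
            pen = gamma0 * sum(abs(f[i]) * np.sqrt(Lam_inv[i, i])
                               for i in range(d))
            return f @ w - pen

        Q_bars[h] = Q_bar
        # pessimistic value, clipped to the valid range [0, H - h]
        V_next = {s: min(max(max(Q_bar(s, a) for a in actions), 0.0), H - h)
                  for s in {s for (s, _, _, _) in data[h]}}
    return Q_bars, V_next              # per-step Q functions and \hat V_1
```

The penalty uses $\|\phi_i(s,a) 1_i\|_{\Lambda_h^{-1}} = |\phi_i(s,a)|\sqrt{(\Lambda_h^{-1})_{ii}}$, so its cost is linear in $d$ once $\Lambda_h^{-1}$ is computed.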
Theorem 1. Consider any $\delta \in (0,1)$. Suppose that Assumptions 1 and 2 hold. By setting
$$\lambda_0 = 1, \qquad \gamma_0 = 6\sqrt{d\,\xi_0}\, H, \qquad \text{where } \xi_0 = \log(3HK/\delta),$$
the policy $\widehat{\pi}$ returned by Algorithm 1 satisfies, with high probability,
$$\mathrm{SubOpt}(\widehat{\pi}; \zeta, \mathcal{P}^{\rho}) \le \widetilde{O}\big(\sqrt{d}\, H\big) \sum_{h=1}^{H} \sum_{i=1}^{d} \max_{P\in\mathcal{P}^{\rho}(P^0)} \mathbb{E}_{\pi^{\star},P}\Big[\big\|\phi_i(s_h,a_h)\, 1_i\big\|_{\Lambda_h^{-1}}\Big]. \tag{17}$$
Since we do not impose any coverage assumption on the batch data, Theorem 1 demonstrates an “instance-dependent” sub-optimality gap, which is controlled by a general term (the right-hand side of (17)).
The sub-optimality gap largely depends on the quality of the batch data — specifically, how well it explores
the feature space within Rd . Building upon the above foundational result, we proceed to examine the sample
complexity required to achieve an ǫ-optimal policy, considering different data qualities in the subsequent
discussion.
for some finite quantity $C_{\mathrm{rob}}^{\star} \in [1,\infty)$. In addition, we follow the convention $0/0 = 0$.
Under Assumption 3, the following corollary provides the provable sample complexity of DROP to achieve
an ǫ-optimal robust policy. The proof is postponed to Appendix B.4.
Corollary 1 (Partial feature coverage). Suppose the assumptions of Theorem 1 and Assumption 3 hold, and
consider any $\delta \in (0,1)$. Let $d_{\min}^b = \min_{h,s,a}\{d_h^b(s,a) : d_h^b(s,a) > 0\}$. With probability exceeding $1-4\delta$, the
policy $\widehat{\pi}$ returned by Algorithm 1 achieves
$$\mathrm{SubOpt}(\widehat{\pi}; \zeta, \mathcal{P}^{\rho}) \le 96\sqrt{\frac{C_{\mathrm{rob}}^{\star}\, d^2 H^4 \log(3HK/\delta)}{K}}$$
as long as $K \ge c_0 \log(KH/\delta)/d_{\min}^b$ for some sufficiently large universal constant $c_0 > 0$. In other words, the
learned policy $\widehat{\pi}$ is $\epsilon$-optimal if the total number of sample trajectories satisfies
$$K \ge \widetilde{O}\left(\frac{C_{\mathrm{rob}}^{\star}\, d^2 H^4}{\epsilon^2}\right). \tag{19}$$
Notice that Corollary 1 implies that the sub-optimality bound of DROP is comparable to that of the prior art in
standard linear MDPs (Jin et al., 2021, Corollary 4.5), in terms of the feature dimension $d$ and the horizon
length $H$.
Comparisons to prior art for Lin-RMDPs. With high probability, the existing Assumption 6.2 proposed
in Blanchet et al. (2023) can be transferred into the following condition (see (66)-(68) in Appendix B.4):
$$\max_{(u,h,i,P)\in\mathbb{R}^d\times[H]\times[d]\times\mathcal{P}^{\rho}(P^0)} \frac{u^\top\, \mathbb{E}_{d_h^{\star,P}}\big[\phi_i^2(s,a)\cdot 1_{i,i}\big]\, u}{u^\top\, \mathbb{E}_{d_h^b}\big[\phi(s,a)\phi(s,a)^\top\big]\, u} \le C_1^{\star} \in [1,\infty). \tag{20}$$
Thanks to the clipping operation in (18), $C_{\mathrm{rob}}^{\star} \le d \cdot C_1^{\star}$ for any given batch dataset $\mathcal{D}$. Notice that both $C_{\mathrm{rob}}^{\star}$
and $C_1^{\star}$ are lower bounded by 1. The proposed algorithm in Blanchet et al. (2023) can achieve $\epsilon$-accuracy
provided that the total number of sample trajectories obeys
$$K \ge \widetilde{O}\left(\frac{C_1^{\star}\, d^4 H^4}{\epsilon^2}\right).$$
It shows that the sample complexity (19) of DROP improves upon the one in Blanchet et al. (2023) by at least $\widetilde{O}(d)$.
Constructing a variance estimator. First, we run Algorithm 1 on a sub-dataset $\widetilde{\mathcal{D}}^0 \subseteq \mathcal{D}$ to obtain
the estimated value functions $\{\widetilde{V}_h\}_{h=1}^{H+1}$. Then, with $\{\widetilde{V}_h\}_{h=1}^{H+1}$ in hand, we design the variance estimator
$\widehat{\sigma}_h^2 : \mathcal{S}\times\mathcal{A} \to [1, H^2]$ by
$$\widehat{\sigma}_h^2(s,a) = \max\Big\{\big[\phi(s,a)^\top \nu_{h,1}\big]_{[0,H^2]} - \big[\phi(s,a)^\top \nu_{h,2}\big]_{[0,H]}^2,\; 1\Big\}, \quad \forall (s,a,h) \in \mathcal{S}\times\mathcal{A}\times[H], \tag{22}$$
where $\nu_{h,1}$ and $\nu_{h,2}$ are obtained from ridge regression as follows:
$$\nu_{h,1} = \arg\min_{\nu\in\mathbb{R}^d} \sum_{\tau\in\widetilde{\mathcal{D}}_h^0} \Big(\phi(s_h^{\tau}, a_h^{\tau})^\top \nu - \big(\widetilde{V}_{h+1}(s_{h+1}^{\tau})\big)^2\Big)^2 + \|\nu\|_2^2, \tag{23}$$
$$\nu_{h,2} = \arg\min_{\nu\in\mathbb{R}^d} \sum_{\tau\in\widetilde{\mathcal{D}}_h^0} \Big(\phi(s_h^{\tau}, a_h^{\tau})^\top \nu - \widetilde{V}_{h+1}(s_{h+1}^{\tau})\Big)^2 + \|\nu\|_2^2. \tag{24}$$
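A compact sketch of (22)-(24) follows; stacking the features and next-step values into arrays is an assumption made for illustration:

```python
import numpy as np

def make_variance_estimator(Phi, V_next, H, lam=1.0):
    """Builds sigma_hat_h^2 from (22)-(24): two ridge regressions estimate
    E[V_{h+1}^2] and E[V_{h+1}], and their clipped difference, floored at 1,
    serves as the conditional-variance proxy.

    Phi    : K x d matrix of features phi(s_h^tau, a_h^tau)
    V_next : length-K vector of tilde-V_{h+1}(s_{h+1}^tau)
    """
    K, d = Phi.shape
    G = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d))
    nu1 = G @ Phi.T @ (V_next ** 2)     # second-moment regression, eq. (23)
    nu2 = G @ Phi.T @ V_next            # first-moment regression,  eq. (24)

    def sigma2(feature):
        first = np.clip(feature @ nu1, 0.0, H ** 2)    # [.]_{[0, H^2]}
        second = np.clip(feature @ nu2, 0.0, H) ** 2   # [.]_{[0, H]}^2
        return max(first - second, 1.0)                # eq. (22)
    return sigma2
```

The floor at 1 and the cap at $H^2$ keep the estimator in the stated range $[1, H^2]$, so the inverse-variance weights below are always well-defined.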
Notice that the technique of variance estimation is also used in Xiong et al. (2023); Yin et al. (2022) for
standard linear MDPs, while we address the temporal dependency via a carefully designed three-fold
subsampling approach detailed in Appendix C.2.
Variance-weighted ridge regression. Incorporating the variance estimator $\widehat{\sigma}_h^2$ computed on $\widetilde{\mathcal{D}}^0$, we
replace the ridge regression updates (i.e., (12) and (14)) by their variance-weighted counterparts as follows:
$$\widehat{\theta}_h^{\sigma} = \arg\min_{\theta\in\mathbb{R}^d} \sum_{\tau\in\mathcal{D}_h^0} \frac{\big(\phi(s_h^{\tau}, a_h^{\tau})^\top \theta - r_h^{\tau}\big)^2}{\widehat{\sigma}_h^2(s_h^{\tau}, a_h^{\tau})} + \lambda_1 \|\theta\|_2^2 = \Sigma_h^{-1} \sum_{\tau\in\mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})\, r_h^{\tau}}{\widehat{\sigma}_h^2(s_h^{\tau}, a_h^{\tau})}, \tag{25}$$
$$\bar{\nu}_h^{\rho,\sigma,V}(\alpha) = \arg\min_{\nu\in\mathbb{R}^d} \sum_{\tau\in\mathcal{D}_h^0} \frac{\big(\phi(s_h^{\tau}, a_h^{\tau})^\top \nu - [V]_{\alpha}(s_{h+1}^{\tau})\big)^2}{\widehat{\sigma}_h^2(s_h^{\tau}, a_h^{\tau})} + \lambda_1 \|\nu\|_2^2 = \Sigma_h^{-1} \sum_{\tau\in\mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})\, [V]_{\alpha}(s_{h+1}^{\tau})}{\widehat{\sigma}_h^2(s_h^{\tau}, a_h^{\tau})}, \tag{26}$$
where $\lambda_1$ is the regularization coefficient and $\mathcal{D}^0$ is another sub-dataset of $\mathcal{D}$ that is independent of $\widetilde{\mathcal{D}}^0$. Accordingly,
DROP-V constructs an empirical variance-aware robust Bellman operator as
$$(\widehat{\mathcal{B}}_h^{\rho,\sigma} V)(s,a) = \phi(s,a)^\top \big(\widehat{\theta}_h^{\sigma} + \widehat{\nu}_h^{\rho,\sigma,V}\big), \tag{27}$$
Similar to DROP, we also perform pessimistic value iterations, where the estimated Q-function is
updated by
$$\bar{Q}_h(s,a) = \widehat{\mathcal{B}}_h^{\rho,\sigma}\big(\widehat{V}_{h+1}\big)(s,a) - \underbrace{\gamma_1 \sum_{i=1}^d \big\|\phi_i(s,a)\, 1_i\big\|_{\Sigma_h^{-1}}}_{\text{penalty function } \Gamma_h^{\sigma} :\, \mathcal{S}\times\mathcal{A}\to\mathbb{R}}, \quad \forall (s,a,h) \in \mathcal{S}\times\mathcal{A}\times[H],$$
with a fixed, estimated $\widehat{V}_{h+1}$ obtained from the previous iteration and some coefficient $\gamma_1 > 0$.
The rest of DROP-V follows the procedure described in Algorithm 1. To avoid redundancy, the detailed
implementation of DROP-V is provided in Appendix C.1.
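The weighted least-squares step (25)-(26) amounts to rescaling each sample by its inverse estimated variance before solving a regularized least squares. A generic sketch (with array shapes assumed for illustration):

```python
import numpy as np

def variance_weighted_ridge(Phi, y, sigma2, lam1=1.0):
    """Solves min_w sum_tau (phi_tau^T w - y_tau)^2 / sigma2_tau + lam1*||w||^2,
    as in (25)-(26): high-variance samples are down-weighted, and Sigma is
    the variance-weighted cumulative covariance matrix.

    Phi : K x d features;  y : K targets;  sigma2 : K variance estimates (>= 1)
    """
    w = 1.0 / np.asarray(sigma2, dtype=float)
    Sigma = (Phi * w[:, None]).T @ Phi + lam1 * np.eye(Phi.shape[1])
    coef = np.linalg.solve(Sigma, (Phi * w[:, None]).T @ y)
    return coef, Sigma
```

With $\widehat{\sigma}_h^2 \equiv 1$, this recovers the unweighted ridge updates used by DROP.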
Provided that $d \ge H$ and $K \ge \widetilde{O}\Big(\frac{H^4 + H^6\sqrt{d}\,\kappa}{\kappa^2} + \frac{1}{d_{\min}^b}\Big)$, then with probability exceeding $1-\delta$, the robust policy
$\widehat{\pi}$ learned by DROP-V satisfies
$$\mathrm{SubOpt}(\widehat{\pi}; \zeta, \mathcal{P}^{\rho}) \le \widetilde{O}\big(\sqrt{d}\big) \sum_{h=1}^{H} \sum_{i=1}^{d} \max_{P\in\mathcal{P}^{\rho}(P^0)} \mathbb{E}_{\pi^{\star},P}\Big[\big\|\phi_i(s_h,a_h)\, 1_i\big\|_{(\Sigma_h^{\star})^{-1}}\Big], \tag{29}$$
where $\mathrm{Var}_{P_h^0}[V](s,a) = \int_{\mathcal{S}} P_{h,s,a}^0(s')\, V^2(s')\, ds' - \big(\int_{\mathcal{S}} P_{h,s,a}^0(s')\, V(s')\, ds'\big)^2$ represents the conditional variance for
any value function $V : \mathcal{S} \to [0,H]$ and any $(s,a,h) \in \mathcal{S}\times\mathcal{A}\times[H]$.
Compared to the instance-dependent sub-optimality bound of DROP (cf. Theorem 1), the above guarantee
for DROP-V is tighter since $H^2 \Lambda_h^{-1} \succeq (\Sigma_h^{\star})^{-1}$. The underlying reason for the improvement is the use of the
variance estimator to control the conditional variance, together with the tighter penalty function designed via a
Bernstein-type inequality, whereas that of DROP relies on the Hoeffding-type counterpart.
5 Conclusion
In this paper, we investigate the sample complexity of distributionally robust offline RL when the model
has linear representations and the uncertainty set is characterized by the TV distance. We develop a robust
variant of pessimistic value iteration with linear function approximation, called DROP. We establish sub-
optimality guarantees for DROP under various offline data assumptions. Compared to the prior art, DROP
notably improves the sample complexity by at least $\widetilde{O}(d)$ under the partial feature coverage assumption.
We further incorporate variance estimation into DROP to develop an enhanced variant, referred to as DROP-V,
which improves the sub-optimality bound of DROP. In the future, it will be of interest to consider different
choices of uncertainty sets and to establish lower bounds for the entire range of the uncertainty level.
Acknowledgement
This work is supported in part by the grants ONR N00014-19-1-2404, NSF CCF-2106778, DMS-2134080,
and CNS-2148212.
References
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits.
Advances in Neural Information Processing Systems, 24.
Agarwal, A., Jin, Y., and Zhang, T. (2023). VOQL: Towards optimal regret in model-free RL with nonlinear
function approximation. The Thirty Sixth Annual Conference on Learning Theory, pages 987–1063.
Badrinath, K. P. and Kalathil, D. (2021). Robust reinforcement learning using least squares policy iteration
with provable performance guarantees. In International Conference on Machine Learning, pages 511–520.
PMLR.
Bellemare, M. G., Le Roux, N., Castro, P. S., and Moitra, S. (2019). Distributional reinforcement learning
with linear function approximation. In The 22nd International Conference on Artificial Intelligence and
Statistics, pages 2203–2211. PMLR.
Bertsekas, D. (2012). Dynamic programming and optimal control: Volume I, volume 4. Athena Scientific.
Blanchet, J., Lu, M., Zhang, T., and Zhong, H. (2023). Double pessimism is provably efficient for distri-
butionally robust offline reinforcement learning: Generic algorithm and robust partial coverage. arXiv
preprint arXiv:2305.09659.
Derman, E. and Mannor, S. (2020). Distributional robustness and regularization in reinforcement learning.
arXiv preprint arXiv:2003.02894.
Ding, W., Shi, L., Chi, Y., and Zhao, D. (2024). Seeing is not believing: Robust reinforcement learning
against spurious correlation. Advances in Neural Information Processing Systems, 36.
Dong, J., Li, J., Wang, B., and Zhang, J. (2022). Online policy optimization for robust MDP. arXiv preprint
arXiv:2209.13841.
Goyal, V. and Grand-Clement, J. (2022). Robust Markov decision processes: Beyond rectangularity. Mathematics of Operations Research.
Gu, S., Chen, G., Zhang, L., Hou, J., Hu, Y., and Knoll, A. (2022). Constrained reinforcement learning for
vehicle motion planning with topological reachability analysis. Robotics, 11(4):81.
He, J., Zhao, H., Zhou, D., and Gu, Q. (2023). Nearly minimax optimal reinforcement learning for linear
Markov decision processes. In International Conference on Machine Learning, pages 12790–12822. PMLR.
Ho, C. P., Petrik, M., and Wiesemann, W. (2018). Fast bellman updates for robust MDPs. In International
Conference on Machine Learning, pages 1979–1988. PMLR.
Ho, C. P., Petrik, M., and Wiesemann, W. (2021). Partial policy iteration for L1-robust Markov decision
processes. Journal of Machine Learning Research, 22(275):1–46.
Iyengar, G. N. (2005). Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280.
Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020). Provably efficient reinforcement learning with linear
function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR.
Jin, Y., Yang, Z., and Wang, Z. (2021). Is pessimism provably efficient for offline RL? In International
Conference on Machine Learning, pages 5084–5096.
Kaufman, D. L. and Schaefer, A. J. (2013). Robust modified policy iteration. INFORMS Journal on
Computing, 25(3):396–410.
Kumar, N., Derman, E., Geist, M., Levy, K., and Mannor, S. (2023). Policy gradient for s-rectangular robust
Markov decision processes. arXiv preprint arXiv:2301.13589.
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline reinforcement learning: Tutorial, review, and
perspectives on open problems. arXiv preprint arXiv:2005.01643.
Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2022a). Settling the sample complexity of model-based offline
reinforcement learning. arXiv preprint arXiv:2204.05275.
Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2024). Settling the sample complexity of model-based offline
reinforcement learning. The Annals of Statistics, 52(1):233–260.
Li, Y. and Lan, G. (2023). First-order policy optimization for robust policy evaluation. arXiv preprint
arXiv:2307.15890.
Li, Y., Zhao, T., and Lan, G. (2022b). First-order policy optimization for robust Markov decision process.
arXiv preprint arXiv:2209.10579.
Liang, Z., Ma, X., Blanchet, J., Zhang, J., and Zhou, Z. (2023). Single-trajectory distributionally robust
reinforcement learning. arXiv preprint arXiv:2301.11721.
Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., and Zhou, Z. (2022). Distributionally robust
q-learning. In International Conference on Machine Learning, pages 13623–13643. PMLR.
Liu, Z. and Xu, P. (2024). Distributionally robust off-dynamics reinforcement learning: Provable efficiency
with linear function approximation. arXiv preprint arXiv:2402.15399.
Ma, X., Liang, Z., Blanchet, J., Liu, M., Xia, L., Zhang, J., Zhao, Q., and Zhou, Z. (2022). Distributionally
robust offline reinforcement learning with linear function approximation. arXiv preprint arXiv:2209.06620.
Min, Y., Wang, T., Zhou, D., and Gu, Q. (2021). Variance-aware off-policy evaluation with linear function
approximation. Advances in Neural Information Processing Systems, 34:7598–7610.
Pan, Y., Chen, Y., and Lin, F. (2024). Adjustable robust reinforcement learning for online 3d bin packing.
Advances in Neural Information Processing Systems, 36.
Panaganti, K. and Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative
model. In International Conference on Artificial Intelligence and Statistics, pages 9582–9602. PMLR.
Prashanth, L. and Bhatnagar, S. (2010). Reinforcement learning with function approximation for traffic
signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2):412–421.
Ramesh, S. S., Sessa, P. G., Hu, Y., Krause, A., and Bogunovic, I. (2023). Distributionally robust model-
based reinforcement learning with large state spaces. arXiv preprint arXiv:2309.02236.
Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., and Russell, S. (2021). Bridging offline reinforcement learning and
imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702–
11716.
Shi, L. and Chi, Y. (2022). Distributionally robust model-based offline reinforcement learning with near-
optimal sample complexity. arXiv preprint arXiv:2208.05767.
Shi, L., Li, G., Wei, Y., Chen, Y., and Chi, Y. (2022). Pessimistic Q-learning for offline reinforcement
learning: Towards optimal sample complexity. In Proceedings of the 39th International Conference on
Machine Learning, volume 162, pages 19967–20025. PMLR.
Shi, L., Li, G., Wei, Y., Chen, Y., Geist, M., and Chi, Y. (2023). The curious price of distributional
robustness in reinforcement learning with a generative model. Advances in Neural Information Processing
Systems, 36.
Smirnova, E., Dohmatob, E., and Mary, J. (2019). Distributionally robust reinforcement learning. arXiv
preprint arXiv:1902.08708.
Tamar, A., Mannor, S., and Xu, H. (2014). Scaling up robust MDPs using function approximation. In
International Conference on Machine Learning, pages 181–189. PMLR.
Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. Springer Publishing Company, Incorpo-
rated, 1st edition.
Uehara, M. and Sun, W. (2021). Pessimistic model-based offline reinforcement learning under partial coverage.
arXiv preprint arXiv:2107.06226.
Uprety, A. and Rawat, D. B. (2020). Reinforcement learning for iot security: A comprehensive survey. IEEE
Internet of Things Journal, 8(11):8693–8706.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science, vol-
ume 47. Cambridge University Press.
Wang, R., Foster, D. P., and Kakade, S. M. (2020). What are the statistical limits of offline rl with linear
function approximation? arXiv preprint arXiv:2010.11895.
Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023a). A finite sample complexity bound for distributionally
robust q-learning. arXiv preprint arXiv:2302.13203.
Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023b). On the foundation of distributionally robust reinforce-
ment learning. arXiv preprint arXiv:2311.09018.
Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023c). Sample complexity of variance-reduced distributionally
robust Q-learning. arXiv preprint arXiv:2305.18420.
Wang, Y. and Zou, S. (2021). Online robust reinforcement learning with model uncertainty. Advances in
Neural Information Processing Systems, 34.
Wolff, E. M., Topcu, U., and Murray, R. M. (2012). Robust control of uncertain Markov decision processes
with temporal logic specifications. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC),
pages 3372–3379. IEEE.
Xie, T., Cheng, C.-A., Jiang, N., Mineiro, P., and Agarwal, A. (2021). Bellman-consistent pessimism for
offline reinforcement learning. arXiv preprint arXiv:2106.06926.
Xiong, W., Zhong, H., Shi, C., Shen, C., Wang, L., and Zhang, T. (2023). Nearly minimax optimal offline
reinforcement learning with linear function approximation: Single-agent MDP and Markov game. In The
Eleventh International Conference on Learning Representations.
Xu, H. and Mannor, S. (2012). Distributionally robust Markov decision processes. Mathematics of Operations
Research, 37(2):288–300.
Xu, Z., Panaganti, K., and Kalathil, D. (2023). Improved sample complexity bounds for distributionally
robust reinforcement learning. arXiv preprint arXiv:2303.02783.
Yang, L. and Wang, M. (2019). Sample-optimal parametric Q-learning using linearly additive features. In
International Conference on Machine Learning, pages 6995–7004.
Yang, L. and Wang, M. (2020). Reinforcement learning in feature space: Matrix bandit, kernels, and regret
bound. In International Conference on Machine Learning, pages 10746–10756. PMLR.
Yang, W., Wang, H., Kozuno, T., Jordan, S. M., and Zhang, Z. (2023). Avoiding model estimation in robust
Markov decision processes with a generative model. arXiv preprint arXiv:2302.01248.
Yang, W., Zhang, L., and Zhang, Z. (2022). Toward theoretical understandings of robust Markov decision
processes: Sample complexity and asymptotics. The Annals of Statistics, 50(6):3223–3248.
Yin, M., Duan, Y., Wang, M., and Wang, Y.-X. (2022). Near-optimal offline reinforcement learning with
linear representation: Leveraging variance information with pessimism. arXiv preprint arXiv:2203.05804.
Yu, C., Liu, J., Nemati, S., and Yin, G. (2021). Reinforcement learning in healthcare: A survey. ACM
Computing Surveys, 55(1):1–36.
Zanette, A., Wainwright, M. J., and Brunskill, E. (2021). Provable benefits of actor-critic methods for offline
reinforcement learning. arXiv preprint arXiv:2108.08812.
Zhang, R., Hu, Y., and Li, N. (2023). Regularized robust mdps and risk-sensitive mdps: Equivalence, policy
gradient, and sample complexity. arXiv preprint arXiv:2306.11626.
Zhou, D., Gu, Q., and Szepesvari, C. (2021a). Nearly minimax optimal reinforcement learning for linear
mixture Markov decision processes. In Proceedings of Thirty Fourth Conference on Learning Theory,
volume 134 of Proceedings of Machine Learning Research, pages 4532–4576. PMLR.
Zhou, Z., Bai, Q., Zhou, Z., Qiu, L., Blanchet, J., and Glynn, P. (2021b). Finite-sample regret bound for
distributionally robust offline tabular reinforcement learning. In International Conference on Artificial
Intelligence and Statistics, pages 3331–3339. PMLR.
A Technical Lemmas
A.1 Proof of Lemma 1
For a Lin-RMDP satisfying Assumption 2, the robust linear transition operator defined in (4) obeys: for any
time step h ∈ [H],

$$\begin{aligned}
(\mathbb{P}^{\rho}_h f)(s,a) &= \inf_{\mu_h \in \mathcal{U}^{\rho}(\mu^0_h)} \int_{\mathcal{S}} \phi(s,a)^\top \mu_h(s') f(s')\,\mathrm{d}s' \\
&= \inf_{\mu_h \in \mathcal{U}^{\rho}(\mu^0_h)} \sum_{i=1}^d \phi_i(s,a) \int_{\mathcal{S}} \mu_{h,i}(s') f(s')\,\mathrm{d}s' \\
&= \sum_{i=1}^d \phi_i(s,a) \inf_{\mu_{h,i} \in \mathcal{U}^{\rho}(\mu^0_{h,i})} \int_{\mathcal{S}} \mu_{h,i}(s') f(s')\,\mathrm{d}s' \\
&= \phi(s,a)^\top \inf_{\mu_h \in \mathcal{U}^{\rho}(\mu^0_h)} \int_{\mathcal{S}} \mu_h(s') f(s')\,\mathrm{d}s',
\end{aligned}$$

where the penultimate equality is due to $\phi_i(s,a) \ge 0$ for all $(i,s,a) \in [d]\times\mathcal{S}\times\mathcal{A}$ and $\mathcal{U}^{\rho}(\mu^0_h) := \otimes_{i\in[d]}\mathcal{U}^{\rho}(\mu^0_{h,i})$.
Therefore, the robust linear Bellman operator defined in (3) is linear in the feature map φ: for $(h,s,a) \in [H]\times\mathcal{S}\times\mathcal{A}$,

$$(\mathbb{B}^{\rho}_h V)(s,a) = \phi(s,a)^\top\big(\theta_h + \nu^{\rho,V}_h\big).$$
Lemma 3 (Bernstein-type inequality for self-normalized processes (Zhou et al., 2021a)). Let $\{\eta_t\}_{t=1}^{\infty}$ be a
real-valued stochastic process and let $\{\mathcal{F}_t\}_{t=0}^{\infty}$ be a filtration such that $\eta_t$ is $\mathcal{F}_t$-measurable. Let $\{x_t\}_{t=1}^{\infty}$ be an
$\mathbb{R}^d$-valued stochastic process such that $x_t$ is $\mathcal{F}_{t-1}$-measurable and $\|x_t\| \le L$. Let $\Lambda_t = \lambda I_d + \sum_{s=1}^t x_s x_s^\top$. Assume
that

$$|\eta_t| \le R, \qquad \mathbb{E}[\eta_t \mid \mathcal{F}_{t-1}] = 0, \qquad \mathbb{E}[\eta_t^2 \mid \mathcal{F}_{t-1}] \le \sigma^2.$$

Then for any δ > 0, with probability at least 1 − δ, for all t > 0, we have

$$\Big\|\sum_{s=1}^t x_s \eta_s\Big\|_{\Lambda_t^{-1}} \le 8\sigma\sqrt{d\log\Big(1+\frac{tL^2}{\lambda d}\Big)\log\Big(\frac{4t^2}{\delta}\Big)} + 4R\log\Big(\frac{4t^2}{\delta}\Big).$$
Lemma 4 (Lemma H.5, Min et al. (2021)). Let $\phi : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$ be a bounded function such that $\|\phi(s,a)\|_2 \le C$
for all $(s,a) \in \mathcal{S}\times\mathcal{A}$. For any $K > 0$ and $\lambda > 0$, define $\bar{G}_K = \sum_{k=1}^K \phi(s_k,a_k)\phi(s_k,a_k)^\top + \lambda I_d$, where
$(s_k,a_k)$ are i.i.d. samples from some distribution ν over $\mathcal{S}\times\mathcal{A}$, and let $G = \mathbb{E}_{\nu}[\phi(s,a)\phi(s,a)^\top]$. Then for any
$\delta \in (0,1)$, if K satisfies

$$K \ge \max\big\{512 C^4 \|G^{-1}\|^2 \log(2d/\delta),\; 4\lambda\|G^{-1}\|\big\},$$

then with probability at least 1 − δ, it holds simultaneously for all $u \in \mathbb{R}^d$ that

$$\|u\|_{\bar{G}_K^{-1}} \le \frac{2}{\sqrt{K}}\|u\|_{G^{-1}}.$$
Lemma 5 (Lemma 5.1, Jin et al. (2021)). Suppose that with probability at least 1 − δ, the penalty
function $\Gamma_h : \mathcal{S}\times\mathcal{A} \to \mathbb{R}$ in Algorithm 1 satisfies

$$\big|(\widehat{\mathbb{B}}^{\rho}_h \widehat{V}_{h+1})(s,a) - (\mathbb{B}^{\rho}_h \widehat{V}_{h+1})(s,a)\big| \le \Gamma_h(s,a), \qquad \forall (s,a,h) \in \mathcal{S}\times\mathcal{A}\times[H]. \quad (31)$$

Then we have

$$0 \le \iota_h(s,a) \le 2\Gamma_h(s,a), \qquad \forall (s,a,h) \in \mathcal{S}\times\mathcal{A}\times[H].$$
Useful facts.
Lemma 6. For any functions $f_1 : \mathcal{C} \subseteq \mathbb{R} \to \mathbb{R}$ and $f_2 : \mathcal{C} \subseteq \mathbb{R} \to \mathbb{R}$, we have

$$\max_{\alpha\in\mathcal{C}} f_1(\alpha) - \max_{\alpha\in\mathcal{C}} f_2(\alpha) \le \max_{\alpha\in\mathcal{C}}\big(f_1(\alpha) - f_2(\alpha)\big).$$
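A quick numerical illustration of Lemma 6; the two functions below are arbitrary choices of ours, used only to exercise the inequality on a grid over C = [0, 1]:

```python
# Numerical sanity check of Lemma 6: max f1 - max f2 <= max (f1 - f2).
# f1 and f2 are hypothetical test functions, evaluated on a dense grid.
import numpy as np

alpha = np.linspace(0.0, 1.0, 101)
f1 = np.sin(3 * alpha)           # illustrative f1
f2 = alpha ** 2 - 0.5 * alpha    # illustrative f2

lhs = f1.max() - f2.max()
rhs = (f1 - f2).max()
assert lhs <= rhs + 1e-12
```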
Lemma 7. For any positive semi-definite matrix $A \in \mathbb{R}^{d\times d}$ and any constant $c \ge 0$, we have

$$\mathrm{Tr}\big(A(I + cA)^{-1}\big) \le \sum_{i=1}^d \frac{\lambda_i}{1+c\lambda_i}, \quad (32)$$

where $\{\lambda_i\}_{i=1}^d$ are the eigenvalues of A and $\mathrm{Tr}(\cdot)$ denotes the trace of the given matrix.

Proof. Note that

$$A(I+cA)^{-1} = A(I+cA)^{-1} + \frac{1}{c}(I+cA)^{-1} - \frac{1}{c}(I+cA)^{-1} = \frac{1}{c}I - \frac{1}{c}(I+cA)^{-1}.$$

In addition, the eigenvalues of $(I+cA)^{-1}$ are $\{\frac{1}{1+c\lambda_i}\}_{i=1}^d$. Therefore,

$$\mathrm{Tr}\big(A(I+cA)^{-1}\big) = \sum_{i=1}^d \frac{1}{c}\Big(1 - \frac{1}{1+c\lambda_i}\Big) = \sum_{i=1}^d \frac{\lambda_i}{1+c\lambda_i}.$$
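A quick numerical check of the trace identity behind (32), on a random positive semi-definite matrix of our choosing:

```python
# Verify Tr(A (I + cA)^{-1}) = sum_i lambda_i / (1 + c*lambda_i) for PSD A.
import numpy as np

rng = np.random.default_rng(0)
d, c = 5, 2.7
B = rng.standard_normal((d, d))
A = B @ B.T                       # random positive semi-definite matrix
lam = np.linalg.eigvalsh(A)       # its eigenvalues

lhs = np.trace(A @ np.linalg.inv(np.eye(d) + c * A))
rhs = np.sum(lam / (1 + c * lam))
assert abs(lhs - rhs) < 1e-8
```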
Lemma 8 (modified Lemma 4, Shi et al. (2023)). Consider any probability vector $\mu^0 \in \Delta(\mathcal{S})$, any fixed
uncertainty level ρ, and an uncertainty set $\mathcal{U}^{\rho}(\mu^0)$ satisfying Assumption 2. For any value function $V : \mathcal{S} \to [0,H]$, recalling the definition of $[V]_{\alpha}$ in (11), one has

$$\inf_{\mu \in \mathcal{U}^{\rho}(\mu^0)} \int_{\mathcal{S}} \mu(s')V(s')\,\mathrm{d}s' = \max_{\alpha\in[\min_s V(s),\,\max_s V(s)]}\Big\{\mathbb{E}_{s'\sim\mu^0}\big[[V]_{\alpha}(s')\big] - \rho\big(\alpha - \min_{s'}[V]_{\alpha}(s')\big)\Big\}. \quad (33)$$
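To make the dual formula (33) concrete, here is a minimal numerical sketch (our own illustrative code, not from the paper): it evaluates the right-hand side of (33) by a grid search over α for a discrete nominal distribution µ⁰, using the clipped value $[V]_\alpha(s) = \min\{V(s), \alpha\}$.

```python
# Sketch of the TV dual formula (33): the robust expectation over a TV ball
# around mu0 reduces to a one-dimensional maximization over alpha.
# Discrete states and a grid over alpha are illustrative simplifications.
import numpy as np

def robust_expectation_tv(mu0, V, rho, n_grid=2001):
    """max over alpha in [min V, max V] of E_{mu0}[V]_alpha - rho*(alpha - min_s [V]_alpha(s))."""
    alphas = np.linspace(V.min(), V.max(), n_grid)
    vals = []
    for a in alphas:
        V_a = np.minimum(V, a)                       # clipped value [V]_alpha
        vals.append(mu0 @ V_a - rho * (a - V_a.min()))
    return max(vals)

mu0 = np.array([0.5, 0.3, 0.2])
V = np.array([1.0, 2.0, 4.0])
val = robust_expectation_tv(mu0, V, rho=0.1)
# The robust value never exceeds the nominal expectation, never drops below min V.
assert V.min() <= val <= mu0 @ V
```

With ρ = 0, the maximum is attained at α = max V and the formula recovers the nominal expectation; increasing ρ shrinks the value toward min V.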
B Analysis for DROP: Algorithm 1
B.1 Proof of equation (10)
Recall that for any $(s,a,h) \in \mathcal{S}\times\mathcal{A}\times[H]$, one has

$$\begin{aligned}
(\mathbb{B}^{\rho}_h V)(s,a) &= r_h(s,a) + \inf_{\mu_h\in\mathcal{U}^{\rho}(\mu^0_h)}\phi(s,a)^\top\int_{\mathcal{S}}\mu_h(s')V(s')\,\mathrm{d}s' \\
&= \phi(s,a)^\top\theta_h + \sum_{i=1}^d \phi_i(s,a)\inf_{\mu_{h,i}\in\mathcal{U}^{\rho}(\mu^0_{h,i})}\int_{\mathcal{S}}\mu_{h,i}(s')V(s')\,\mathrm{d}s'.
\end{aligned}$$

According to Lemma 8, for any value function $V : \mathcal{S}\to[0,H]$ and any uncertainty set $\mathcal{U}^{\rho}(\mu^0_{h,i})$, $(h,i)\in[H]\times[d]$, satisfying Assumption 2, one has

$$\inf_{\mu_{h,i}\in\mathcal{U}^{\rho}(\mu^0_{h,i})}\int_{\mathcal{S}}\mu_{h,i}(s')V(s')\,\mathrm{d}s' = \max_{\alpha\in[\min_s V(s),\,\max_s V(s)]}\Big\{\int_{\mathcal{S}}\mu^0_{h,i}(s')[V]_{\alpha}(s')\,\mathrm{d}s' - \rho\big(\alpha - \min_{s'}[V]_{\alpha}(s')\big)\Big\}.$$

Denote $\nu^{\rho,V}_h = [\nu^{\rho,V}_{h,1}, \nu^{\rho,V}_{h,2}, \ldots, \nu^{\rho,V}_{h,d}]^\top \in \mathbb{R}^d$, where

$$\nu^{\rho,V}_{h,i} = \max_{\alpha\in[\min_s V(s),\,\max_s V(s)]}\Big\{\int_{\mathcal{S}}\mu^0_{h,i}(s')[V]_{\alpha}(s')\,\mathrm{d}s' - \rho\big(\alpha - \min_{s'}[V]_{\alpha}(s')\big)\Big\}$$

for any $i \in [d]$. Therefore,

$$(\mathbb{B}^{\rho}_h V)(s,a) = \phi(s,a)^\top\big(\theta_h + \nu^{\rho,V}_h\big).$$
Algorithm 2 Two-fold-subsampling
Input: batch dataset D.
1: Split data: Split D into two halves D^main and D^aux, where |D^main| = |D^aux| = N_h/2. Denote by N^main_h(s)
(resp. N^aux_h(s)) the number of sample transitions from state s at step h in D^main (resp. D^aux).
2: Construct the high-probability lower bound N^trim_h(s) from D^aux: For each s ∈ S and 1 ≤ h ≤ H,
compute

$$N^{\mathrm{trim}}_h(s) = \max\Big\{N^{\mathrm{aux}}_h(s) - 10\sqrt{N^{\mathrm{aux}}_h(s)\log\frac{KH}{\delta}},\; 0\Big\}. \quad (34)$$

3: Construct the almost temporally statistically independent D^trim: Let D^main_h(s) denote the
dataset containing all transition-reward sample tuples at the current state s and step h from D^main. For
any (s,h) ∈ S × [H], subsample min{N^trim_h(s), N^main_h(s)} transition-reward sample tuples uniformly at random from
D^main_h(s), denoted by D^{main,sub}.
Output: D⁰ = D^{main,sub}.
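A minimal Python sketch of Algorithm 2, under the assumption that the batch dataset is stored as a dict mapping (s, h) to lists of transition tuples tagged with their trajectory index; the function and field names are ours, not the paper's:

```python
# Illustrative sketch of two-fold subsampling (Algorithm 2).
# D: dict mapping (s, h) -> list of dicts, each with a "traj" trajectory index.
import math
import random

def two_fold_subsample(D, H, K, delta, seed=0):
    rng = random.Random(seed)
    # Step 1: split trajectories into two halves D_main and D_aux.
    idx = list(range(K))
    rng.shuffle(idx)
    main_ids, aux_ids = set(idx[: K // 2]), set(idx[K // 2:])
    D0 = {}
    for (s, h), tuples in D.items():
        # Step 2: high-probability lower bound N_h^trim(s) from the aux counts (34).
        n_aux = sum(1 for t in tuples if t["traj"] in aux_ids)
        main_tuples = [t for t in tuples if t["traj"] in main_ids]
        n_trim = max(int(n_aux - 10 * math.sqrt(n_aux * math.log(K * H / delta))), 0)
        # Step 3: subsample min{N^trim, N^main} tuples at random from D_main.
        D0[(s, h)] = rng.sample(main_tuples, min(n_trim, len(main_tuples)))
    return D0
```

At realistic sample sizes the trimming term is small relative to the counts; in tiny toy datasets it often drives N^trim to zero, which is the conservative behavior the high-probability bound dictates.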
Recall that we assume the sample trajectories in D are generated independently. Then, the following
lemma shows that (34) is a valid lower bound on N^main_h(s) for any s ∈ S and h ∈ [H]; it can be obtained
by a slight modification of Lemma 3 and Lemma 7 in Li et al. (2022a).

Lemma 9. With probability at least 1 − 2δ, if N^trim_h(s) satisfies (34) for every s ∈ S and h ∈ [H], then D⁰ :=
D^{main,sub} contains temporally statistically independent samples and the following bound holds simultaneously:

$$N^{\mathrm{trim}}_h(s) \le N^{\mathrm{main}}_h(s), \qquad \forall (s,h)\in\mathcal{S}\times[H].$$
In addition, with probability at least 1 − 3δ, the following lower bound also holds:

$$N^{\mathrm{trim}}_h(s,a) \ge \frac{K d^{\mathrm{b}}_h(s,a)}{8} - 5\sqrt{K d^{\mathrm{b}}_h(s,a)\log\Big(\frac{KH}{\delta}\Big)}, \qquad \forall (s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H].$$
Proof. Let $\mathcal{S}_{\mathcal{D}}$ be the collection of all the states appearing in the batch dataset D, where $|\mathcal{S}_{\mathcal{D}}| \le K$. By
changing the union bound over S to $\mathcal{S}_{\mathcal{D}}$ in the proof of Li et al. (2022a, Lemma 3), the remaining proof
still holds, since $N^{\mathrm{trim}}_h(s) = N^{\mathrm{main}}_h(s) = 0$ for all $s \notin \mathcal{S}_{\mathcal{D}}$. Together with Li et al. (2022a, Lemma 7), D⁰ contains
temporally statistically independent samples if $N^{\mathrm{trim}}_h(s) \le N^{\mathrm{main}}_h(s)$ for all $(s,h)\in\mathcal{S}\times[H]$.
to represent the model evaluation error at the h-th step of DROP. In addition, we denote the estimated weight
of the transition kernel at the h-th step by

$$\widehat{\mu}_h(s) = \Lambda_h^{-1}\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\mathbb{1}(s^{\tau}_{h+1}=s) \in \mathbb{R}^d, \qquad \forall (s,h)\in\mathcal{S}\times[H], \quad (36)$$

where 1(·) is the indicator function. Accordingly, it holds that $\bar{\nu}^{\widehat{V}}_h(\alpha) = \int_{\mathcal{S}}\widehat{\mu}_h(s')[\widehat{V}_{h+1}(s')]_{\alpha}\,\mathrm{d}s' \in \mathbb{R}^d$. We
denote the set of all possible state occupancy distributions associated with the optimal policy $\pi^{\star}$ and
any $P \in \mathcal{P}^{\rho}(P^0)$ as

$$\mathcal{D}^{\star}_h = \Big\{\big[d^{\star,P}_h(s)\big]_{s\in\mathcal{S}} : P\in\mathcal{P}^{\rho}(P^0)\Big\} = \Big\{\big[d^{\star,P}_h(s,\pi^{\star}_h(s))\big]_{s\in\mathcal{S}} : P\in\mathcal{P}^{\rho}(P^0)\Big\}. \quad (37)$$
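The estimator (36) amounts to one regularized least-squares solve per next state; below is a minimal sketch with toy features (all names and data are illustrative, not from the paper):

```python
# Sketch of the estimated transition weights (36):
# mu_hat_h(s) = Lambda_h^{-1} * sum_tau phi(s_tau, a_tau) * 1{s_{tau+1} = s}.
import numpy as np

def estimate_mu(phis, next_states, n_states, lam=1.0):
    """phis: (N, d) feature matrix; next_states: (N,) indices of s_{tau+1}."""
    N, d = phis.shape
    Lambda = lam * np.eye(d) + phis.T @ phis      # regularized Gram matrix
    mu = np.zeros((n_states, d))
    for s in range(n_states):
        ind = (next_states == s).astype(float)    # indicator 1(s_{tau+1} = s)
        mu[s] = np.linalg.solve(Lambda, phis.T @ ind)
    return mu

rng = np.random.default_rng(1)
phis = rng.dirichlet(np.ones(3), size=200)        # nonnegative rows summing to 1
next_states = rng.integers(0, 4, size=200)
mu = estimate_mu(phis, next_states, n_states=4)
assert mu.shape == (4, 3)
```

Since the indicators over s sum to one for every sample, the rows of mu sum to the single solve $\Lambda^{-1}\Phi^\top\mathbf{1}$, mirroring how µ̂_h aggregates to a (regularized) probability-like object.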
Here, δ ∈ (0,1) is the confidence parameter and K is the upper bound on N_h for any h ∈ [H]. Then, $\{\widehat{\pi}_h\}_{h=1}^H$
generated by Algorithm 1 satisfies, with probability at least 1 − δ,

$$\mathrm{SubOpt}(\widehat{\pi};\zeta,\mathcal{P}^{\rho}) \le \widetilde{O}(\sqrt{d}H)\sum_{h=1}^H\sum_{i=1}^d\max_{d^{\star}_h\in\mathcal{D}^{\star}_h}\mathbb{E}_{d^{\star}_h}\Big[\|\phi_i(s_h,a_h)\mathbb{1}_i\|_{\Lambda_h^{-1}}\Big].$$
B.3.2 Proof of Theorem 3
The proof of Theorem 3 can be summarized in the following key steps.
Step 1: establishing the pessimistic property. To substantiate the pessimism, we rely heavily on
the following lemma, whose proof is postponed to Appendix B.3.3.

Lemma 10. Suppose all the assumptions in Theorem 3 hold and follow all the parameter settings in (38).
Then for any $(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$, with probability at least 1 − δ, the value functions $\{\widehat{V}_h\}_{h=1}^H$ generated by
DROP satisfy

$$\big|(\widehat{\mathbb{B}}^{\rho}_h\widehat{V}_{h+1})(s,a) - (\mathbb{B}^{\rho}_h\widehat{V}_{h+1})(s,a)\big| \le \Gamma_h(s,a) := \gamma_0\sum_{i=1}^d\|\phi_i(s,a)\mathbb{1}_i\|_{\Lambda_h^{-1}}. \quad (39)$$

If the condition (39) holds, then

$$Q^{\star,\rho}_h(s,a) \ge Q^{\widehat{\pi},\rho}_h(s,a) \ge \widehat{Q}_h(s,a) \quad\text{and}\quad V^{\star,\rho}_h(s) \ge V^{\widehat{\pi},\rho}_h(s) \ge \widehat{V}_h(s), \quad \forall (s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H], \quad (40)$$

i.e., $\widehat{Q}_h(s,a)$ and $\widehat{V}_h(s)$ are pessimistic estimates of $Q^{\widehat{\pi},\rho}_h(s,a)$
and $V^{\widehat{\pi},\rho}_h(s)$, respectively. Notice that if $Q^{\widehat{\pi},\rho}_h(s,a) \ge \widehat{Q}_h(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, then
$V^{\widehat{\pi},\rho}_h(s) \ge \widehat{V}_h(s)$ spontaneously holds for all $s\in\mathcal{S}$ by induction.
Step 2: bounding the suboptimality gap. Notice that for any h ∈ [H] and any s ∈ S,

$$V^{\star,\rho}_h(s) - V^{\widehat{\pi},\rho}_h(s) = V^{\star,\rho}_h(s) - \widehat{V}_h(s) + \widehat{V}_h(s) - V^{\widehat{\pi},\rho}_h(s) \le V^{\star,\rho}_h(s) - \widehat{V}_h(s). \quad (42)$$
In the following, we control the value difference $V^{\star,\rho}_h(s) - \widehat{V}_h(s)$ for any $(s,h)\in\mathcal{S}\times[H]$. First,
recalling the definition of $\widehat{V}_h$ (cf. line 12 in Algorithm 1) and the robust Bellman consistency equation (2) leads to

$$Q^{\star,\rho}_h(s,\pi^{\star}_h(s)) - \widehat{Q}_h(s,\pi^{\star}_h(s)) = (\mathbb{P}^{\rho}_h V^{\star,\rho}_{h+1})(s,\pi^{\star}_h(s)) - (\mathbb{P}^{\rho}_h\widehat{V}_{h+1})(s,\pi^{\star}_h(s)) + \iota_h(s,\pi^{\star}_h(s)), \quad \forall s\in\mathcal{S}. \quad (44)$$

Denote

$$\widehat{P}^{\inf,\widehat{V}}_{h,s,\pi^{\star}_h(s)}(\cdot) := \arg\min_{P(\cdot)\in\mathcal{P}^{\rho}(P^0_{h,s,\pi^{\star}_h(s)})}\int_{\mathcal{S}}P(s')\widehat{V}_{h+1}(s')\,\mathrm{d}s', \quad (45)$$

$$\widehat{P}^{\inf}_h(s) = \widehat{P}^{\inf,\widehat{V}}_{h,s,\pi^{\star}_h(s)}(\cdot) \quad\text{and}\quad \iota^{\star}_h(s) := \iota_h(s,\pi^{\star}_h(s)), \quad \forall s\in\mathcal{S}, \quad (48)$$

where the equality is from $V^{\star,\rho}_{H+1}(s) = \widehat{V}_{H+1}(s) = 0$ and the empty-product convention $\prod_{j=t}^{t-1}\widehat{P}^{\inf}_j(s) = \mathbb{1}_s$.
Together with (42), the suboptimality gap defined in (9) satisfies

$$\mathrm{SubOpt}(\widehat{\pi};\zeta,\mathcal{P}^{\rho}) \le \mathbb{E}_{s_1\sim\zeta}\big[V^{\star,\rho}_1(s_1)\big] - \mathbb{E}_{s_1\sim\zeta}\big[\widehat{V}_1(s_1)\big] \le \sum_{t=1}^H\mathbb{E}_{s_t\sim d^{\star}_{1:t}}\big[\iota^{\star}_t(s_t)\big]. \quad (49)$$
$$\begin{aligned}
&\le \max_{\alpha\in[0,H]} 2\Big\|\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\big(\epsilon^{\tau}_h(\alpha,\widehat{V}_{h+1})-\epsilon^{\tau}_h(\alpha^{\dagger},\widehat{V}_{h+1})\big)\Big\|^2_{\Lambda_h^{-1}} + 2\Big\|\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\epsilon^{\tau}_h(\alpha^{\dagger},\widehat{V}_{h+1})\Big\|^2_{\Lambda_h^{-1}} \\
&\le 8\epsilon_0^2K^2/\lambda_0 + 2\underbrace{\Big\|\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\epsilon^{\tau}_h(\alpha^{\dagger},\widehat{V}_{h+1})\Big\|^2_{\Lambda_h^{-1}}}_{T_{2,h}}, \quad (53)
\end{aligned}$$

for some $\alpha^{\dagger}\in\mathcal{N}(\epsilon_0,H)$, where the proof of the last inequality is postponed to Appendix B.3.5. Alternatively,

$$T_{2,h} \le \sup_{\alpha\in\mathcal{N}(\epsilon_0,H)}\Big\|\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\epsilon^{\tau}_h(\alpha,\widehat{V}_{h+1})\Big\|^2_{\Lambda_h^{-1}}. \quad (54)$$

Note that the samples in D⁰ are temporally statistically independent, i.e., $\widehat{V}_{h+1}$ is independent of $\mathcal{D}^0_h$,
or equivalently of $\widehat{\mu}_h$. Therefore, we can directly control $T_{2,h}$ via the following lemma.
Lemma 12 (Concentration of self-normalized processes). Let $V : \mathcal{S}\to[0,H]$ be any fixed vector that is
independent of $\widehat{\mu}_h$, and let $\alpha\in[0,H]$ be a fixed constant. For any fixed h ∈ [H] and any δ ∈ (0,1), we have

$$\mathbb{P}_{\mathcal{D}}\Big(\Big\|\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\epsilon^{\tau}_h(\alpha,V)\Big\|^2_{\Lambda_h^{-1}} > H^2\big(2\log(1/\delta) + d\log(1+N_h/\lambda_0)\big)\Big) \le \delta.$$
The proof of Lemma 12 is postponed to Appendix B.3.6. Applying Lemma 12 and the union bound
over $\mathcal{N}(\epsilon_0,H)$, we have

$$\mathbb{P}_{\mathcal{D}}\Big(\sup_{\alpha\in\mathcal{N}(\epsilon_0,H)}\Big\|\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\epsilon^{\tau}_h(\alpha,\widehat{V}_{h+1})\Big\|^2_{\Lambda_h^{-1}} \ge H^2\big(2\log(H|\mathcal{N}(\epsilon_0,H)|/\delta) + d\log(1+N_h/\lambda_0)\big)\Big) \le \delta/H,$$

for any fixed h ∈ [H]. According to Vershynin (2018), one has $|\mathcal{N}(\epsilon_0,H)| \le \frac{3H}{\epsilon_0}$. Taking the union bound
over h ∈ [H], we arrive at

$$\sup_{\alpha\in\mathcal{N}(\epsilon_0,H)}\Big\|\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\epsilon^{\tau}_h(\alpha,\widehat{V}_{h+1})\Big\|^2_{\Lambda_h^{-1}} \le 2H^2\log\Big(\frac{3H^2}{\epsilon_0\delta}\Big) + H^2 d\log\Big(1+\frac{K}{\lambda_0}\Big), \quad (55)$$

with probability exceeding 1 − δ, where we use $N_h \le K$ for every h ∈ [H] on the right-hand side.
Combining (53), (54), and (55), we have

$$T_{1,h} \le 8\epsilon_0^2K^2/\lambda_0 + 4H^2\log\Big(\frac{3H^2}{\epsilon_0\delta}\Big) + 2H^2 d\log\Big(1+\frac{K}{\lambda_0}\Big),$$

with probability at least 1 − δ. Let $\epsilon_0 = H/K$ and $\lambda_0 = 1$. Then,

$$T_{1,h} \le 8H^2 + 4H^2\log\Big(\frac{3HK}{\delta}\Big) + 2H^2 d\log(1+K) \le 8H^2 + 4H^2\log(3HK/\delta) + 2dH^2\log(2K).$$

Let $\xi_0 = \log(3HK/\delta) \ge 1$ and note that $\log(2K) \le \log(3HK/\delta) = \xi_0$. Then, we have

$$T_{1,h} \le 8H^2 + 4H^2\xi_0 + 2dH^2\xi_0 \le 16dH^2\xi_0.$$

Therefore, with probability exceeding 1 − δ, one has

$$\big|(\widehat{\mathbb{B}}^{\rho}_h\widehat{V}_{h+1})(s,a) - (\mathbb{B}^{\rho}_h\widehat{V}_{h+1})(s,a)\big| \le \big(\sqrt{d}H + 4\sqrt{d\xi_0}H\big)\sum_{i=1}^d\|\phi_i(s,a)\mathbb{1}_i\|_{\Lambda_h^{-1}} \le \gamma_0\sum_{i=1}^d\|\phi_i(s,a)\mathbb{1}_i\|_{\Lambda_h^{-1}} = \Gamma_h(s,a),$$

where $\gamma_0 = 6\sqrt{d\xi_0}H$, and the above inequality satisfies (31).
B.3.4 Proof of Lemma 11
It follows from Lemma 1 and (16) that

$$\big|(\widehat{\mathbb{B}}^{\rho}_h\widehat{V}_{h+1})(s,a) - (\mathbb{B}^{\rho}_h\widehat{V}_{h+1})(s,a)\big| = \big|\phi(s,a)^\top\big(\widehat{w}^{\rho,\widehat{V}}_h - w^{\rho,\widehat{V}}_h\big)\big| \le \underbrace{\big|\phi(s,a)^\top\big(\widehat{\theta}_h - \theta_h\big)\big|}_{\mathrm{(i)}} + \underbrace{\big|\phi(s,a)^\top\big(\widehat{\nu}^{\rho,\widehat{V}}_h - \nu^{\rho,\widehat{V}}_h\big)\big|}_{\mathrm{(ii)}}, \quad \forall (s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]. \quad (56)$$
We first bound term (i) for any h ∈ [H]. By the update rule (12), we have

$$\mathrm{(i)} = \Big|\phi(s,a)^\top\Lambda_h^{-1}\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)r^{\tau}_h - \phi(s,a)^\top\theta_h\Big| = \lambda_0\big|\phi(s,a)^\top\Lambda_h^{-1}\theta_h\big|,$$

where the second equality is from Assumption 1 and (15). Applying the Cauchy–Schwarz inequality leads to

$$\mathrm{(i)} \le \lambda_0\|\phi(s,a)\|_{\Lambda_h^{-1}}\|\theta_h\|_{\Lambda_h^{-1}} \le \sqrt{d\lambda_0}\sum_{i=1}^d\|\phi_i(s,a)\mathbb{1}_i\|_{\Lambda_h^{-1}}, \quad (57)$$

where the last inequality uses $\|\theta_h\| \le \sqrt{d}$ and $\|\Lambda_h^{-1}\| \le 1/\lambda_0$, so that

$$\|\theta_h\|_{\Lambda_h^{-1}} \le \|\Lambda_h^{-1}\|^{1/2}\|\theta_h\| \le \sqrt{d/\lambda_0}, \qquad \forall h\in[H].$$
Next, to bound term (ii), we introduce the following notation. Let

$$\epsilon^{\tau}_h(\alpha,V) = \int_{\mathcal{S}}P^0_h(s'|s^{\tau}_h,a^{\tau}_h)[V]_{\alpha}(s')\,\mathrm{d}s' - [V]_{\alpha}(s^{\tau}_{h+1})$$

for any $V : \mathcal{S}\to[0,H]$ and $\alpha\in[\min_s V(s),\,\max_s V(s)]$. Also, define two auxiliary functions:

$$\widehat{g}_{h,i}(\alpha) = \int_{\mathcal{S}}\widehat{\mu}_{h,i}(s')[\widehat{V}_{h+1}]_{\alpha}(s')\,\mathrm{d}s' - \rho\big(\alpha - \min_{s'}[\widehat{V}_{h+1}]_{\alpha}(s')\big),$$

$$g^0_{h,i}(\alpha) = \int_{\mathcal{S}}\mu^0_{h,i}(s')[\widehat{V}_{h+1}]_{\alpha}(s')\,\mathrm{d}s' - \rho\big(\alpha - \min_{s'}[\widehat{V}_{h+1}]_{\alpha}(s')\big),$$
where the first inequality is due to (11) and (13) as well as Lemma 6, and the last inequality is based on
$\phi_i(s,a) \ge 0$ for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ from Assumption 1. Moreover,

$$\begin{aligned}
&\int_{\mathcal{S}}\mu^0_{h,i}(s')[\widehat{V}_{h+1}]_{\alpha}(s')\,\mathrm{d}s' - \int_{\mathcal{S}}\widehat{\mu}_{h,i}(s')[\widehat{V}_{h+1}]_{\alpha}(s')\,\mathrm{d}s' \\
&= \int_{\mathcal{S}}\mu^0_{h,i}(s')[\widehat{V}_{h+1}]_{\alpha}(s')\,\mathrm{d}s' - \mathbb{1}_i^\top\Lambda_h^{-1}\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)[\widehat{V}_{h+1}]_{\alpha}(s^{\tau}_{h+1}) \\
&= \mathbb{1}_i^\top\Lambda_h^{-1}\Big(\Lambda_h\int_{\mathcal{S}}\mu^0_h(s')[\widehat{V}_{h+1}]_{\alpha}(s')\,\mathrm{d}s' - \sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)[\widehat{V}_{h+1}]_{\alpha}(s^{\tau}_{h+1})\Big) \\
&= \mathbb{1}_i^\top\Lambda_h^{-1}\Big(\lambda_0\int_{\mathcal{S}}\mu^0_h(s')[\widehat{V}_{h+1}]_{\alpha}(s')\,\mathrm{d}s' + \sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\Big(\int_{\mathcal{S}}P^0_h(s'|s^{\tau}_h,a^{\tau}_h)[\widehat{V}_{h+1}]_{\alpha}(s')\,\mathrm{d}s' - [\widehat{V}_{h+1}]_{\alpha}(s^{\tau}_{h+1})\Big)\Big) \\
&= \mathbb{1}_i^\top\Lambda_h^{-1}\Big(\lambda_0\int_{\mathcal{S}}\mu^0_h(s')[\widehat{V}_{h+1}]_{\alpha}(s')\,\mathrm{d}s' + \sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\epsilon^{\tau}_h(\alpha,\widehat{V}_{h+1})\Big), \quad (59)
\end{aligned}$$

where the first equality is from (36) and the third one is due to (15). The term (iii), associated with
$\lambda_0\int_{\mathcal{S}}\mu^0_h(s')[\widehat{V}_{h+1}]_{\alpha}(s')\,\mathrm{d}s'$, satisfies (iii) ≤ $\sqrt{\lambda_0}H$, since $|\widehat{V}_{h+1}(s)| \le H$ for any s ∈ S
and $\|\Lambda_h^{-1}\| \le 1/\lambda_0$. Then we have

$$\mathrm{(ii)} \le \Big(\sqrt{\lambda_0}H + \max_{\alpha\in[\min_s\widehat{V}_{h+1}(s),\,\max_s\widehat{V}_{h+1}(s)]}\Big\|\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\epsilon^{\tau}_h(\alpha,\widehat{V}_{h+1})\Big\|_{\Lambda_h^{-1}}\Big)\sum_{i=1}^d\|\phi_i(s,a)\mathbb{1}_i\|_{\Lambda_h^{-1}}, \quad (61)$$

for any $(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$. Combining (61) with (57) finally leads to (51), which completes our proof.
$$\le 4\epsilon_0^2N_h^2/\lambda_0,$$

where the last inequality is based on $\|\phi(s,a)\| \le 1$ and $\lambda_{\min}(\Lambda_h) \ge \lambda_0$ for any $(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$, so that

$$\sum_{\tau,\tau'\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)^\top\Lambda_h^{-1}\phi(s^{\tau'}_h,a^{\tau'}_h) \le \sum_{\tau,\tau'\in\mathcal{D}^0_h}\|\phi(s^{\tau}_h,a^{\tau}_h)\|_2\cdot\|\phi(s^{\tau'}_h,a^{\tau'}_h)\|_2\cdot\|\Lambda_h^{-1}\| \le N_h^2/\lambda_0. \quad (62)$$

Thus,

$$\max_{\alpha\in[0,H]}\Big\|\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\big(\epsilon^{\tau}_h(\alpha,\widehat{V}_{h+1})-\epsilon^{\tau}_h(\alpha^{\dagger},\widehat{V}_{h+1})\big)\Big\|^2_{\Lambda_h^{-1}} \le 4\epsilon_0^2N_h^2/\lambda_0 \le 4\epsilon_0^2K^2/\lambda_0,$$

due to the fact that $N_h \le K$ for any h ∈ [H], which completes the proof of (53).
As shown in Jin et al. (2021, Lemma B.2), for any $\tau\in\mathcal{D}^0_h$, $\phi(s^{\tau}_h,a^{\tau}_h)$ is $\mathcal{F}_{h,\tau-1}$-measurable and
$\epsilon^{\tau}_h(\alpha,V)$ is $\mathcal{F}_{h,\tau}$-measurable. Hence $\{\epsilon^{\tau}_h(\alpha,V)\}_{\tau\in\mathcal{D}^0_h}$ is a stochastic process adapted to the filtration $\{\mathcal{F}_{h,\tau}\}_{\tau\in\mathcal{D}^0_h}$.
Then, we have

$$\mathbb{E}_{\mathcal{D}^0_h}\big[\epsilon^{\tau}_h(\alpha,V)\,\big|\,\mathcal{F}_{h,\tau-1}\big] = \int_{\mathcal{S}}P^0_h(s'|s^{\tau}_h,a^{\tau}_h)[V]_{\alpha}(s')\,\mathrm{d}s' - \mathbb{E}_{\mathcal{D}^0_h}\Big[[V(s^{\tau}_{h+1})]_{\alpha}\,\Big|\,\{(s^j_h,a^j_h)\}_{j=1}^{\tau\wedge N_h},\{r^j_h,s^j_{h+1}\}_{j=1}^{\tau-1}\Big] = 0.$$

Note that $\epsilon^{\tau}_h(\alpha,V) = \int_{\mathcal{S}}P^0_h(s'|s^{\tau}_h,a^{\tau}_h)[V]_{\alpha}(s')\,\mathrm{d}s' - [V(s^{\tau}_{h+1})]_{\alpha}$ for any $V\in[0,H]^{\mathcal{S}}$ and $\alpha\in[0,H]$, so

$$|\epsilon^{\tau}_h(\alpha,V)| \le H.$$

Hence, for the fixed h ∈ [H] and all $\tau\in\mathcal{D}^0_h$, the random variable $\epsilon^{\tau}_h(\alpha,V)$ is mean-zero and H-sub-Gaussian
conditioned on $\mathcal{F}_{h,\tau-1}$. We then invoke Lemma 2 with $\eta_{\tau} = \epsilon^{\tau}_h(\alpha,V)$ and $x_{\tau} = \phi(s^{\tau}_h,a^{\tau}_h)$: for any δ > 0,

$$\mathbb{P}_{\mathcal{D}}\Big(\Big\|\sum_{\tau\in\mathcal{D}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\epsilon^{\tau}_h(\alpha,V)\Big\|^2_{\Lambda_h^{-1}} > 2H^2\log\Big(\frac{\det(\Lambda_h)^{1/2}}{\delta\det(\lambda_0 I_d)^{1/2}}\Big)\Big) \le \delta.$$

Together with the facts that $\det(\Lambda_h)^{1/2} \le (\lambda_0+N_h)^{d/2}$ and $\det(\lambda_0 I_d)^{1/2} = \lambda_0^{d/2}$, we conclude the proof
of Lemma 12.
$$\Phi^{\star}_{h,i}(s) = \big(\phi_i(s,\pi^{\star}_h(s))\mathbb{1}_i\big)\big(\phi_i(s,\pi^{\star}_h(s))\mathbb{1}_i\big)^\top \in \mathbb{R}^{d\times d}, \quad (63)$$

$$b^{\star}_{h,i}(s) = \big(\phi_i(s,\pi^{\star}_h(s))\mathbb{1}_i\big)^\top\Lambda_h^{-1}\big(\phi_i(s,\pi^{\star}_h(s))\mathbb{1}_i\big). \quad (64)$$

With these notations in hand and recalling (17) in Theorem 1, one has

$$V^{\star,\rho}_1(\zeta) - V^{\widehat{\pi},\rho}_1(\zeta) \le 2\gamma_0\sum_{h=1}^H\sup_{d^{\star}_h\in\mathcal{D}^{\star}_h}\mathbb{E}_{s\sim d^{\star}_h}\Big[\sum_{i=1}^d\sqrt{b^{\star}_{h,i}(s)}\Big] = 2\gamma_0\sum_{h=1}^H\sup_{d^{\star}_h\in\mathcal{D}^{\star}_h}\sum_{i=1}^d\mathbb{E}_{s\sim d^{\star}_h}\Big[\sqrt{b^{\star}_{h,i}(s)}\Big]. \quad (65)$$
$$\frac{K}{16}\,\mathbb{E}_{d^{\mathrm{b}}_h}\big[\phi(s,a)\phi(s,a)^\top\big] + I_d,$$

from Assumption 3, where the second equality is because the trace is a linear mapping and the last
inequality holds by Lemma 7. We further define $E_{h,\mathrm{larger}} = \{i : \mathbb{E}_{(s,a)\sim d^{\star}_h}[\phi_i^2(s,a)] \ge \frac{1}{d}\}$. Due to Assumption 1, we first claim that

$$|E_{h,\mathrm{larger}}| \le \sqrt{d}. \quad (70)$$

• If $i \in E_{h,\mathrm{larger}}$, i.e., $\frac{1}{d} \le \mathbb{E}_{(s,a)\sim d^{\star}_h}[\phi_i^2(s,a)] \le 1$, we have

$$(69) \le \frac{16C^{\star}_{\mathrm{rob}}\cdot\mathbb{E}_{d^{\star}_h}\big[\phi_i^2(s,\pi^{\star}_h(s))\big]}{K} \le \frac{16C^{\star}_{\mathrm{rob}}}{K}, \quad (72)$$

where the last inequality holds due to $\phi_i^2(s,\pi^{\star}_h(s)) \le 1$.

Summing up the above cases and (70), we have

$$\sum_{i=1}^d\mathbb{E}_{s\sim d^{\star}_h}\Big[\sqrt{b^{\star}_{h,i}(s)}\Big] \le \sum_{i\in E_{h,\mathrm{larger}}}\mathbb{E}_{s\sim d^{\star}_h}\Big[\sqrt{b^{\star}_{h,i}(s)}\Big] + \sum_{i\notin E_{h,\mathrm{larger}}}\mathbb{E}_{s\sim d^{\star}_h}\Big[\sqrt{b^{\star}_{h,i}(s)}\Big] \le |E_{h,\mathrm{larger}}|\sqrt{\frac{16C^{\star}_{\mathrm{rob}}}{K}} + \big(d-|E_{h,\mathrm{larger}}|\big)\sqrt{\frac{16C^{\star}_{\mathrm{rob}}}{Kd}} \le 8\sqrt{\frac{C^{\star}_{\mathrm{rob}}\,d}{K}}.$$
Together with (65) and setting $\gamma_0 = 6\sqrt{d}H\sqrt{\log(3HK/\delta)}$, one obtains

$$V^{\star,\rho}_1(\zeta) - V^{\widehat{\pi},\rho}_1(\zeta) \le 2\gamma_0\sum_{h=1}^H\sup_{d^{\star}_h\in\mathcal{D}^{\star}_h}\sum_{i=1}^d\mathbb{E}_{s\sim d^{\star}_h}\Big[\sqrt{b^{\star}_{h,i}(s)}\Big] \le 96dH^2\sqrt{\frac{C^{\star}_{\mathrm{rob}}}{K}\log\Big(\frac{3HK}{\delta}\Big)},$$
which is equivalent to

$$\max_{(s,a)\in\mathcal{S}\times\mathcal{A}}\|\phi(s,a)\|_1 \ge \mathbb{E}_{(s,a)\sim d^{\star}_h}\big[\|\phi(s,a)\|_1\big] \ge \sum_{i\in\widetilde{E}_{h,\mathrm{larger}}}\mathbb{E}_{(s,a)\sim d^{\star}_h}\big[\phi_i(s,a)\big] > 1, \quad (74)$$

where the last inequality is from the linearity of the expectation. This contradicts Assumption 2, which
implies $\|\phi(s,a)\|_1 = 1$ for any $(s,a)\in\mathcal{S}\times\mathcal{A}$.

• Then, we show that $\widetilde{E}_{h,\mathrm{larger}} \subseteq E_{h,\mathrm{larger}}$: for every element $i\in\widetilde{E}_{h,\mathrm{larger}}$, we have

$$\frac{1}{d} \le \big(\mathbb{E}_{(s,a)\sim d^{\star}_h}[\phi_i(s,a)]\big)^2 \le \mathbb{E}_{(s,a)\sim d^{\star}_h}\big[\phi_i^2(s,a)\big],$$

where the second inequality is due to Jensen's inequality. Thus, $\widetilde{E}_{h,\mathrm{larger}} \subseteq E_{h,\mathrm{larger}}$.

Combining these two arguments, we establish the claim (70).
Lemma 13. Consider δ ∈ (0,1). Suppose Assumption 2, Assumption 4, and all conditions in Lemma 4 hold.
For any h ∈ [H], if $N_h \ge \max\{512\log(2Hd/\delta)/\kappa^2,\, 4/\kappa\}$, we have

$$\sum_{i=1}^d\|\phi_i(s,a)\mathbb{1}_i\|_{\Lambda_h^{-1}} \le \frac{2}{\sqrt{N_h\kappa}}, \qquad \forall (s,a)\in\mathcal{S}\times\mathcal{A},$$

$$\|\phi_i(s,a)\mathbb{1}_i\|_{\Lambda_h^{-1}} \le \frac{2\phi_i(s,a)}{\sqrt{N_h\kappa}}, \qquad \forall (i,s,a)\in[d]\times\mathcal{S}\times\mathcal{A}.$$
From (66), we have $N_h \ge K/16$ with probability exceeding 1 − 3δ, as long as K obeys (67). Together with
Lemma 13, with probability exceeding 1 − 4δ, one has

$$\sum_{i=1}^d\|\phi_i(s,a)\mathbb{1}_i\|_{\Lambda_h^{-1}} \le \frac{8}{\sqrt{K\kappa}}, \qquad \forall (s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H],$$

as long as $K \ge \max\{c_0\log(2Hd/\delta)/\kappa^2,\, c_0\log(KH/\delta)/d^{\mathrm{b}}_{\min}\}$ for some sufficiently large universal constant $c_0$.
It follows from Theorem 1 that

$$\mathrm{SubOpt}(\widehat{\pi};\zeta,\mathcal{P}^{\rho}) \le 96\sqrt{d}H^2\sqrt{\frac{\log(3HK/\delta)}{K\kappa}},$$

which completes the proof.
Algorithm 3 Distributionally Robust Pessimistic Least-Squares Value Iteration with Variance Estimation
(DROP-V)
Input: datasets D̃⁰, D⁰ ← Three-fold-subsampling(D); feature map φ(s,a) for (s,a) ∈ S × A; γ₁, λ₁ > 0.
Construct the variance estimator: Obtain (Ṽ, π̃) ← DROP(D̃⁰, φ).
1: For every h ∈ [H], compute $\widetilde{\Lambda}_h = \sum_{\tau\in\widetilde{\mathcal{D}}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\phi(s^{\tau}_h,a^{\tau}_h)^\top + I_d$ and

$$\nu_{h,1} = \widetilde{\Lambda}_h^{-1}\sum_{\tau\in\widetilde{\mathcal{D}}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\widetilde{V}^2_{h+1}(s^{\tau}_{h+1}), \qquad \nu_{h,2} = \widetilde{\Lambda}_h^{-1}\sum_{\tau\in\widetilde{\mathcal{D}}^0_h}\phi(s^{\tau}_h,a^{\tau}_h)\widetilde{V}_{h+1}(s^{\tau}_{h+1}).$$

6: for feature i = 1, ..., d do
7:  Update $\widehat{\nu}^{\rho,\sigma,\widehat{V}}_{h,i}$ via (28).
8: end for
9: $\widehat{w}^{\rho,\sigma,\widehat{V}}_h = \widehat{\theta}_h + \widehat{\nu}^{\rho,\sigma,\widehat{V}}_h$.
10: $\bar{Q}_h(\cdot,\cdot) = \phi(\cdot,\cdot)^\top\widehat{w}^{\rho,\sigma,\widehat{V}}_h - \gamma_1\sum_{i=1}^d\|\phi_i(\cdot,\cdot)\mathbb{1}_i\|_{\Sigma_h^{-1}}$.
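The two regressions ν_{h,1} and ν_{h,2} estimate the second and first moments of Ṽ_{h+1} in feature space, so the variance estimate follows the usual second-moment-minus-squared-first-moment construction. Below is a minimal sketch of that construction (our own code; the clipping to [1, H²] mirrors the bound 1 ≤ σ̂²_h ≤ H² used later in the analysis, and the paper's exact definition may differ in details):

```python
# Sketch of a DROP-V-style variance estimator: nu1 and nu2 are ridge
# estimates of E[V~^2] and E[V~], and sigma^2 = E[V^2] - (E[V])^2,
# clipped to [1, H^2]. All names and data here are illustrative.
import numpy as np

def variance_estimator(phis, v_next, H, lam=1.0):
    """phis: (N, d) features; v_next: (N,) values V~_{h+1}(s_{tau+1})."""
    N, d = phis.shape
    Lambda = lam * np.eye(d) + phis.T @ phis
    nu1 = np.linalg.solve(Lambda, phis.T @ (v_next ** 2))   # second moment
    nu2 = np.linalg.solve(Lambda, phis.T @ v_next)          # first moment

    def sigma2(phi):
        raw = phi @ nu1 - (phi @ nu2) ** 2                  # Var = E[V^2] - (E[V])^2
        return float(np.clip(raw, 1.0, H ** 2))
    return sigma2

rng = np.random.default_rng(2)
phis = rng.dirichlet(np.ones(3), size=100)
v_next = rng.uniform(0, 4, size=100)
sigma2 = variance_estimator(phis, v_next, H=4)
s2 = sigma2(phis[0])
assert 1.0 <= s2 <= 16.0
```

The lower clip at 1 keeps the inverse weights 1/σ̂ bounded, which is what makes the self-normalized terms φ/σ̂ in the subsequent concentration argument well behaved.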
Algorithm 4 Three-fold-subsampling
Input: batch dataset D.
1: Split data: Split D into three parts D^aux, D^main, and D^var, where |D^aux| = |D^main| = |D^var| = K/3.
Denote by N^main_h(s) (resp. N^aux_h(s), N^var_h(s)) the number of sample transitions from state s at step h
in D^main (resp. D^aux, D^var).
2: Construct the high-probability lower bound N^trim_h(s) from D^aux: For each s ∈ S and 1 ≤ h ≤ H,
compute

$$N^{\mathrm{trim}}_h(s) = \max\Big\{N^{\mathrm{aux}}_h(s) - 6\sqrt{N^{\mathrm{aux}}_h(s)\log\frac{KH}{\delta}},\; 0\Big\}. \quad (76)$$

3: Construct the almost temporally statistically independent D^{main,sub} and D^{var,sub}: Let D^main_h(s)
(resp. D^var_h(s)) be the set of all transition-reward sample tuples at state s and step h from D^main (resp.
D^var). For any (s,h) ∈ S × [H], subsample min{N^trim_h(s), N^main_h(s)} (resp. min{N^trim_h(s), N^var_h(s)})
sample tuples uniformly at random from D^main_h(s) (resp. D^var_h(s)), denoted by D^{main,sub} (resp. D^{var,sub}).
Output: D^{main,sub}, D^{var,sub}.

$$N^{\mathrm{trim}}_h(s) \le N^{\mathrm{main}}_h(s), \qquad N^{\mathrm{trim}}_h(s) \le N^{\mathrm{var}}_h(s), \qquad \forall (s,h)\in\mathcal{S}\times[H]. \quad (77)$$
In addition, with probability at least 1 − 4δ, the following bound also holds:

$$N^{\mathrm{trim}}_h(s,a) \ge \frac{Kd^{\mathrm{b}}_h(s,a)}{12} - \sqrt{6Kd^{\mathrm{b}}_h(s,a)\log\frac{KH}{\delta}}, \qquad \forall (s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]. \quad (78)$$
Proof. We begin with proving the first claim (77). Let $\mathcal{S}_{\mathcal{D}^{\mathrm{aux}}} \subset \mathcal{S}$ be the collection of all the states appearing
in the dataset D^aux, where $|\mathcal{S}_{\mathcal{D}^{\mathrm{aux}}}| \le K/3$. Without loss of generality, we assume that D^aux contains the
first K/3 trajectories, so that

$$N^{\mathrm{aux}}_h(s) = \sum_{k=1}^{K/3}\mathbb{1}(s^k_h = s), \qquad \forall (s,h)\in\mathcal{S}\times[H],$$

which can be viewed as a sum of K/3 independent Bernoulli random variables. By the union bound and
the Bernstein inequality, for any t ≥ 0,

$$\mathbb{P}\Big(\exists (s,h)\in\mathcal{S}_{\mathcal{D}^{\mathrm{aux}}}\times[H] : \Big|N^{\mathrm{aux}}_h(s) - \frac{K}{3}d^{\mathrm{b}}_h(s)\Big| \ge t\Big) \le \sum_{s\in\mathcal{S}_{\mathcal{D}^{\mathrm{aux}}},\,h\in[H]}\mathbb{P}\Big(\Big|N^{\mathrm{aux}}_h(s) - \frac{K}{3}d^{\mathrm{b}}_h(s)\Big| \ge t\Big) \le \frac{2KH}{3}\exp\Big(-\frac{t^2/2}{v_{s,h}+t/3}\Big),$$

where

$$v_{s,h} = \frac{K}{3}\mathrm{Var}\big[\mathbb{1}(s^k_h = s)\big] \le \frac{Kd^{\mathrm{b}}_h(s)}{3}.$$

Here, we abuse the notation Var to represent the variance of the Bernoulli random variable $\mathbb{1}(s^k_h = s)$. Then, with
probability at least 1 − 2δ/3, we have

$$\Big|N^{\mathrm{aux}}_h(s) - \frac{K}{3}d^{\mathrm{b}}_h(s)\Big| \le \sqrt{2v_{s,h}\log\Big(\frac{KH}{\delta}\Big)} + \frac{2}{3}\log\Big(\frac{KH}{\delta}\Big) \le \sqrt{Kd^{\mathrm{b}}_h(s)\log\Big(\frac{KH}{\delta}\Big)} + \log\Big(\frac{KH}{\delta}\Big), \quad \forall (s,h)\in\mathcal{S}\times[H]. \quad (79)$$

Similarly, with probability at least 1 − 2δ/3, we have

$$\Big|N^{\mathrm{main}}_h(s) - \frac{K}{3}d^{\mathrm{b}}_h(s)\Big| \le \sqrt{Kd^{\mathrm{b}}_h(s)\log\Big(\frac{KH}{\delta}\Big)} + \log\Big(\frac{KH}{\delta}\Big), \qquad \forall (s,h)\in\mathcal{S}\times[H]. \quad (80)$$
Also from (79),

$$N^{\mathrm{aux}}_h(s) \ge \frac{K}{3}d^{\mathrm{b}}_h(s) - \sqrt{Kd^{\mathrm{b}}_h(s)\log\Big(\frac{KH}{\delta}\Big)} - \log\Big(\frac{KH}{\delta}\Big) \ge \frac{K}{6}d^{\mathrm{b}}_h(s).$$

Therefore, with probability exceeding 1 − 4δ/3,

$$\begin{aligned}
N^{\mathrm{trim}}_h(s) &= N^{\mathrm{aux}}_h(s) - 6\sqrt{N^{\mathrm{aux}}_h(s)\log\frac{HK}{\delta}} \le N^{\mathrm{aux}}_h(s) - \sqrt{6Kd^{\mathrm{b}}_h(s)\log\frac{HK}{\delta}} \\
&\le N^{\mathrm{aux}}_h(s) - 2\sqrt{Kd^{\mathrm{b}}_h(s)\log\frac{HK}{\delta}} - \frac{1}{3}\sqrt{Kd^{\mathrm{b}}_h(s)\log\frac{HK}{\delta}} \\
&\le N^{\mathrm{aux}}_h(s) - 2\sqrt{Kd^{\mathrm{b}}_h(s)\log\frac{HK}{\delta}} - 2\log\frac{HK}{\delta} \\
&\le N^{\mathrm{main}}_h(s).
\end{aligned}$$

By the same argument, we can also guarantee that $N^{\mathrm{trim}}_h(s) \le N^{\mathrm{var}}_h(s)$ with
probability at least 1 − 4δ/3, for any $(s,h)\in\mathcal{S}\times[H]$.
Putting these two results together, we prove the first claim (77).
Next, we establish the second claim (78). To begin with, we claim that the following statement holds
with probability exceeding 1 − 2δ/3:

$$N^{\mathrm{trim}}_h(s,a) \ge N^{\mathrm{trim}}_h(s)\pi^{\mathrm{b}}_h(a|s) - \sqrt{2N^{\mathrm{trim}}_h(s)\pi^{\mathrm{b}}_h(a|s)\log\Big(\frac{KH}{\delta}\Big)} - \log\frac{KH}{\delta}, \quad \forall (s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H], \quad (83)$$

conditioned on the high-probability event that the first part (77) holds. In the sequel, we discuss the following
two cases, provided that inequality (83) holds.

• Case 1: $Kd^{\mathrm{b}}_h(s,a) = Kd^{\mathrm{b}}_h(s)\pi^{\mathrm{b}}_h(a|s) > 864\log\frac{KH}{\delta}$. From (79), with probability exceeding 1 − 2δ/3, one
has

$$N^{\mathrm{aux}}_h(s) \ge \frac{K}{3}d^{\mathrm{b}}_h(s) - \sqrt{Kd^{\mathrm{b}}_h(s)\log\Big(\frac{KH}{\delta}\Big)} - \log\Big(\frac{KH}{\delta}\Big) \ge \frac{K}{6}d^{\mathrm{b}}_h(s) \ge 144\log\frac{KH}{\delta}.$$

Together with the definition (76), we have

$$N^{\mathrm{trim}}_h(s) \ge N^{\mathrm{aux}}_h(s) - 6\sqrt{N^{\mathrm{aux}}_h(s)\log\frac{KH}{\delta}} \ge \frac{1}{2}N^{\mathrm{aux}}_h(s) \ge \frac{K}{12}d^{\mathrm{b}}_h(s).$$

Therefore,

$$N^{\mathrm{trim}}_h(s)\pi^{\mathrm{b}}_h(a|s) \ge \frac{K}{12}d^{\mathrm{b}}_h(s)\pi^{\mathrm{b}}_h(a|s) \ge 72\log\frac{KH}{\delta}.$$

Combining with (83), one can derive

$$N^{\mathrm{trim}}_h(s,a) \ge \frac{Kd^{\mathrm{b}}_h(s,a)}{12} - \sqrt{\frac{1}{6}Kd^{\mathrm{b}}_h(s,a)\log\Big(\frac{KH}{\delta}\Big)} - \log\frac{KH}{\delta} \ge \frac{Kd^{\mathrm{b}}_h(s,a)}{12} - \sqrt{6Kd^{\mathrm{b}}_h(s,a)\log\frac{KH}{\delta}},$$

with probability exceeding 1 − 4δ/3.
• Case 2: $Kd^{\mathrm{b}}_h(s,a) \le 864\log\frac{KH}{\delta}$. From (76), one has

$$N^{\mathrm{trim}}_h(s,a) \ge 0 \ge \frac{Kd^{\mathrm{b}}_h(s,a)}{12} - \sqrt{6Kd^{\mathrm{b}}_h(s,a)\log\frac{KH}{\delta}}.$$

By integrating these two cases, we can claim that (78) is valid with probability exceeding 1 − 4δ/3, as long as
inequality (83) holds under the condition of the high-probability event described in the first part (77).
Thus, the second claim (78) holds with probability at least 1 − 4δ.
Proof of inequality (83). First, observe that inequality (83) holds trivially if $N^{\mathrm{trim}}_h(s)\pi^{\mathrm{b}}_h(a|s) \le 2\log\frac{KH}{\delta}$.
Thus, we focus on the other case, $N^{\mathrm{trim}}_h(s)\pi^{\mathrm{b}}_h(a|s) > 2\log\frac{KH}{\delta}$. Denote

$$\mathcal{E} = \Big\{(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] \;\Big|\; N^{\mathrm{trim}}_h(s)\pi^{\mathrm{b}}_h(a|s) > 2\log\Big(\frac{KH}{\delta}\Big)\Big\}.$$

Notice from Algorithm 4 that $|\mathcal{E}| \le \frac{KH}{3}$. Supposing that the first claim (77) holds, one has
$N^{\mathrm{trim}}_h(s) = \min\{N^{\mathrm{trim}}_h(s), N^{\mathrm{main}}_h(s), N^{\mathrm{var}}_h(s)\}$. Therefore, $N^{\mathrm{trim}}_h(s,a)$ can be viewed as a sum of $N^{\mathrm{trim}}_h(s)$
independent Bernoulli random variables, each with mean $\pi^{\mathrm{b}}_h(a|s)$. Then, by the union bound
and the Bernstein inequality,

$$\mathbb{P}\Big(\exists (s,a,h)\in\mathcal{E} : \big|N^{\mathrm{trim}}_h(s,a) - N^{\mathrm{trim}}_h(s)\pi^{\mathrm{b}}_h(a|s)\big| \ge t\Big) \le \sum_{(s,a,h)\in\mathcal{E}}\mathbb{P}\Big(\big|N^{\mathrm{trim}}_h(s,a) - N^{\mathrm{trim}}_h(s)\pi^{\mathrm{b}}_h(a|s)\big| \ge t\Big) \le \frac{2KH}{3}\exp\Big(-\frac{t^2/2}{v_{s,h}+t/3}\Big).$$
Theorem 4. Consider the datasets D⁰ and D̃⁰ used for constructing the variance estimator in DROP-V and
δ ∈ (0,1). Suppose that both D⁰ and D̃⁰ contain $N_h < K$ sample tuples at every h ∈ [H]. Assume that,
conditional on $\{N_h\}_{h\in[H]}$, the sample tuples in D⁰ and D̃⁰ are statistically independent, where

$$N_h(s,a) \ge \frac{Kd^{\mathrm{b}}_h(s,a)}{24}, \qquad \forall (s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H].$$

Suppose that Assumptions 1, 2, and 3 hold. In DROP-V, we set

$$\lambda_1 = 1/H^2, \qquad \gamma_1 = \xi_1\sqrt{d}, \quad \text{where } \xi_1 = 66\log(3HK/\delta). \quad (85)$$

Then, with probability at least 1 − 7δ, $\{\widehat{\pi}_h\}_{h=1}^H$ generated by DROP-V satisfies

$$\mathrm{SubOpt}(\widehat{\pi};\zeta,\mathcal{P}^{\rho}) \le \widetilde{O}(\sqrt{d})\sum_{h=1}^H\sum_{i=1}^d\max_{d^{\star}_h\in\mathcal{D}^{\star}_h}\mathbb{E}_{d^{\star}_h}\Big[\|\phi_i(s_h,a_h)\mathbb{1}_i\|_{(\Sigma^{\star}_h)^{-1}}\Big],$$

if $\sqrt{d} \ge H$ and $K \ge \max\{\widetilde{O}(H^4/\kappa^2),\, \widetilde{O}(H^6 d/\kappa)\}$, where $\Sigma^{\star}_h$ is defined in (30).
By the construction in Algorithm 4, $\{N^{\mathrm{trim}}_h(s)\}_{s\in\mathcal{S},h\in[H]}$ is computed using D^aux, which is independent
of D⁰ := D^{main,sub} and D̃⁰ := D^{var,sub}. Moreover, from Lemma 14 and Lemma 15 in Section C.2,
$\{N^{\mathrm{trim}}_h(s)\}_{s\in\mathcal{S}}$ is a valid sampling number, $\mathcal{D}^0_h$ and $\widetilde{\mathcal{D}}^0_h$ can be treated as temporally statistically
independent samples, and

$$\sum_{s\in\mathcal{S}}N^{\mathrm{trim}}_h(s) \ge K/24$$

with probability exceeding 1 − 4δ, as long as $K \ge c_1\log\frac{KH}{\delta}/d^{\mathrm{b}}_{\min}$ for some sufficiently large $c_1$.
Therefore, by invoking Theorem 3 with $N_h := \sum_{s\in\mathcal{S}}N^{\mathrm{trim}}_h(s)$, we have

$$\mathrm{SubOpt}(\widehat{\pi};\zeta,\mathcal{P}^{\rho}) \le \widetilde{O}(\sqrt{d})\sum_{h=1}^H\sum_{i=1}^d\max_{d^{\star}_h\in\mathcal{D}^{\star}_h}\mathbb{E}_{d^{\star}_h}\Big[\|\phi_i(s_h,a_h)\mathbb{1}_i\|_{(\Sigma^{\star}_h)^{-1}}\Big],$$

with probability exceeding 1 − 11δ, if $\sqrt{d} \ge H$ and $K \ge \max\{\widetilde{O}(H^4/\kappa^2),\, \widetilde{O}(H^6 d/\kappa),\, \widetilde{O}(1/d^{\mathrm{b}}_{\min})\}$.
to represent the model evaluation error at the h-th step of our proposed Algorithm 3. In addition, for any
h ∈ [H], we let $\Gamma^{\star,\sigma}_h : \mathcal{S}\to\mathbb{R}$ satisfy

$$\Gamma^{\star,\sigma}_h(s) = \Gamma^{\sigma}_h(s,\pi^{\star}_h(s)), \qquad \forall s\in\mathcal{S}. \quad (87)$$

Also, denote $\mathsf{V}_{P^0_h}V(s,a) = \max\{1, \mathrm{Var}_{P^0_h}[V](s,a)\}$ for any $V : \mathcal{S}\to[0,H]$ and any $(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$.
Similar to Lemma 10, we have the following key lemma, whose proof can be found in Appendix C.5.

Lemma 16. Suppose all the assumptions in Theorem 4 hold and follow all the parameter settings in (85). In
addition, suppose that the number of trajectories satisfies $K \ge \max\{\widetilde{O}(H^4/\kappa^2),\, \widetilde{O}(H^6 d/\kappa)\}$. Then for any $(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$, with probability at least 1 − 7δ, one has

$$\big|(\widehat{\mathbb{B}}^{\rho,\sigma}_h\widehat{V}_{h+1})(s,a) - (\mathbb{B}^{\rho}_h\widehat{V}_{h+1})(s,a)\big| \le \Gamma^{\sigma}_h(s,a) := \gamma_1\sum_{i=1}^d\|\phi_i(s,a)\mathbb{1}_i\|_{\Sigma_h^{-1}}. \quad (88)$$

In addition,

$$\gamma_1\sum_{i=1}^d\|\phi_i(s,a)\mathbb{1}_i\|_{\Sigma_h^{-1}} \le 2\gamma_1\sum_{i=1}^d\|\phi_i(s,a)\mathbb{1}_i\|_{(\Sigma^{\star}_h)^{-1}}.$$
Next, following the same steps as in Appendix B.3.2, with probability exceeding 1 − 7δ, one has

$$\mathrm{SubOpt}(\widehat{\pi};\zeta,\mathcal{P}^{\rho}) \le 2\gamma_1\sum_{h=1}^H\sum_{i=1}^d\max_{d^{\star}_h\in\mathcal{D}^{\star}_h}\mathbb{E}_{d^{\star}_h}\Big[\|\phi_i(s_h,a_h)\mathbb{1}_i\|_{\Sigma_h^{-1}}\Big] \le 4\gamma_1\sum_{h=1}^H\sum_{i=1}^d\max_{d^{\star}_h\in\mathcal{D}^{\star}_h}\mathbb{E}_{d^{\star}_h}\Big[\|\phi_i(s_h,a_h)\mathbb{1}_i\|_{(\Sigma^{\star}_h)^{-1}}\Big].$$
where

$$\epsilon^{\tau,\sigma}_h(\alpha,V) = \frac{\int_{\mathcal{S}}P^0_h(s'|s^{\tau}_h,a^{\tau}_h)[V]_{\alpha}(s')\,\mathrm{d}s' - [V]_{\alpha}(s^{\tau}_{h+1})}{\widehat{\sigma}_h(s^{\tau}_h,a^{\tau}_h)}$$

for any $V : \mathcal{S}\to[0,H]$, any $\tau\in\mathcal{D}^0_h$, and $\alpha\in[\min_s V(s),\,\max_s V(s)]$. Letting $\lambda_1 = 1/H^2$ in (89)
simplifies the bound in Lemma 17 accordingly.
Due to the correlation between α and Vbh+1 , we also apply the uniform concentration with the minimal
ǫ1 -covering set N (ǫ1 , H) for α defined in (52). Similar to (53), there exists α† ∈ N (ǫ1 , H) s.t.
2
X φ(sτ , aτ ) τ,σ
2 † b
M1,h ≤ 8ǫ21 H 2 K 2 +2 h h
τ , aτ ) ǫh (α , Vh+1 ) ≤ 8 + 2M2,h , (91)
0
b
σh (s h h
τ ∈Dh
Σ−1
| {z h
}
M2,h
1 1
where the second inequality holds if ǫ1 ≤ HK . Without the loss of generality, we let ǫ1 = HK in the following
analysis. The detailed proof of (91) is postponed to Appendix C.5.2.
Next, we will focus on bound the term M2,h . Before proceeding, we first define the σ-algebra
$$\mathcal{F}_{h,\tau} = \sigma\Big(\{(s_h^j, a_h^j)\}_{j=1}^{(\tau+1) \wedge N_h},\ \{r_h^j, s_{h+1}^j\}_{j=1}^{\tau}\Big),$$
for any fixed $h \in [H]$ and $\tau \in \mathcal{D}_h^0$. Note that the samples in $\mathcal{D}^0$ are temporally statistically independent, i.e., $\widehat{V}_{h+1}$ is independent of $\mathcal{D}_h^0$ for any $h \in [H]$. In addition, $\{\widehat{\sigma}_h^2\}_{h \in [H]}$ is constructed using an additional dataset $\widetilde{\mathcal{D}}^0$, which is also independent of $\mathcal{D}^0$. Thus, for any $h \in [H]$ and $\tau \in \mathcal{D}_h^0$, the vector $\frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})}$ is $\mathcal{F}_{h,\tau-1}$-measurable and satisfies $\big\|\frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})}\big\| \leq 1$. Also, $\epsilon_h^{\tau,\sigma}(\alpha^{\dagger}, \widehat{V}_{h+1})$ is $\mathcal{F}_{h,\tau}$-measurable, with
$$\mathbb{E}\big[\epsilon_h^{\tau,\sigma}(\alpha^{\dagger}, \widehat{V}_{h+1}) \,\big|\, \mathcal{F}_{h,\tau-1}\big] = 0, \qquad \big|\epsilon_h^{\tau,\sigma}(\alpha^{\dagger}, \widehat{V}_{h+1})\big| \leq H.$$
It follows from the independence between $\widehat{\sigma}_h^2$ and $\mathcal{D}_h^0$ that
$$\mathrm{Var}_{P_h^0}\bigg[\frac{\int_{\mathcal{S}} P_h^0(s'|s_h^{\tau}, a_h^{\tau})[\widehat{V}_{h+1}]_{\alpha}(s')\mathrm{d}s' - [\widehat{V}_{h+1}]_{\alpha}(s)}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})}\bigg](s_h^{\tau}, a_h^{\tau}) = \frac{\mathrm{Var}_{P_h^0}[\widehat{V}_{h+1}]_{\alpha}(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h^2(s_h^{\tau}, a_h^{\tau})} \leq \frac{\mathsf{V}_{P_h^0} \widehat{V}_{h+1}(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h^2(s_h^{\tau}, a_h^{\tau})}, \tag{92}$$
for any $h \in [H]$ and $\tau \in \mathcal{D}_h^0$, where the inequality is from $\mathsf{V}_{P_h^0} \widehat{V}_{h+1}(s_h^{\tau}, a_h^{\tau}) = \max\{1, \mathrm{Var}_{P_h^0}[\widehat{V}_{h+1}](s_h^{\tau}, a_h^{\tau})\}$.
The analysis of the improvement on sample complexity heavily relies on the following lemma about the
variance estimation error, where the proof is deferred to Appendix C.5.3.
Lemma 18. Suppose that $\mathcal{D}^0$ and $\widetilde{\mathcal{D}}^0$ satisfy all the conditions imposed in Theorem 4, and that Assumptions 1, 2 and 3 hold. For any $h \in [H]$ and given the nominal transition kernel $P_h^0 : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, the $\widehat{V}_{h+1}$ generated by DROP-V on $\mathcal{D}^0$ and the $\widehat{\sigma}_h^2$ generated by DROP on $\widetilde{\mathcal{D}}^0$ satisfy
$$\big|\mathsf{V}_{P_h^0} V_{h+1}^{\star,\rho}(s_h^{\tau}, a_h^{\tau}) - \widehat{\sigma}_h^2(s_h^{\tau}, a_h^{\tau})\big| \leq \frac{70 c_b H^3 \sqrt{d}}{\sqrt{K\kappa}}, \quad \forall \tau \in \mathcal{D}_h^0, \tag{93}$$
$$\big|\mathsf{V}_{P_h^0} \widehat{V}_{h+1}(s_h^{\tau}, a_h^{\tau}) - \mathsf{V}_{P_h^0} V_{h+1}^{\star,\rho}(s_h^{\tau}, a_h^{\tau})\big| \leq \frac{320 c_b H^3 \sqrt{d}}{\sqrt{K\kappa}}, \quad \forall \tau \in \mathcal{D}_h^0, \tag{94}$$
where $c_b = 12\log(3HK/\delta)$ and $K \geq c_1 \log(2Hd/\delta) H^4/\kappa^2$ for some sufficiently large universal constant $c_1$, with probability at least $1 - 6\delta$.
Notice that $1 \leq \widehat{\sigma}_h^2(s, a) \leq H^2$ for any $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$. Invoking Lemma 18, we have
$$\sup_{\alpha \in \mathcal{N}(\epsilon_1, H)} \bigg\|\sum_{\tau \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})} \epsilon_h^{\tau,\sigma}(\alpha, \widehat{V}_{h+1})\bigg\|_{\Sigma_h^{-1}} \leq 16\sqrt{d \log(1 + H^2 K/d) \log(12 H^3 K^3/\delta)} + 4H \log(12 H^3 K^3/\delta) \leq c_1 \sqrt{d}$$
with probability $1 - 7\delta$, where $c_1 = 40\log(3HK/\delta)$. Then, equation (89) becomes
$$\big|(\widehat{\mathbb{B}}_h^{\rho,\sigma} \widehat{V}_{h+1})(s, a) - (\mathbb{B}_h^{\rho} \widehat{V}_{h+1})(s, a)\big| \leq \big(2\sqrt{d} + 2\sqrt{2} + 2c_1\sqrt{d}\big) \sum_{i=1}^{d} \|\phi_i(s, a) 1_i\|_{\Sigma_h^{-1}} = \gamma_1 \sum_{i=1}^{d} \|\phi_i(s, a) 1_i\|_{\Sigma_h^{-1}} := \Gamma_h^{\sigma}(s, a).$$
$$\sum_{\tau \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau}) \phi(s_h^{\tau}, a_h^{\tau})^{\top}}{2\,\mathsf{V}_{P_h^0} V_{h+1}^{\star,\rho}(s_h^{\tau}, a_h^{\tau})},$$
where the last inequality is from $\frac{70 c_b H^3 \sqrt{d}}{\sqrt{K\kappa}} \leq \frac{1}{2}$. Then, we obtain $\Sigma_h \succeq \frac{1}{2} \Sigma_h^{\star}$ for any $h \in [H]$, which completes the proof.
such that $\bar{\nu}_h^{\widehat{V}}(\alpha) = \int_{\mathcal{S}} \widehat{\mu}_h^{\sigma}(s')[\widehat{V}_{h+1}(s')]_{\alpha} \mathrm{d}s'$ defined in the update (26). Similar to (59), by letting $\epsilon_h^{\tau,\sigma}(\alpha, V) = \frac{\int_{\mathcal{S}} P_h^0(s'|s_h^{\tau}, a_h^{\tau})[V]_{\alpha}(s')\mathrm{d}s' - [V]_{\alpha}(s_{h+1}^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})}$ for any $V : \mathcal{S} \to [0, H]$, any $\tau \in \mathcal{D}_h^0$ and $\alpha \in [\min_s V(s), \max_s V(s)]$, we have
$$\int_{\mathcal{S}} \mu_{h,i}^0(s')[V_{h+1}]_{\alpha}(s')\mathrm{d}s' - \int_{\mathcal{S}} \widehat{\mu}_{h,i}^{\sigma}(s')[V_{h+1}]_{\alpha}(s')\mathrm{d}s'$$
$$= 1_i^{\top} \Sigma_h^{-1} \bigg(\lambda_1 \int_{\mathcal{S}} \mu_h^0(s')[V_{h+1}]_{\alpha}(s')\mathrm{d}s' + \sum_{\tau \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h^2(s_h^{\tau}, a_h^{\tau})} \Big(\int_{\mathcal{S}} P_h^0(s'|s_h^{\tau}, a_h^{\tau})[V_{h+1}]_{\alpha}(s')\mathrm{d}s' - [V_{h+1}]_{\alpha}(s_{h+1}^{\tau})\Big)\bigg)$$
$$= 1_i^{\top} \Sigma_h^{-1} \bigg(\lambda_1 \int_{\mathcal{S}} \mu_h^0(s')[V_{h+1}]_{\alpha}(s')\mathrm{d}s' + \sum_{\tau \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})} \epsilon_h^{\tau,\sigma}(\alpha, V_{h+1})\bigg).$$
Then, we obtain
$$\bigg|\phi_i(s, a) \int_{\mathcal{S}} \big(\widehat{\mu}_{h,i}^{\sigma}(s') - \mu_{h,i}(s')\big)[V_{h+1}]_{\alpha}(s')\mathrm{d}s'\bigg|$$
$$\leq \bigg|\phi_i(s, a) 1_i^{\top} \Sigma_h^{-1} \bigg(\lambda_1 \int_{\mathcal{S}} \mu_h^0(s')[V_{h+1}]_{\alpha}(s')\mathrm{d}s' + \sum_{\tau \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})} \epsilon_h^{\tau,\sigma}(\alpha, V_{h+1})\bigg)\bigg|$$
$$\leq \|\phi_i(s, a) 1_i\|_{\Sigma_h^{-1}} \bigg(\underbrace{\lambda_1 \Big\|\int_{\mathcal{S}} \mu_h^0(s')[V_{h+1}]_{\alpha}(s')\mathrm{d}s'\Big\|_{\Sigma_h^{-1}}}_{(ii)} + \bigg\|\sum_{\tau \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})} \epsilon_h^{\tau,\sigma}(\alpha, V_{h+1})\bigg\|_{\Sigma_h^{-1}}\bigg), \tag{95}$$
where the last inequality follows from the Cauchy–Schwarz inequality. Moreover, the term $(ii)$ in (95) can be further simplified as
$$(ii) \leq \lambda_1 \|\Sigma_h^{-1}\|^{\frac{1}{2}} \Big\|\int_{\mathcal{S}} \mu_h^0(s')[V_{h+1}]_{\alpha}(s')\mathrm{d}s'\Big\| \leq \sqrt{\lambda_1} H,$$
since $V(s) \leq H$ for any $s \in \mathcal{S}$ and $\|\Sigma_h^{-1}\| \leq 1/\lambda_1$. Then we have
$$(i) \leq \bigg(\sqrt{\lambda_1} H + \max_{\alpha \in [\min_s V_{h+1}(s), \max_s V_{h+1}(s)]} \bigg\|\sum_{\tau \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})} \epsilon_h^{\tau,\sigma}(\alpha, V_{h+1})\bigg\|_{\Sigma_h^{-1}}\bigg) \sum_{i=1}^{d} \|\phi_i(s, a) 1_i\|_{\Sigma_h^{-1}}, \tag{96}$$
which concludes our proof of (89).
$$\max_{\alpha \in [\min_s \widehat{V}_{h+1}(s), \max_s \widehat{V}_{h+1}(s)]} \bigg\|\sum_{\tau \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})} \epsilon_h^{\tau,\sigma}(\alpha, \widehat{V}_{h+1})\bigg\|_{\Sigma_h^{-1}}^2$$
$$\leq \max_{\alpha \in [0, H]} 2\bigg\|\sum_{\tau \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})} \big(\epsilon_h^{\tau,\sigma}(\alpha, \widehat{V}_{h+1}) - \epsilon_h^{\tau,\sigma}(\alpha^{\dagger}, \widehat{V}_{h+1})\big)\bigg\|_{\Sigma_h^{-1}}^2 + 2\bigg\|\sum_{\tau \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})} \epsilon_h^{\tau,\sigma}(\alpha^{\dagger}, \widehat{V}_{h+1})\bigg\|_{\Sigma_h^{-1}}^2,$$
where
$$\big|\epsilon_h^{\tau,\sigma}(\alpha, V) - \epsilon_h^{\tau,\sigma}(\alpha^{\dagger}, V)\big| \leq 2|\alpha - \alpha^{\dagger}| \leq 2\epsilon_1.$$
$$\bigg\|\sum_{\tau \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})} \big(\epsilon_h^{\tau,\sigma}(\alpha, V) - \epsilon_h^{\tau,\sigma}(\alpha^{\dagger}, V)\big)\bigg\|_{\Sigma_h^{-1}}^2 = \sum_{\tau, \tau' \in \mathcal{D}_h^0} \frac{\phi(s_h^{\tau}, a_h^{\tau})^{\top}}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})} \Sigma_h^{-1} \frac{\phi(s_h^{\tau'}, a_h^{\tau'})}{\widehat{\sigma}_h(s_h^{\tau'}, a_h^{\tau'})} \big(\epsilon_h^{\tau,\sigma}(\alpha, V) - \epsilon_h^{\tau,\sigma}(\alpha^{\dagger}, V)\big)\big(\epsilon_h^{\tau',\sigma}(\alpha, V) - \epsilon_h^{\tau',\sigma}(\alpha^{\dagger}, V)\big)$$
$$\leq \sum_{\tau, \tau' \in \mathcal{D}_h^0} \bigg|\frac{\phi(s_h^{\tau}, a_h^{\tau})^{\top}}{\widehat{\sigma}_h(s_h^{\tau}, a_h^{\tau})} \Sigma_h^{-1} \frac{\phi(s_h^{\tau'}, a_h^{\tau'})}{\widehat{\sigma}_h(s_h^{\tau'}, a_h^{\tau'})}\bigg| \cdot 4\epsilon_1^2,$$
where the last inequality is based on $\|\phi(s, a)\|_2 \leq 1$, $\widehat{\sigma}_h(s, a) \geq 1$ for any $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$ and $\lambda_{\min}(\Sigma_h) \geq \lambda_1 = \frac{1}{H^2}$ for any $h \in [H]$, such that
$$\sum_{\tau, \tau' \in \mathcal{D}_h^0} \big|\phi(s_h^{\tau}, a_h^{\tau})^{\top} \Sigma_h^{-1} \phi(s_h^{\tau'}, a_h^{\tau'})\big| \leq \sum_{\tau, \tau' \in \mathcal{D}_h^0} \|\phi(s_h^{\tau}, a_h^{\tau})\|_2 \cdot \|\phi(s_h^{\tau'}, a_h^{\tau'})\|_2 \cdot \|\Sigma_h^{-1}\| \leq N_h^2/\lambda_1. \tag{97}$$
1
where the second inequality holds if ǫ1 ≤ HK .
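The bound $|\epsilon_h^{\tau,\sigma}(\alpha, V) - \epsilon_h^{\tau,\sigma}(\alpha^{\dagger}, V)| \leq 2|\alpha - \alpha^{\dagger}|$ used above rests on the truncation $\alpha \mapsto [V]_{\alpha}(s)$ being 1-Lipschitz together with $\widehat{\sigma}_h \geq 1$. A minimal numerical sketch, assuming the truncation takes the clipping form $[V]_{\alpha}(s) = \min\{V(s), \alpha\}$ (our reading of the notation):

```python
import random

def trunc(v, alpha):
    # Assumed truncation operator [V]_alpha(s) = min{V(s), alpha}.
    return min(v, alpha)

random.seed(0)
H = 5.0
for _ in range(2000):
    v = random.uniform(0.0, H)
    a, b = random.uniform(0.0, H), random.uniform(0.0, H)
    # 1-Lipschitzness in alpha: |[V]_a(s) - [V]_b(s)| <= |a - b|.
    assert abs(trunc(v, a) - trunc(v, b)) <= abs(a - b) + 1e-12
```

Since $\epsilon_h^{\tau,\sigma}$ involves two such truncations (one averaged under $P_h^0$, one evaluated at $s_{h+1}^{\tau}$), both divided by $\widehat{\sigma}_h \geq 1$, the difference picks up a factor of 2.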
such that $\widehat{\sigma}_h^2(s, a) = \max\{1, \widehat{\mathrm{Var}}_h \widetilde{V}_{h+1}(s, a)\}$. In addition, recall that $\mathsf{V}_{P_h^0} V(s, a) = \max\{1, \mathrm{Var}_{P_h^0} V(s, a)\}$ for any value function $V : \mathcal{S} \to [0, H]$ and any $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$. Then, we can decompose the target terms by
$$\big|\mathsf{V}_{P_h^0} V_{h+1}^{\star,\rho}(s_h^{\tau}, a_h^{\tau}) - \widehat{\sigma}_h^2(s_h^{\tau}, a_h^{\tau})\big| \leq \big|\mathsf{V}_{P_h^0} \widetilde{V}_{h+1}(s_h^{\tau}, a_h^{\tau}) - \widehat{\sigma}_h^2(s_h^{\tau}, a_h^{\tau})\big| + \big|\mathsf{V}_{P_h^0} V_{h+1}^{\star,\rho}(s_h^{\tau}, a_h^{\tau}) - \mathsf{V}_{P_h^0} \widetilde{V}_{h+1}(s_h^{\tau}, a_h^{\tau})\big|$$
$$\leq \underbrace{\big|\mathrm{Var}_{P_h^0} \widetilde{V}_{h+1}(s_h^{\tau}, a_h^{\tau}) - \widehat{\mathrm{Var}}_h \widetilde{V}_{h+1}(s_h^{\tau}, a_h^{\tau})\big|}_{(a)} + \underbrace{\big|\mathrm{Var}_{P_h^0} V_{h+1}^{\star,\rho}(s_h^{\tau}, a_h^{\tau}) - \mathrm{Var}_{P_h^0} \widetilde{V}_{h+1}(s_h^{\tau}, a_h^{\tau})\big|}_{(b)},$$
for every $h \in [H]$, where the last inequality is based on the non-expansiveness of $\max\{1, \cdot\}$.
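The non-expansiveness of $\max\{1, \cdot\}$ invoked here is elementary; the following sketch checks it numerically (illustrative only, with a helper name of our choosing):

```python
import random

def clip_below_one(x):
    # The clipping max{1, x} used in defining the variance proxy V_{P_h^0}.
    return max(1.0, x)

random.seed(0)
for _ in range(2000):
    a = random.uniform(0.0, 30.0)
    b = random.uniform(0.0, 30.0)
    # Non-expansiveness: |max{1,a} - max{1,b}| <= |a - b|.
    assert abs(clip_below_one(a) - clip_below_one(b)) <= abs(a - b) + 1e-12
```

This is exactly why replacing $\mathsf{V}_{P_h^0}$ by $\mathrm{Var}_{P_h^0}$ in the decomposition above loses nothing: the clipping can only shrink differences.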
Step 1: Control $(a)$. For each $\tau \in \mathcal{D}_h^0$, we decompose the term $(a)$ into $(a_1)$ and $(a_2)$, where the last inequality is based on $a^2 - b^2 = (a + b)(a - b)$ for any $a, b \in \mathbb{R}$. In the sequel, we control $(a_1)$ and $(a_2)$, respectively. Before continuing, we first define $\widetilde{\mu}_{h,i} : \mathcal{S} \to \mathbb{R}$ as the $i$-th coordinate of
$$\widetilde{\mu}_h(s) = (\widetilde{\Lambda}_h)^{-1} \sum_{\tau' \in \widetilde{\mathcal{D}}_h^0} \phi(s_h^{\tau'}, a_h^{\tau'}) \mathbb{1}(s = s_{h+1}^{\tau'}) \in \mathbb{R}^d$$
such that $\nu_{h,1} = \int_{\mathcal{S}} \widetilde{\mu}_h(s') \widetilde{V}_{h+1}^2(s')\mathrm{d}s' \in \mathbb{R}^d$ and $\nu_{h,2} = \int_{\mathcal{S}} \widetilde{\mu}_h(s') \widetilde{V}_{h+1}(s')\mathrm{d}s'$. With this new notation, we
reformulate $(a_1)$ as
$$(a_1) = \bigg|\phi(s_h^{\tau}, a_h^{\tau})^{\top} \int_{\mathcal{S}} \big(\widetilde{\mu}_h(s') - \mu_h^0(s')\big) \widetilde{V}_{h+1}^2(s')\mathrm{d}s'\bigg|.$$
Following the steps in Lemma 11, i.e., equations (59)-(61) with $\lambda_0 = 1$, we can obtain
$$(a_1) \leq \bigg(H + \max_{\alpha \in [\min_s \widetilde{V}_{h+1}(s), \max_s \widetilde{V}_{h+1}(s)]} \bigg\|\sum_{\tau' \in \widetilde{\mathcal{D}}_h^0} \phi(s_h^{\tau'}, a_h^{\tau'}) \epsilon_h^{\tau'}(\alpha, \widetilde{V}_{h+1}^2)\bigg\|_{(\widetilde{\Lambda}_h)^{-1}}\bigg) \sum_{i=1}^{d} \|\phi_i(s_h^{\tau}, a_h^{\tau}) 1_i\|_{\widetilde{\Lambda}_h^{-1}}$$
$$\leq \bigg(H + 2\sqrt{2} H + 2 \sup_{\alpha \in \mathcal{N}(\epsilon_0, H)} \bigg\|\sum_{\tau' \in \widetilde{\mathcal{D}}_h^0} \phi(s_h^{\tau'}, a_h^{\tau'}) \epsilon_h^{\tau'}(\alpha, \widetilde{V}_{h+1}^2)\bigg\|_{(\widetilde{\Lambda}_h)^{-1}}\bigg) \sum_{i=1}^{d} \|\phi_i(s_h^{\tau}, a_h^{\tau}) 1_i\|_{\widetilde{\Lambda}_h^{-1}}, \tag{99}$$
where $\epsilon_h^{\tau'}(\alpha, V) = \int_{\mathcal{S}} P_h^0(s'|s_h^{\tau'}, a_h^{\tau'})[V]_{\alpha}(s')\mathrm{d}s' - [V]_{\alpha}(s_{h+1}^{\tau'})$ for any $V : \mathcal{S} \to [0, H]$, $\tau' \in \widetilde{\mathcal{D}}_h^0$ and $\alpha \in [\min_s V(s), \max_s V(s)]$. Since $\widetilde{V}_{h+1}$ is independent of $\widetilde{\mathcal{D}}_h^0$, we can directly apply Lemma 2 following the same arguments as in Lemma 12. Therefore, with probability exceeding $1 - \delta$, we have
$$\sup_{\alpha \in \mathcal{N}(\epsilon_0, H)} \bigg\|\sum_{\tau' \in \widetilde{\mathcal{D}}_h^0} \phi(s_h^{\tau'}, a_h^{\tau'}) \epsilon_h^{\tau'}(\alpha, \widetilde{V}_{h+1}^2)\bigg\|_{(\widetilde{\Lambda}_h)^{-1}} \leq H^2 \sqrt{2\log(3HK/\delta) + d\log(1 + K)} \leq c_a H^2 \sqrt{d}, \tag{100}$$
where $c_a = 3\log(3HK/\delta)$. Therefore, with probability exceeding $1 - \delta$,
$$(a_1) \leq 6 c_a H^2 \sqrt{d} \sum_{i=1}^{d} \|\phi_i(s_h^{\tau}, a_h^{\tau}) 1_i\|_{\widetilde{\Lambda}_h^{-1}}.$$
Consequently,
$$(a) \leq (a_1) + 2H (a_2) \leq 12 c_a H^2 \sqrt{d} \sum_{i=1}^{d} \|\phi_i(s_h^{\tau}, a_h^{\tau}) 1_i\|_{\widetilde{\Lambda}_h^{-1}}.$$
Denote $\widetilde{\iota}_h(s, a) = \mathbb{B}_h^{\rho} \widetilde{V}_{h+1}(s, a) - \widetilde{Q}_h(s, a)$ for any $(s, a) \in \mathcal{S} \times \mathcal{A}$, and
$$P_{h,s,\pi_h^{\star}(s)}^{\inf,\widetilde{V}}(\cdot) := \arg\min_{P(\cdot) \in \mathcal{P}^{\rho}(P_{h,s,\pi_h^{\star}(s)}^0)} \int_{\mathcal{S}} P(s') \widetilde{V}_{h+1}(s')\mathrm{d}s' \tag{102}$$
for any $s \in \mathcal{S}$ and $h \in [H]$, where the equality is from $V_{H+1}^{\star,\rho}(s) = \widetilde{V}_{H+1}(s) = 0$ for any $s \in \mathcal{S}$, and we denote
$$\prod_{j=t}^{t-1} \widetilde{P}_j^{\inf}(s) = \mathbb{1}_s \quad \text{and} \quad \widetilde{d}_{h:t}^{\star} = d_h^{\star} \prod_{j=h}^{t-1} \widetilde{P}_j^{\inf} \in \mathcal{D}_t^{\star}.$$
Note that for any $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$,
$$\max_{s \in \mathcal{S}} \big|V_{h+1}^{\star,\rho}(s) - \widetilde{V}_{h+1}(s)\big| \leq \max_{(s,a) \in \mathcal{S} \times \mathcal{A}} c_b H^2 \sqrt{d} \sum_{i=1}^{d} \|\phi_i(s, a) 1_i\|_{\widetilde{\Lambda}_h^{-1}}.$$
Therefore,
$$(b) \leq \max_{(s,a) \in \mathcal{S} \times \mathcal{A}} 4 c_b H^3 \sqrt{d} \sum_{i=1}^{d} \|\phi_i(s, a) 1_i\|_{\widetilde{\Lambda}_h^{-1}},$$
where
$$|\iota_h^{\sigma}(s, a)| \leq \big|(\widehat{\mathbb{B}}_h^{\rho,\sigma} \widehat{V}_{h+1})(s, a) - (\mathbb{B}_h^{\rho} \widehat{V}_{h+1})(s, a)\big| + \Gamma_h^{\sigma}(s, a). \tag{107}$$
$$\big|(\widehat{\mathbb{B}}_h^{\rho,\sigma} \widehat{V}_{h+1})(s, a) - (\mathbb{B}_h^{\rho} \widehat{V}_{h+1})(s, a)\big| \leq \big(5\sqrt{d} + \sqrt{2 M_{3,h}}\big) \sum_{i=1}^{d} \|\phi_i(s, a) 1_i\|_{\Sigma_h^{-1}}, \tag{108}$$
where $c_b = 12\log(3HK/\delta)$. Substituting (109) into (108) and combining with (107) result in
$$|\iota_h^{\sigma}(s, a)| \leq (5 + c_b + \xi_1) H \sqrt{d} \sum_{i=1}^{d} \|\phi_i(s, a) 1_i\|_{\Sigma_h^{-1}} \leq 8 c_b H \sqrt{d} \sum_{i=1}^{d} \|\phi_i(s, a) 1_i\|_{\Sigma_h^{-1}}, \quad \forall (s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H].$$
Therefore,
$$(c) \leq 4H \max_{s \in \mathcal{S}} \big|V_{h+1}^{\star,\rho}(s) - \widehat{V}_{h+1}(s)\big| \leq \sup_{(s,a) \in \mathcal{S} \times \mathcal{A}} 32 c_b H^3 \sqrt{d} \sum_{i=1}^{d} \|\phi_i(s, a) 1_i\|_{\Sigma_h^{-1}}.$$
Step 4: Finishing up. Then, with probability at least $1 - 3\delta$, we have
$$\big|\mathsf{V}_{P_h^0} V_{h+1}^{\star,\rho}(s_h^{\tau}, a_h^{\tau}) - \widehat{\sigma}_h^2(s_h^{\tau}, a_h^{\tau})\big| \leq \sup_{(s,a) \in \mathcal{S} \times \mathcal{A}} 7 c_b H^3 \sqrt{d} \sum_{i=1}^{d} \|\phi_i(s, a) 1_i\|_{\widetilde{\Lambda}_h^{-1}}, \tag{110}$$