
Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning


Ashwin Pananjady⋆ and Martin J. Wainwright†,‡

⋆ Simons Institute for the Theory of Computing, UC Berkeley


† Departments of EECS and Statistics, UC Berkeley
‡ Voleon Group, Berkeley

September 17, 2020



Abstract

Markov reward processes (MRPs) are used to model stochastic phenomena arising in opera-
tions research, control engineering, robotics, and artificial intelligence, as well as communication
and transportation networks. In many of these cases, such as in the policy evaluation problem
encountered in reinforcement learning, the goal is to estimate the long-term value function of
such a process without access to the underlying population transition and reward functions.
Working with samples generated under the synchronous model, we study the problem of esti-
mating the value function of an infinite-horizon, discounted MRP on finitely many states in the
ℓ∞-norm. We analyze both the standard plug-in approach to this problem and a more robust
variant, and establish non-asymptotic bounds that depend on the (unknown) problem instance,
as well as data-dependent bounds that can be evaluated based on the observations of state-
transitions and rewards. We show that these approaches are minimax-optimal up to constant
factors over natural sub-classes of MRPs. Our analysis makes use of a leave-one-out decoupling
argument tailored to the policy evaluation problem, one which may be of independent interest.

1 Introduction
A variety of applications spanning science and engineering use Markov reward processes as models
for real-world phenomena, including queuing systems, transportation networks, robotic exploration,
game playing, and epidemiology. In some of these settings, the underlying parameters that govern
the process are known to the modeler, but in others, these must be estimated from observed data.
A salient example of the latter setting, which forms the main motivation for this paper, is the
policy evaluation problem encountered in Markov decision processes (MDPs) and reinforcement
learning [Ber95a, Ber95b, SB18]. Here an agent operates in an environment whose dynamics are
unknown: at each step, it observes the current state of the environment, and takes an action that
changes its state according to some stochastic transition function determined by the environment.
The goal is to evaluate the utility of some policy, that is, a mapping from states to actions, where
utility is measured using rewards that the agent receives from the environment. These rewards are
usually assumed to be additive over time, and since the policy determines the action to be taken at
each state, the reward obtained at any time is simply a function of the current state of the agent.
Thus, this setting induces a Markov reward process (MRP) on the state space, in which both the
underlying transitions and rewards are unknown to the agent. The agent only observes samples of
state transitions and rewards.

Given these samples, the goal of the agent is to estimate the value function of the MRP. As
noted above, in the context of Markov decision processes (MDPs), this problem is known as policy
evaluation. The value function evaluated at a given state measures the expected long-term reward
accumulated by starting at that state and running the underlying Markov chain. In applications,
this value function encodes crucial information about the MRP. For example, there are MRPs
in which the value function corresponds to the probability of a power grid failing [FMP08], the
taxi times of flights in an airport [BGSL08], or the value of a board configuration in a game of
Go [SSM07]. Moreover, policy evaluation is an important component of many policy optimization
algorithms for reinforcement learning, which use it as a sub-routine while searching for good policies
to deploy in the environment.
The focus of this paper is on understanding the policy evaluation problem in finite-state (or
tabular) MRPs in an instance-dependent manner, focusing on the generative setting in which
the agent has access to a simulator that generates samples from the underlying MRP. In particular,
we would like guarantees on the sample complexity of policy evaluation—defined as the number of
samples required to obtain a value function estimate of some pre-specified error tolerance—as a
function of the agent’s environment, i.e., the transition and reward functions induced by the policy
being evaluated. Local guarantees of this form provide more guidance for algorithm design in finite
sample settings than their worst-case counterparts. Indeed, this viewpoint underpins the important
sub-field of local minimax complexity studied widely in the statistics and optimization literatures
(e.g., [CL04, ZCDL16, WW20]), as well as in more recent work on online reinforcement learning
algorithms [ZB19].
As a natural first step towards providing local guarantees for the policy evaluation problem,
we analyze the plug-in estimator for the problem, which estimates the underlying transition and
reward functions from the samples, and outputs the value function of the MRP in which these
estimates correspond to the ground truth parameters. We also analyze a robust variant of this
approach, and provide minimax lower bounds that hold over subsets of the parameter space.

Related work: Markov reward processes have a rich history originating in the theory of Markov
chains and renewal processes; we refer the reader to the classical books [Fel66] and [Dur99] for
introductions to the subject. The policy evaluation problem has seen considerable interest in
the stochastic control and reinforcement learning communities, and various algorithms have been
analyzed in both asymptotic [Bor98, Tad04] and non-asymptotic [LS18, SY19] settings. Chapter 3
of the monograph by Szepesvári [Sze09] provides a brief introduction to these methods, and the
recent survey by Dann et al. [DNP14] focuses on methods based on temporal differences [Sut88].
In the language of temporal difference (TD) algorithms, the plug-in approach that we analyze
corresponds to the least squares temporal difference (LSTD) solution [BB96] in the tabular setting,
without function approximation. While TD algorithms for policy evaluation have been analyzed
by many previous papers, their focus is typically either on (i) how function approximation affects
the algorithm [TVR97], (ii) asymptotic convergence guarantees [Bor98, Tad04] or (iii) establishing
convergence rates in metrics of the ℓ2-type [Tad04, LS18, SY19]. Since ℓ2-type metrics can be
associated with an inner product, many specialized analyses can be ported over from the literature
on stochastic optimization (e.g., [BM11, NJLS09]).¹ On the other hand, our focus is on providing
non-asymptotic guarantees in the ℓ∞-error metric, since these are particularly compatible with
¹ Here we have only referenced some representative papers; see the references in Szepesvári [Sze09] for a broader overview.

| Problem | Algorithm | Paper | Model | Sample-size | Guarantee | Technique |
|---|---|---|---|---|---|---|
| State-action value estimation in MDPs | Plug-in | [KS99], [Kak03] | Synchronous | Non-asymptotic | Global, ℓ∞ | Hoeffding |
| | Plug-in | [AMK13] | Synchronous | Non-asymptotic | Global, ℓ∞ | Bernstein |
| | Stochastic approximation: Q-learning & variants | [BM00] | Synchronous | Asymptotic | Global, conv. in dist. | ODE method |
| | | [DM17] | Synchronous | Asymptotic | Local, conv. in dist. | Asymptotic normality |
| | | [Wai19b], [CMSS20] | Synchronous | Non-asymptotic | Local, ℓ∞ | Bernstein, Moreau envelope |
| | | [AMGK11], [SWWY18], [Wai19c] | Synchronous | Non-asymptotic | Global, ℓ∞ | Bernstein, variance reduction |
| Optimal value estimation in MDPs | Plug-in | [AMK13] | Synchronous | Non-asymptotic | Global, ℓ∞ | Bernstein |
| | Plug-in | [AKY20] | Synchronous | Non-asymptotic | Global, ℓ∞ | Bernstein + decoupling |
| | Stochastic approximation | [SWW+18] | Synchronous | Non-asymptotic | Global, ℓ∞ | Bernstein + variance reduction |
| Policy evaluation in MRPs | Plug-in | Current paper | Synchronous | Non-asymptotic | Local, ℓ∞ | Bernstein + leave-one-out |
| | Stochastic approximation: TD-learning | [Tad04], [PJ92], [JJS94], [BM00], [DM17] | Synchronous, trajectories | Asymptotic | Local, ℓ2 and conv. in dist. | Averaging, ODE method |
| | | [LS18], [BRS18], [SY19] | Synchronous, trajectories | Non-asymptotic | Global, ℓ2 | Averaging, martingales |
| | TD-learning with function approximation | [TVR97], [UKM+08] | Trajectories | Asymptotic | Global oracle inequality; local, conv. in dist. | Asymptotic normality |
| | | [BRS18], [DMR19], [DSTM18] | Synchronous, trajectories | Non-asymptotic | Global and local, ℓ2 | Population to sample |
| | Median of means | Current paper | Synchronous | Non-asymptotic | Local, ℓ∞ | Robustness |
Table 1. A subset of results in the tabular and infinite-horizon discounted setting, both for policy
evaluation in MRPs and policy optimization in MDPs. For a broader overview of results, see
Gosavi [Gos09] for the setting of infinite-horizon average reward, and Dann and Brunskill [DB15] for
the episodic setting. The "technique" column of the table is only meant to showcase a representative
subset of those employed. Our contributions correspond to the rows labeled "Current paper".

policy iteration methods. In particular, policy iteration can be shown to converge at a geometric
rate when combined with policy evaluation methods that are accurate in ℓ∞-norm (e.g., see the
books [AJK19, BT96]). Also, given that we are interested in fine-grained, instance-dependent
guarantees, we first study the problem without function approximation.
As briefly alluded to before, there has also been some recent focus on obtaining instance-
dependent guarantees in online reinforcement learning settings [MMM14, SJ19, ZKB19]. These
analyses have led to more practically applicable algorithms for certain episodic MDPs [ZB19, JA18]
that improve upon worst-case bounds [AOM17]. Recent work has also established some instance-
dependent bounds for the problem of state-action value function estimation in Markov decision
processes, for both ordinary Q-learning [Wai19b] and a variance-reduced improvement [Wai19c].
However, we currently lack the localized lower bounds that would allow us to understand the
fundamental limits of the problem in a more local sense, except in some special cases for asymptotic
settings; for instance, see Ueno et al. [UKM+ 08] and Devraj and Meyn [DM17] for bounds of this
type for LSTD and stochastic approximation, respectively. We hope that our analysis of the simpler
policy evaluation problem will be useful in broadening the scope of such guarantees.
Portions of our analysis exploit a decoupling that is induced by a leave-one-out technique. We
note that leave-one-out techniques are frequently used in probabilistic analysis (e.g., [BE02, dlPG12,
MWCC18]). In the context of Markov processes, arguments that are related to but distinct from
those in this paper have been used in analyzing estimates of the stationary distribution of a Markov
chain [CFMW19], and for analyzing optimal policies in reinforcement learning [AKY20].
For the reader’s convenience, we have collected many of the relevant results both in policy
optimization and evaluation in Table 1, along with the settings and sample-size regimes in which
they apply, the nature of the guarantee, and the salient techniques used.

Contributions: We study the problem of estimating the infinite-horizon, discounted value func-
tion of a tabular MRP in ℓ∞-norm, assuming access to state transitions and reward samples under
the generative model. Our first main result, Theorem 1, analyzes the plug-in estimator, showing
two types of guarantees: on one hand, we derive high-probability upper bounds on the error that
can be computed based on the observed data, and on the other, we show upper bounds that depend
on the underlying (unknown) population transition matrix and reward function. The latter result
is achieved via a decoupling argument that we expect to be more broadly applicable to problems
of this type.
Corollary 1 then specializes the population-based result in Theorem 1 to natural sub-classes of
MRPs. Theorem 2 provides minimax lower bounds for these sub-classes, showing—in conjunction
with Corollary 1—that the plug-in approach is minimax optimal over the class of MRPs with
uniformly bounded reward functions. However, these results suggest that the plug-in approach is
not minimax-optimal over the class of MRPs having value functions with bounded variance under
the transition model (this notion is defined precisely in Section 3 to follow). Consequently, we
analyze an approach based on the median-of-means device and show that this modified estimator
is minimax optimal over the class of MRPs having value functions with bounded variance.
The benefits of our instance-dependent guarantees are even evident in a model as simple as the
3-state MRP illustrated in Figure 1. Suppose that we observe noiseless rewards of this MRP and
wish to compute its infinite-horizon value function with discount factor γ ∈ (0, 1). Bounds based
on the contractivity of the Bellman operator [KS99, Kak03, Wai19b] imply that the ℓ∞-error of the
plug-in estimate scales proportionally to 1/(1 − γ)². The worst-case bounds of Azar et al. [AMK13]

[Figure: three states with rewards r = 0, r = 1/2, and r = 1; the middle state transitions to each of its two neighbors with probability 1/2, and the two outer states are absorbing.]
Figure 1: A simple 3-state Markov reward process.

imply a rate 1/(1 − γ)^{3/2}. But the optimal local result captured in this paper shows that the error
is only proportional to 1/(1 − γ). For a discount factor γ = 0.99, this improves the previous bounds
by factors of 100 and 10, respectively, and consequently, the respective sample complexities by
factors of 10⁴ and 10², respectively. Instance-dependent results therefore allow us to differentiate problems that
are “solvable” with finite samples from those that are not.

Notation: For a positive integer n, let [n] := {1, 2, . . . , n}. For a finite set S, we use |S| to
denote its cardinality. We use c, C, c1, c2, . . . to denote universal constants that may change from
line to line. We use the convenient shorthand a ∨ b := max{a, b} and a ∧ b = min{a, b}. We let 1
denote the all-ones vector in R^D, and abusing notation slightly, we let 1{E} denote the indicator
of an event E. Let e_j denote the j-th standard basis vector in R^D. We let v_(i) denote the i-th order
statistic of a vector v, i.e., the i-th largest entry of v. For a pair of vectors (u, v) of compatible
dimensions, we use the notation u ⪯ v to indicate that the difference vector v − u is entry-wise
non-negative. The relation u ⪰ v is defined analogously. We let |u| denote the entry-wise absolute
value of a vector u ∈ R^D; squares and square roots of vectors are, analogously, taken entry-wise.
Note that for a positive scalar λ, the statements |u| ⪯ λ · 1 and ‖u‖∞ ≤ λ are equivalent. Finally,
we let ‖M‖_{1,∞} denote the maximum ℓ1-norm of the rows of a matrix M, and refer to it as the
(1, ∞)-operator norm of a matrix.

2 Background and Problem Formulation


In this section, we introduce the basic notation required to specify a Markov reward process, and
formally define the problem of estimating value functions in the generative setting.

2.1 Markov reward processes and value functions


We study Markov reward processes defined on a finite set X of states, indexed as X = {1, 2, . . . , D}.
The state evolution over time is determined by a set of transition functions {P( · | x), x ∈ X}, with
the transition from state x to the next state being randomly chosen according to the distribution
P ( · | x). For notational convenience, we let P ∈ [0, 1]D×D denote a row stochastic (Markov)
transition matrix, where row j of this matrix—which we denote by pj —collects the transition
function of the j-th state. Also associated with an MRP is a population reward function r : X → R:
transitioning from state x results in the reward r(x). For convenience, we engage in a minor abuse
of notation by letting r also denote a vector of length D, with rj corresponding to the reward
obtained at the j-th state.
In this paper, we consider the infinite-horizon, discounted reward as our notion for the long-term
value of a state in the MRP. In particular, for a scalar discount factor γ ∈ (0, 1), the long-term

value of state x in the MRP is given by
"∞ #
X
θ∗ (x) : = E γ k r(xk ) x0 = x , where xk ∼ P ( · | xk−1 ) for all k ≥ 1.

k=0

In words, this measures the expected discounted reward obtained by starting at the state x, where
the expectation is taken with respect to the random transitions over states. Once again, we use θ∗
to also denote a vector of length D, where θj∗ corresponds to the value of the j-th state.
A note to the reader: in the sequel, we often reference a state simply by its index, and often
refer to the state space X ≡ [D]. Accordingly, we also use P ( · | j) to denote the transition
function corresponding to state j ∈ [D].

2.2 Observation model


Given access to the true transition and reward functions, it is straightforward, at least in principle,
to compute the value function. By definition, it is the unique solution of the Bellman fixed point
relation

θ∗ = r + γPθ∗ . (1)

In the learning setting, the pair (P, r) is unknown, and we instead assume access to a black box
that generates samples from the transition and reward functions. In this paper, we operate under a
setting known as the synchronous or generative setting; it is a stylized observation model that has
been used extensively in the study of Markov decision processes (see Kearns and Singh [KS99] for an
introduction). Let us introduce it in the context of MRPs: for a given sample index k = 1, 2, . . . , N
and for each state j ∈ [D], we observe a random next state Xk,j ∈ [D] drawn according to the
transition function P ( · | j), and a random reward Rk,j drawn from a conditional distribution
Dr ( · | j). Throughout, we assume that the rewards are generated independently across states,
with E[Rk,j ] = rj . Letting ρ(r) denote a non-negative vector indexed by the states j ∈ [D], we
assume the conditional distributions {Dr ( · | j), j ∈ [D]} are ρ(r)-sub-Gaussian, meaning that for
each j ∈ [D], we have
    E_{R∼D_r(·|j)}[ e^{λ(R − r_j)} ] ≤ e^{λ² ρ_j²(r) / 2}   for all λ ∈ R.   (2)

With N such i.i.d. samples in hand, our goal is to estimate the value function θ∗ in the ℓ∞-error
metric.
Such a goal is particularly relevant to the policy evaluation problem described in the introduc-
tion, since ℓ∞-estimates of the value function can be used in conjunction with a policy improvement
sub-routine to eventually arrive at an optimal policy (see, e.g., Section 1.2.2. of the recent mono-
graph [AJK19]). We note in passing that bounds proved under the generative model may be
translated into the more challenging online setting via the notion of Markov cover times (see, e.g.,
the papers [EDM03, AMGK11] for conversions of this type for Markov decision processes).
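To make the population objects and the synchronous observation model concrete, the following minimal sketch (ours, not from the paper) computes the exact value function by solving the Bellman relation (1) and draws one round of synchronous samples. It assumes NumPy; Gaussian reward noise is used as one particular instance of the sub-Gaussian condition (2), and all function and variable names are illustrative choices rather than notation from the paper.

```python
import numpy as np

def exact_value_function(P, r, gamma):
    """Solve the Bellman fixed-point relation theta* = r + gamma * P @ theta*."""
    D = P.shape[0]
    return np.linalg.solve(np.eye(D) - gamma * P, r)

def draw_synchronous_sample(P, r, reward_std, rng):
    """One synchronous sample: for every state j, one next state drawn from
    P(. | j), encoded as a binary matrix Z, plus one noisy reward observation."""
    D = P.shape[0]
    Z = np.zeros((D, D))
    for j in range(D):
        next_state = rng.choice(D, p=P[j])
        Z[j, next_state] = 1.0
    # Gaussian noise is one way to satisfy the sub-Gaussian condition (2).
    R = r + reward_std * rng.standard_normal(D)
    return Z, R

# Illustrative usage on a toy 2-state MRP.
rng = np.random.default_rng(0)
P = np.array([[0.5, 0.5], [0.0, 1.0]])
r = np.array([1.0, 0.5])
theta_star = exact_value_function(P, r, gamma=0.9)
Z, R = draw_synchronous_sample(P, r, reward_std=0.1, rng=rng)
```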

3 Main results
We now turn to the statement and discussion of our main results. We begin by providing ℓ∞-guarantees
on value function estimation for the natural plug-in approach.

3.1 Guarantees for the plug-in approach
A natural approach to this problem is to use the observations to construct estimates (P̂, r̂) of the
pair (P, r), and then substitute or "plug in" these estimates into the Bellman equation, thereby
obtaining the value function of the MRP having transition matrix P̂ and reward vector r̂.
In order to define the plug-in estimator, let us introduce some helpful notation. For each time
index k, we use the associated set of state samples {X_{k,j}, j ∈ [D]} to form a random binary matrix
Z_k ∈ {0, 1}^{D×D}, in which row j has a single non-zero entry, determined by the sample X_{k,j}. Thus,
the location of the non-zero entry in row j is drawn from the probability distribution defined by p_j,
the j-th row of P. Recall that our observations also include the stochastic reward vectors {R_k}_{k=1}^N
sampled from the reward distribution D_r. Based on these observations, we define the sample means

    P̂ = (1/N) Σ_{k=1}^{N} Z_k   and   r̂ = (1/N) Σ_{k=1}^{N} R_k,   (3)

which can be seen as unbiased estimates of the transition matrix P and the reward vector r,
respectively.
The estimates (P̂, r̂) define a new MRP, and its value function is given by the fixed point relation

    θ̂_plug = r̂ + γ P̂ θ̂_plug.   (4)

Solving this fixed point equation, we obtain the closed form expression θ̂_plug = (I − γP̂)^{−1} r̂ for
the plug-in estimator. Note that the terminology "plug-in" arises from the fact that θ̂_plug is obtained by
substituting the estimates (P̂, r̂) into the original Bellman equation (1). We also note that in this
special case, that is, the tabular setting without function approximation, the plug-in estimate is
equivalent to the LSTD solution [BB96, Boy02].
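For concreteness, here is a minimal sketch (ours, not the authors' code) of the plug-in estimator defined by equations (3) and (4); the array shapes and identifier names are our own choices.

```python
import numpy as np

def plugin_estimator(Z_samples, R_samples, gamma):
    """Plug-in value function estimate from synchronous samples.

    Z_samples: array of shape (N, D, D) of binary next-state indicator matrices.
    R_samples: array of shape (N, D) of observed rewards.
    Returns (theta_hat_plug, P_hat, r_hat), where
    theta_hat_plug = (I - gamma * P_hat)^{-1} r_hat.
    """
    P_hat = Z_samples.mean(axis=0)   # unbiased estimate of the transition matrix P
    r_hat = R_samples.mean(axis=0)   # unbiased estimate of the reward vector r
    D = P_hat.shape[0]
    theta_hat = np.linalg.solve(np.eye(D) - gamma * P_hat, r_hat)
    return theta_hat, P_hat, r_hat
```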
In order to establish guarantees for the estimator θbplug , we require some additional notation. As
mentioned before, we are interested in non-asymptotic, instance-dependent guarantees of two types:
the first is a bound that can be evaluated in practice from the observed data, and the second is a
guarantee that depends on the unknown population quantities P and r. For each vector θ ∈ RD ,
define the vector of empirical variances
    σ̂²(θ) = Ê[ |(Z − P̂)θ|² ],

where Ê denotes expectation over the empirical distribution (i.e., the random matrix Z is drawn
uniformly at random from the set {Z_k}_{k=1}^N). Note that given θ, this quantity is computable purely
from the observed samples. On the other hand, the population result will involve the population
variance vector

    σ²(θ) = E[ |(Z − P)θ|² ],

where in this case Z is drawn according to the population model P. As a final definition, the span
semi-norm of a value function θ is given by

    ‖θ‖_span := max_{x∈X} θ(x) − min_{x∈X} θ(x).

Equivalently, the span semi-norm is equal to the variation of the vector θ ∈ RD ; see Puter-
man [Put05] for more details.

We are now ready to state our main result for the plug-in estimator.
Theorem 1. There is a pair of universal constants (c1, c2) such that if N ≥ c1 γ² log(8D/δ)/(1 − γ)²,
then each of the following statements holds with probability at least 1 − δ.

(a) We have

    ‖θ̂_plug − θ∗‖∞ ≤ c2 { √(log(8D/δ)/N) ( γ‖(I − γP̂)^{−1} σ̂(θ̂_plug)‖∞ + ‖ρ(r)‖∞/(1 − γ) ) + (log(8D/δ)/N) · γ‖θ̂_plug‖_span/(1 − γ) }.   (5a)

(b) We have

    ‖θ̂_plug − θ∗‖∞ ≤ c2 { √(log(8D/δ)/N) ( γ‖(I − γP)^{−1} σ(θ∗)‖∞ + ‖ρ(r)‖∞/(1 − γ) ) + (log(8D/δ)/N) · γ‖θ∗‖_span/(1 − γ) }.   (5b)

It is worth making a few comments on this theorem, which provides two instance-dependent
upper bounds on the error of the plug-in approach. Assuming for simplicity of discussion² that
the maximum reward noise parameter ‖ρ(r)‖∞ is known, part (a) of the theorem provides a
bound that can be evaluated based on the observed data; bounds of this form are especially useful
in downstream analyses. For instance, a central consideration in policy iteration methods is to
obtain "good enough" value function estimates θ̂ for fixed policies, in that we have ‖θ̂ − θ∗‖∞ ≤ ε
for some prescribed tolerance ε. Theorem 1(a) provides a method by which such a bound may be
verified for the plug-in approach: compute the statistic on the RHS of bound (5a); if this is less
than ε, then the bound ‖θ̂_plug − θ∗‖∞ ≤ ε holds with probability exceeding 1 − δ.
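The sketch below (ours, not from the paper) illustrates how the data-dependent quantity on the right-hand side of bound (5a) could be evaluated from samples. The universal constant c2 and the reward-noise bound ‖ρ(r)‖∞ (passed here as `c2` and `rho_max`) are left as user-supplied inputs, since their values are not pinned down in this excerpt.

```python
import numpy as np

def plugin_error_certificate(Z_samples, theta_hat, P_hat, gamma, rho_max, delta, c2=1.0):
    """Evaluate the data-dependent statistic on the RHS of bound (5a)."""
    N, D, _ = Z_samples.shape
    log_term = np.log(8 * D / delta)
    # Empirical variance vector: sigma_hat^2(theta) = E_hat |(Z - P_hat) theta|^2.
    devs = (Z_samples - P_hat) @ theta_hat            # shape (N, D)
    sigma_hat = np.sqrt(np.mean(devs ** 2, axis=0))   # entry-wise standard deviation
    resolvent = np.linalg.inv(np.eye(D) - gamma * P_hat)
    span = theta_hat.max() - theta_hat.min()          # span semi-norm of theta_hat
    first = np.sqrt(log_term / N) * (
        gamma * np.max(resolvent @ sigma_hat) + rho_max / (1.0 - gamma))
    second = (log_term / N) * gamma * span / (1.0 - gamma)
    return c2 * (first + second)
```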
On the other hand, Theorem 1(b) provides a guarantee that depends on the unknown problem
instance. From the perspective of the analysis, this is the more difficult bound to establish, since
it requires a leave-one-out technique to decouple dependencies between the estimate θ̂_plug and the
matrix P̂. We expect our technique (presented in full in Section 5.2) and its variants to be more
broadly useful in analyzing other problems in reinforcement learning besides the policy evaluation
problem considered here.
Third, note that our lower bound on the sample size, which evaluates to N ≥ c1 log(8D/δ)/(1 − γ)²
for any strictly positive discount factor, is unavoidable in general. In particular, for any fixed
reward-noise parameter kρ(r)k∞ > 0, this condition is required in order to obtain a consistent
estimate of the value function.3 On the other hand, in the special case of deterministic rewards
(kρ(r)k∞ = 0), we suspect that this condition can be weakened, but leave this for future work.
² We note that when ρ(r) is not known but the reward distribution is (say) Gaussian, it is straightforward to
provide an entry-wise upper bound for it by computing the empirical standard deviation of rewards from samples,
and using this to define a high-probability and data-dependent bound on the sub-Gaussian parameter.
³ For instance, even with known transition dynamics, estimating the value function of a single state to within
additive error ε requires Ω( 1/((1 − γ)² ε²) ) samples of the noisy reward.

Finally, it is worth noting that there are two terms in the bounds of Theorem 1: the first term
corresponds to a notion of standard deviations of the estimated/true value function and reward, and
the second depends on the span semi-norm of the value function. Are both of these terms necessary?
What is the optimal rate at which any value function can be estimated? These questions motivate
the analysis to be presented in the following section.

3.2 Is the plug-in approach optimal?


In order to study the question of optimality, we adopt the notion of local minimax risk, in which the
performance of an estimator is measured in a worst-case sense locally over natural subsets of the
parameter space. Our upper bounds depend on the problem instance via the standard deviation
function σ(θ∗ ), the reward standard deviation ρ(r), and the span semi-norm of θ∗ . Accordingly, we
define the following subsets⁴ of Markov reward processes (MRPs):

    M_var(ϑ, %) := { set of all MRPs s.t. ‖σ(θ∗)‖∞ ≤ ϑ and ‖ρ(r)‖∞ ≤ % },   (6a)
    M_vfun(ζ, %) := { set of all MRPs s.t. ‖θ∗‖_span ≤ ζ and ‖ρ(r)‖∞ ≤ % },   and   (6b)
    M_rew(r_max, %) := { set of all MRPs s.t. ‖r‖∞ ≤ r_max and ‖ρ(r)‖∞ ≤ % }.   (6c)

Letting M be any one of these sets, we use the shorthand θ ∈ M to mean that θ is the value
function of some MRP in the set M. Each choice of the set M defines the local minimax risk given
by
    inf_{θ̂} sup_{θ∗ ∈ M} E[ ‖θ̂ − θ∗‖∞ ],

where the infimum ranges over all measurable functions θ̂ of N observations from the generative
model. With this set-up, we can now state some lower bounds in terms of such local minimax risks:

Theorem 2. There is a pair of absolute constants (c1, c2) such that for all γ ∈ [1/2, 1) and sample
sizes N ≥ c1 log(D/2)/(1 − γ), the following statements hold.

(a) For each triple of positive scalars (ϑ, ζ, %) satisfying⁵ ϑ ≤ ζ√(1 − γ), we have

    inf_{θ̂} sup_{θ∗ ∈ M_var(ϑ,%) ∩ M_vfun(ζ,%)} E[ ‖θ̂ − θ∗‖∞ ] ≥ c2 (ϑ + %)/(1 − γ) · √(log(D/2)/N).   (7a)

(b) For each pair of positive scalars (r_max, %) satisfying r_max ≥ % √(log D / N), we have

    inf_{θ̂} sup_{θ∗ ∈ M_rew(r_max,%)} E[ ‖θ̂ − θ∗‖∞ ] ≥ c2/(1 − γ) · √(log(D/2)/N) · ( r_max/(1 − γ)^{1/2} + % ).   (7b)
⁴ The following mnemonic device may help the reader appreciate and remember notation: the symbol ϑ, or
"vartheta", stands for a measure of the variability in the value function θ; the symbol %, or "varrho", represents the
variability in reward samples, and r_max represents the maximum absolute reward mean.
⁵ We conjecture that this lower bound can be proved under the weaker condition ϑ ≤ ζ, thereby matching the
condition present in Corollary 1(a).

Equipped with these lower bounds, we can now assess the local minimax optimality of the
plug-in estimator. In order to facilitate this comparison, let us state a corollary of Theorem 1
that provides bounds on the worst-case error of the plug-in estimator over particular subsets of
the parameter space. In order to further simplify the comparison, we restrict our attention to the
range γ ∈ [1/2, 1) covered by the lower bounds.

Corollary 1. There are absolute constants (c3, c4) such that for all γ ∈ [1/2, 1) and sample sizes⁶
N ≥ c3 log(8D/δ)/(1 − γ)², the following statements hold.

(a) Consider a triple of positive scalars (ϑ, ζ, %) such that⁷ ϑ ≤ ζ. Then for any value function
θ∗ ∈ M_var(ϑ, %) ∩ M_vfun(ζ, %), we have

    ‖θ̂_plug − θ∗‖∞ ≤ c4/(1 − γ) { (ϑ + %) √(log(8D/δ)/N) + (log(8D/δ)/N) · ζ }   (8a)

with probability at least 1 − δ.

(b) Consider an arbitrary pair of positive scalars (r_max, %). Then for any value function θ∗ ∈ M_rew(r_max, %),
we have

    ‖θ̂_plug − θ∗‖∞ ≤ c4/(1 − γ) · √(log(8D/δ)/N) · ( r_max/(1 − γ)^{1/2} + % )   (8b)

with probability at least 1 − δ.

By comparing Corollary 1(b) with Theorem 2(b), we see that the plug-in estimator is minimax
optimal (up to constant factors) over the class Mrew (rmax , %). This conclusion parallels that of Azar
et al. [AMK13] for the related problem of optimal state-value function estimation in MDPs. (In our
notation, their work applies to the special case of % = 0, but their analysis can easily be extended
to this more general setting.)
A comparison of part (a) of the two results is more interesting. Here we see that the first term
in the upper bound (8a) matches the lower bound (7a) up to a constant factor. The second term of
inequality (8a), however, does not have an analogous component in the lower bound, and this leads
us to the interesting question of whether the analysis of the plug-in estimator can be sharpened so
as to remove the dependence of the error on the span semi-norm kθ∗ kspan . Proposition 1, presented
in Appendix A, shows that this is impossible in general, and that there are MRPs in which the `∞
error can be lower bounded by a term that is proportional to the span semi-norm.
This raises another natural question: Is there a different estimator whose error can be bounded
independently of the span semi-norm ‖θ∗‖_span, and which is able to achieve the lower bound (7a)? In
the next section, we introduce such an estimator via a median-of-means device.
⁶ As shown in the proof, part (a) of the corollary holds without this assumption on the sample size, but we state
it here to facilitate a direct derivation of Corollary 1 from Theorem 1.
⁷ It is worth noting that the condition ϑ ≤ ζ in part (a) of the corollary does not entail any loss of generality, since
we always have ‖σ(θ∗)‖∞ ≤ ‖θ∗‖_span. Indeed, for MRPs in which ‖σ(θ∗)‖∞ ≪ ‖θ∗‖_span, the second term on the RHS
of inequality (8a) will dominate the bound unless the sample size N is large.

3.3 Closing the gap via the median-of-means method
In many situations, the span semi-norm of a value function θ∗ may be much larger than its variance σ(θ∗)
under the transition model. Such a discrepancy arises when there are states with extremely large
positive (or negative) rewards that are visited with very low probability. In such cases, the second
terms in the bounds (5) dominate the first. It is thus of interest to derive bounds that are purely
“variance-dependent” and independent of the span norm. In order to do so, we analyze a slight
variant of the plug-in approach. In particular, we analyze the median-of-means estimator, which is
a standard robust alternative to the sample mean in other scenarios [NY83, LL20]. In the context
of reinforcement learning, Pazis et al. [PPH16] made use of it for online policy optimization in
MDPs.
In our setting, we only employ median-of-means to obtain a better estimate of the term depending
on the transition matrix; we still use the estimate r̂ defined in equation (3) as our estimate of
the reward function.⁸ Given the data set {Z_k}_{k=1}^N and some vector θ ∈ R^D, the median-of-means
estimate M̂(θ) of the population expectation Pθ is given by the following nonlinear operation:

• First, split the data set into K equal parts denoted {D_1, . . . , D_K}, where each subset D_i has
size m = ⌊N/K⌋.

• Second, compute the empirical mean μ̂_i(θ) := (1/m) Σ_{k∈D_i} Z_k θ for each i ∈ [K].

• Finally, return the quantity M̂(θ) := med( μ̂_1(θ), . . . , μ̂_K(θ) ), where the median, defined for
convenience as the ⌊K/2⌋-th order statistic, is taken entry-wise.

The random operator M̂ defines the median-of-means empirical Bellman operator, given by

    T̂_N^MoM(θ) := r̂ + γ M̂(θ).   (9)

As shown in Lemma 6 (see Section 5), this operator is γ-contractive in the ℓ∞-norm. Consequently,
it has a unique fixed point, which we term the median-of-means value function estimate, denoted
by θ̂_MoM.
In practice, the estimate θ̂_MoM can be found by starting at an arbitrary initialization and
repeatedly applying the γ-contractive operator T̂_N^MoM until convergence.⁹ The following theorem
provides a population-based guarantee on the error of this estimator.
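For concreteness, here is a minimal sketch (ours, not the authors' implementation) of the median-of-means Bellman operator (9) and of the fixed-point iteration just described. Note one small deviation made for simplicity: `np.median` averages the two central order statistics when K is even, rather than returning the ⌊K/2⌋-th order statistic used in the definition above.

```python
import numpy as np

def mom_transition_estimate(Z_samples, theta, K):
    """Median-of-means estimate M_hat(theta) of P @ theta, taken entry-wise."""
    N = Z_samples.shape[0]
    m = N // K
    block_means = np.stack([
        (Z_samples[i * m:(i + 1) * m] @ theta).mean(axis=0) for i in range(K)
    ])                                   # shape (K, D)
    return np.median(block_means, axis=0)

def mom_value_estimate(Z_samples, r_hat, gamma, K, tol=1e-8, max_iter=10_000):
    """Iterate theta <- r_hat + gamma * M_hat(theta); the operator is
    gamma-contractive in the sup-norm, so the iteration converges."""
    theta = np.zeros_like(r_hat)
    for _ in range(max_iter):
        theta_next = r_hat + gamma * mom_transition_estimate(Z_samples, theta, K)
        if np.max(np.abs(theta_next - theta)) < tol:
            return theta_next
        theta = theta_next
    return theta
```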

Theorem 3. Suppose that the median-of-means operator M̂ is constructed with the parameter
choice K = 8 log(4D/δ). Then there is a universal constant c such that we have

    ‖θ̂_MoM − θ∗‖∞ ≤ c/(1 − γ) · √(log(8D/δ)/N) · ( γ‖σ(θ∗)‖∞ + ‖ρ(r)‖∞ )   (10)

with probability exceeding 1 − δ.


⁸ In principle, one could run a median-of-means estimate on the combination of reward and transition, but this is
not necessary in our setting due to the sub-Gaussian assumption on the reward noise (2). Slight modifications of our
techniques also yield bounds for the combined median-of-means estimate assuming only that the standard deviation
of the reward noise is bounded entry-wise by the vector ρ(r).
⁹ Since the operator is γ-contractive, it suffices to run this iterative algorithm for log_γ(ε) steps to obtain an ε-approximate
fixed point in an additive sense.

We have thus achieved our goal of obtaining a purely variance-dependent bound. Indeed, for
each pair of positive scalars (ϑ, %), any value function θ∗ ∈ Mvar (ϑ, %), and reward distribution
satisfying kρ(r)k∞ ≤ %, we have
    ‖θ̂_MoM − θ∗‖∞ ≤ c (ϑ + %)/(1 − γ) · √(log(8D/δ)/N),

with probability exceeding 1 − δ. Integrating this tail bound yields an analogous upper bound on
the expected error, which matches the lower bound (7a) on the expected error up to a constant
factor. As a corollary, we conclude that the minimax risk over the class Mvar (ϑ, %) scales as
    inf_{θ̂} sup_{θ∗ ∈ M_var(ϑ,%)} E[ ‖θ̂ − θ∗‖∞ ] ≍ (ϑ + %)/(1 − γ) · √(log(D)/N),   (11)

and is achieved (up to constant factors) by the estimator θbMoM .


However, our results fall short of showing that the estimator θbMoM is minimax optimal over the
class Mrew (rmax , %) of MRPs with bounded rewards. Indeed, for any value function θ∗ in the class
Mrew (rmax , %), Theorem 3 yields the corollary
    ‖θ̂_MoM − θ∗‖∞ ≤ c/(1 − γ) · √(log(8D/δ)/N) · ( γ r_max/(1 − γ) + % )

with probability exceeding 1 − δ. Comparing inequality (7b) with this bound, we see that our
upper bound on the median-of-means estimator is sub-optimal by a factor (1 − γ)^{−1/2} in the discount
complexity. From a technical standpoint, this is due to the fact that our upper bound in Theorem 3
involves the functional (1/(1 − γ)) ‖σ(θ∗)‖∞ and not the sharper functional ‖(I − γP)^{−1} σ(θ∗)‖∞ present
in Theorem 1(b). We believe that this gap is not intrinsic to the MoM method, and conjecture
that an upper bound depending on the latter functional can be proved for the estimator θbMoM ;
this would guarantee that the median-of-means estimator is also minimax optimal over the class
Mrew (rmax , %).

4 Numerical experiments
In this section, we explore the sharpness of our theoretical predictions, for both the plug-in and
the median-of-means (MoM) estimator. Our bounds predict a range of behaviors depending on
the scaling of the maximum standard deviation kσ(θ∗ )k∞ , and the span semi-norm (for the plug-in
estimator). Let us verify these scalings via some simple experiments.

4.1 Behavior on the “hard” example used for the lower bound
First, we use a simple variant of our lower bound construction illustrated in panel (a) of Figure 2.
This MRP consists of D = 2 states, where state 1 stays fixed with probability p, transitions to
state 2 with probability 1 − p, and state 2 is absorbing. The rewards in states 1 and 2 are given by
ν and ντ , respectively. Here the triple (p, ν, τ ), along with the discount factor γ, are parameters of
the construction.

[Figure 2(a) depicts the two-state MRP R0(p, ν, τ): state 1 has reward r = ν, remains at state 1 with probability p, and transitions to state 2 with probability 1 − p; state 2 has reward r = ν · τ and is absorbing. Figure 2(b) plots the log ℓ∞-error against the complexity log(1/(1 − γ)); its legend reports the slope tuples (β̂_plug, β̂_MoM, β∗): α = 0.00: (1.554, 1.569, 1.500); α = 0.50: (1.045, 1.056, 1.000); α = 1.00: (0.547, 0.564, 0.500).]

Figure 2. (a) Illustration of the MRP R0(p, ν, τ) used in the simulation, and also as a building
block in the lower bound construction of Theorem 2. For the simulation, we choose p = (4γ − 1)/(3γ), let
ν = 1, and set τ = 1 − (1 − γ)^α. (b) Log-log plot of the ℓ∞-error versus the discount complexity
parameter 1/(1 − γ) for both the plug-in estimator (in + markers) and median-of-means estimator
(in • markers) averaged over T = 1000 trials with N = 10⁴ samples each. We have also plotted the
least-squares fits through these points, and the slopes of these lines are provided in the legend. In
particular, the legend contains the tuple of slopes (β̂_plug, β̂_MoM, β∗) for each value of α. Logarithms
are to the natural base.

In order to parameterize this MRP in a scalarized manner, we vary the triple (p, ν, τ ) in the
following way. First, we fix a scalar α in the unit interval [0, 1], and then we set
    p = (4γ − 1)/(3γ),   ν = 1,   and   τ = 1 − (1 − γ)^α.

Note that this sub-family of MRPs is fully parameterized by the pair (γ, α). Let us clarify why this
particular scalarization is interesting. As shown in the proof of Theorem 2 (see equation (34)), the
underlying MRP has maximal standard deviation scaling as
    ‖σ(θ∗)‖∞ ∼ ( 1/(1 − γ) )^{0.5−α}.

Consequently, by the bound (10) from Theorem 3, for a fixed sample size N, the MoM estimator
should have ℓ∞-error scaling as ( 1/(1 − γ) )^{1.5−α}. As we discuss in Appendix B, the same prediction
also holds for the plug-in estimator, assuming that N ≳ 1/(1 − γ).
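The following sketch (ours, not from the paper) checks this predicted exponent numerically by computing σ(θ∗) in closed form for the two-state MRP R0(p, ν, τ) over a grid of discount factors; the particular γ-grid and the regression via np.polyfit are illustrative choices.

```python
import numpy as np

def hard_mrp(gamma, alpha):
    """Two-state MRP R0(p, nu, tau) with p = (4*gamma - 1)/(3*gamma),
    nu = 1, and tau = 1 - (1 - gamma)**alpha, as in the scalarization above."""
    p = (4 * gamma - 1) / (3 * gamma)
    nu, tau = 1.0, 1.0 - (1.0 - gamma) ** alpha
    P = np.array([[p, 1.0 - p],
                  [0.0, 1.0]])            # state 2 is absorbing
    r = np.array([nu, nu * tau])
    return P, r

def max_std_of_value(P, r, gamma):
    """Compute ||sigma(theta*)||_inf exactly from (P, r, gamma)."""
    theta = np.linalg.solve(np.eye(len(r)) - gamma * P, r)
    variance = P @ theta ** 2 - (P @ theta) ** 2   # Var of theta(X'), X' ~ P(.|x)
    return np.sqrt(np.max(variance))

alpha = 0.5
gammas = 1.0 - np.logspace(-2, -4, 10)             # gamma from 0.99 to 0.9999
sigmas = [max_std_of_value(*hard_mrp(g, alpha), g) for g in gammas]
slope, _ = np.polyfit(np.log(1.0 / (1.0 - gammas)), np.log(sigmas), 1)
print(f"empirical exponent {slope:.3f}, predicted {0.5 - alpha:.3f}")
```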
In order to test this prediction, we fixed the parameter α ∈ [0, 1], and generated a range of MRPs
with different values of the discount factor γ. For each such MRP, we drew N = 10⁴ samples from
the generative observation model and computed both the plug-in and median-of-means estimators,
where the latter estimator was run with the choice K = 20. While the plug-in estimator has a simple

closed-form expression, the MoM estimator was obtained by running the median-of-means Bellman
operator T̂_N^MoM iteratively until it converged to its fixed point; we declared that convergence had
occurred when the ℓ∞-norm of the difference between successive iterates fell below 10⁻⁸.
In panel (b) of Figure 2, we plot the ℓ∞-error of both the plug-in approach as well as the
median-of-means estimator, as a function of γ. The plot shows the behavior for three distinct
values α = {0, 0.5, 1}. Each point on each curve is obtained by averaging 1000 Monte Carlo trials
of the experiment. Note that on this log-log plot, we see a linear relationship between the log
ℓ∞-error and log discount complexity, with the slopes depending on the value of α. More precisely,
from our calculations above, our theory predicts that the log ℓ∞-error should be related to the log
complexity log(1/(1 − γ)) in a linear fashion with slope

    β∗ = 1.5 − α.

Consequently, for both the plug-in and MoM estimators, we performed a linear regression to
estimate these slopes, denoted by β̂_plug and β̂_MoM respectively. The plot legend reports the triple
(β̂_plug, β̂_MoM, β∗), and for each we see good agreement between the theoretical prediction β∗ and its
empirical counterparts.

4.2 When does the MoM estimator perform better than plug-in?
Our theoretical results predict that the MoM estimator should outperform the plug-in approach
when the span semi-norm of the value function kθ∗ kspan is much larger than its maximum standard
deviation kσ(θ∗ )k∞ . Indeed, Proposition 1 in Appendix A demonstrates that there are MRPs on
which the ℓ∞-error of the plug-in estimator grows with the span semi-norm of the optimal value
function. Let us now simulate the behavior of both the plug-in and MoM approach on this MRP,
constructed by taking D/3 copies of the 3-state MRP in Figure 3(a).
Our simulation is carried out on N = 10⁴ samples from this D-state MRP, with noiseless
observations of the reward. In order to parameterize the MRP via the discount factor alone, we fix
the pair (q, D) in the following way. First, we fix a scalar α in the unit interval [0, 1], and then set

    D = 3 ⌊( 1/(1 − γ) )^α⌋   and   q = 10/(ND).

Note that this sub-family of MRPs is fully parameterized by the pair (γ, α). The construction also
ensures that
    ‖θ∗‖∞ / N ≫ ‖σ(θ∗)‖∞ / √N,   (12)
for this chosen parameterization, and furthermore, that the ratio of the LHS and RHS of inequal-
ity (12) increases as the dimension D increases (see the proof of Proposition 1).
As shown in Proposition 1 in Appendix A, the ℓ∞-error of the plug-in estimator for this family
of MRPs can be lower bounded by ‖θ∗‖∞/N. It is also straightforward to show that the error of the
MoM estimator is upper bounded by the quantity ‖σ(θ∗)‖∞/√N. Now increasing the value of α increases
the dimension D, and so the MoM estimator should behave better and better for larger values of
α. In particular, this behavior can be captured in the log-log plot of the error against 1/(1 − γ),
which is presented in Figure 3(b).

[Figure 3(a) depicts the three-state MRP R1(q, µ), with rewards r = 0, r = µ/2, and r = µ, and with transitions governed by the parameter q. Figure 3(b) plots the log ℓ∞-error against the complexity log(1/(1 − γ)); its legend reports the slope pairs (β̂_plug, β̂_MoM): α = 0.50: (0.981, 0.540); α = 0.75: (0.955, 0.301); α = 1.00: (0.912, 0.053).]

Figure 3. (a) Illustration of the MRP R1(q, µ). In the simulation as well as in the lower bound
construction of Proposition 1, we concatenate D/3 such MRPs to produce an MRP on D states. For
the simulation, we choose D = 3⌊( 1/(1 − γ) )^α⌋ and set q = 10/(ND) and µ = 1. (b) Log-log plot of the
ℓ∞-error versus the discount complexity parameter 1/(1 − γ) for both the plug-in estimator (in +
markers) and median-of-means estimator (in • markers) averaged over T = 1000 trials with N = 10⁴
samples each. We have also plotted the least-squares fits through these points, and the slopes of these
lines are provided in the legend. In particular, the legend contains the tuple of slopes (β̂_plug, β̂_MoM)
for each value of α. Logarithms are to the natural base.

The plot shows the behavior for three distinct values α = {0.5, 0.75, 1}. Each point on each
curve is obtained by averaging 1000 Monte Carlo trials of the experiment. As expected, the MoM
estimator consistently outperforms the plug-in estimator for each value of α. Moreover, on this
log-log plot, we see a linear relationship between the log ℓ∞-error and log discount complexity, with
the slopes depending on the value of α. For both the plug-in and MoM estimators, we performed a
linear regression to estimate these slopes, denoted by β̂_plug and β̂_MoM respectively. The plot legend
reports the pair (β̂_plug, β̂_MoM), and we see that the gap between the slopes increases as α increases.

5 Proofs
We now turn to the proofs of our main results. Throughout our proofs, the reader should recall
that the values of absolute constants may change from line-to-line. We also use the following facts
repeatedly. First, for a row stochastic matrix M with non-negative entries and any scalar γ ∈ [0, 1),
we have the infinite series

    (I − γM)^{−1} = Σ_{t=0}^{∞} (γM)^t,   (13a)

which implies that the entries of (I − γM)^{−1} are all non-negative. Second, for any such matrix, we
also have the bound ‖(I − γM)^{−1}‖_{1,∞} ≤ 1/(1 − γ). Finally, for any matrix A with non-negative entries and
a vector v of compatible dimension, we have the elementwise inequality

    |Av| ⪯ A|v|.   (13b)

5.1 Proof of Theorem 1, part (a)


Throughout this proof, we adopt the shorthand θ̂ ≡ θ̂_plug for notational convenience.
By the Bellman equations (1) and (4) for θ∗ and θ̂, respectively, we have

    θ̂ − θ∗ = γ{ P̂θ̂ − Pθ∗ } + (r̂ − r) = γP̂(θ̂ − θ∗) + γ(P̂ − P)θ∗ + (r̂ − r).

Introducing the shorthand Δ̂ := θ̂ − θ∗ and re-arranging implies the relation

    Δ̂ = γ(I − γP̂)^{−1}(P̂ − P)θ∗ + (I − γP̂)^{−1}(r̂ − r),   (14)

and consequently, the elementwise inequality

    |Δ̂| ⪯ γ(I − γP̂)^{−1}|(P̂ − P)θ∗| + (I − γP̂)^{−1}|r̂ − r|,   (15)

where we have used the relation (13b) with the matrix A = (I − γP̂)^{−1}. Given the sub-Gaussian
condition on the stochastic rewards, we can apply Hoeffding's inequality combined with the union
bound to obtain the elementwise inequality |r̂ − r| ⪯ c √(log(8D/δ)/N) · ρ(r), which holds with probability
at least 1 − δ/4. Since the matrix (I − γP̂)^{−1} has non-negative entries and (1, ∞)-norm at most 1/(1 − γ),
we have

    (I − γP̂)^{−1} |r̂ − r| ⪯ (c/(1 − γ)) √(log(8D/δ)/N) ‖ρ(r)‖∞ 1   (16a)

with the same probability. On the other hand, by Bernstein's inequality, we have

    |(P̂ − P)θ∗| ⪯ c { √(log(8D/δ)/N) · σ(θ∗) + ‖θ∗‖_span (log(8D/δ)/N) · 1 }

with probability at least 1 − δ/4, and hence

    (I − γP̂)^{−1} |(P̂ − P)θ∗| ⪯ c { √(log(8D/δ)/N) · ‖(I − γP̂)^{−1} σ(θ∗)‖∞ + (‖θ∗‖_span/(1 − γ)) (log(8D/δ)/N) } · 1.   (16b)

Substituting the bounds (16a) and (16b) into the elementwise inequality (15), we find that

    |Δ̂| ⪯ c { √(log(8D/δ)/N) ( γ‖(I − γP̂)^{−1} σ(θ∗)‖∞ + ‖ρ(r)‖∞/(1 − γ) ) + (γ‖θ∗‖_span/(1 − γ)) (log(8D/δ)/N) } · 1   (17)

with probability at least 1 − δ/2.


Our next step is to relate the pair of population quantities (σ(θ∗), ‖θ∗‖_span) to their empirical
analogues (σ̂(θ̂), ‖θ̂‖_span). The following lemma provides such a bound.

Lemma 1 (Population to empirical variance). We have the element-wise inequality
    σ(θ∗) ⪯ 2σ̂(θ̂) + 2|Δ̂| + c′ ‖θ∗‖_span √(log(8D/δ)/N) · 1   (18)
with probability at least 1 − δ/2.

Taking this lemma as given for the moment, let us complete the proof.

Since the matrix (I − γP̂)^{−1} has non-negative entries, we can multiply both sides of the elementwise
inequality (18) by it; doing so and taking the ℓ∞-norm yields

    ‖(I − γP̂)^{−1} σ(θ∗)‖∞ ≤ 2‖(I − γP̂)^{−1} σ̂(θ̂)‖∞ + 2‖Δ̂‖∞/(1 − γ) + (c′/(1 − γ)) ‖θ∗‖_span √(log(8D/δ)/N).

Substituting back into the elementwise inequality (17) and taking ℓ∞-norms of both sides, we find
that

    ‖Δ̂‖∞ ≤ c { √(log(8D/δ)/N) ( γ‖(I − γP̂)^{−1} σ̂(θ̂)‖∞ + ‖ρ(r)‖∞/(1 − γ) ) + (γ‖θ∗‖_span/(1 − γ)) (log(8D/δ)/N) } + (2cγ/(1 − γ)) √(log(8D/δ)/N) · ‖Δ̂‖∞.

Since the span semi-norm satisfies the triangle inequality, we have

    ‖θ∗‖_span ≤ ‖θ̂‖_span + ‖Δ̂‖_span ≤ ‖θ̂‖_span + 2‖Δ̂‖∞.

Substituting this bound and re-arranging yields


    κ‖θ̂ − θ∗‖∞ ≤ c { √(log(8D/δ)/N) ( γ‖(I − γP̂)^{−1} σ̂(θ̂)‖∞ + ‖ρ(r)‖∞/(1 − γ) ) + (γ‖θ̂‖_span/(1 − γ)) (log(8D/δ)/N) },

where we have introduced the shorthand κ := 1 − (2cγ/(1 − γ)) ( √(log(8D/δ)/N) + log(8D/δ)/N ). Finally, by choosing
the pre-factor c1 in the lower bound N ≥ c1 γ² log(8D/δ)/(1 − γ)² large enough, we can ensure that κ ≥ 1/2,
thereby completing the proof of Theorem 1(a).

5.1.1 Proof of Lemma 1


We now turn to the proof of the auxiliary result in Lemma 1. We begin by noting that the statement
is trivially true when N ≤ log(8D/δ), since we have

    σ(θ∗) ⪯ ‖θ∗‖_span · 1.

Thus, by adjusting the constant factors in the statement of the lemma, it suffices to prove the lemma
under the assumption N ≥ c log(8D/δ) for a sufficiently large absolute constant c. Accordingly, we
make this assumption for the rest of the proof.

We use the following convenient notation for expectations. Let E denote the vector expectation
operator, with the convention that E[v] = Pv. Similarly, let Ê denote the vector empirical expec-
tation operator, given by Ê[v] = P̂v. These operators are applied elementwise by definition, and
we let E_i and Ê_i denote the i-th entry of each operator, respectively.
With this notation, we have

    σ²(θ∗) = E[ |θ∗ − E[θ∗]|² ]
           = (E − Ê)[ |θ∗ − E[θ∗]|² ] + Ê[ |θ∗ − E[θ∗]|² ]
           ⪯ (E − Ê)[ |θ∗ − E[θ∗]|² ] + 2|Ê[θ∗] − E[θ∗]|² + 2Ê[ |θ∗ − Ê[θ∗]|² ]
           = (E − Ê)[ |θ∗ − E[θ∗]|² ] + 2|Ê[θ∗] − E[θ∗]|² + 2σ̂²(θ∗),   (19)

where we write T1 := (E − Ê)[ |θ∗ − E[θ∗]|² ] for the first term on the right-hand side and T2 := 2|Ê[θ∗] − E[θ∗]|² for the second.

We claim that the terms T1 and T2 are bounded as follows:


    T1 ⪯ σ²(θ∗)/4 + c‖θ∗‖²_span (log(8D/δ)/N) · 1,   and   (20a)

    T2 ⪯ c { (log(8D/δ)/N) · σ²(θ∗) + ( ‖θ∗‖_span log(8D/δ)/N )² · 1 },   (20b)

where each bound holds with probability at least 1 − δ/4. Taking these bounds as given for the
moment, as long as N ≥ c′ log(8D/δ) for a sufficiently large constant c′, we can ensure that

    T1 + T2 ⪯ σ²(θ∗)/2 + c‖θ∗‖²_span (log(8D/δ)/N) · 1.

Substituting back into our earlier bound (19), we find that

    σ²(θ∗)/2 ⪯ 2σ̂²(θ∗) + c′‖θ∗‖²_span (log(8D/δ)/N) · 1.

Rearranging and taking square roots entry-wise, we find that

    σ(θ∗) ⪯ √( 4σ̂²(θ∗) + 2c′‖θ∗‖²_span (log(8D/δ)/N) · 1 ) ⪯ 2σ̂(θ∗) + c′‖θ∗‖_span √(log(8D/δ)/N) · 1.

Finally, noting that we have the entry-wise inequality σ̂(θ∗) ⪯ σ̂(θ̂) + |θ̂ − θ∗| establishes the claim
of Lemma 1.
It remains to prove the bounds (20a) and (20b).
Proof of bound (20a): For each index i ∈ [D], define the random variable Y_i := (θ∗_J − E_i[θ∗])²,
where J is an index chosen at random from the distribution p_i. By definition, each random variable
Y_i is non-negative, and so with E now denoting the regular expectation of a scalar random variable,
we have the lower tail bound (Proposition 2.14, [Wai19a])

    P[ E[Y_i] − Y_i ≥ s ] ≤ exp( − Ns² / (2E[Y_i²]) )   for all s > 0.

Moreover, we have Y_i ≤ ‖θ∗‖²_span almost surely, from which we obtain

    E[Y_i²] ≤ ‖θ∗‖²_span E_i[ (θ∗ − E[θ∗])² ] = ‖θ∗‖²_span σ_i²(θ∗).

Putting together the pieces yields the elementwise inequality

    T1 ⪯ c‖θ∗‖_span √(log(8D/δ)/N) · σ(θ∗)  ⪯(i)  σ²(θ∗)/8 + c′‖θ∗‖²_span (log(8D/δ)/N),
with probability at least 1 − δ/3, where in step (i), we have used the inequality 2ab ≤ νa2 + ν −1 b2 ,
which holds for any triple of positive scalars (a, b, ν).

Proof of the bound (20b): From Bernstein's inequality, we have the element-wise bound

    |Ê[θ∗] − E[θ∗]| ⪯ c { √(log(8D/δ)/N) · σ(θ∗) + ‖θ∗‖_span (log(8D/δ)/N) · 1 }

with probability at least 1 − δ/4, and hence

    T2 ⪯ c { (log(8D/δ)/N) · σ²(θ∗) + ( ‖θ∗‖_span log(8D/δ)/N )² · 1 },

as claimed.

5.2 Proof of Theorem 1, part (b)


Once again, we employ the shorthand θ̂ ≡ θ̂_plug for notational convenience, and also the short-
hand Δ̂ = θ̂ − θ∗. Note that it suffices to show the inequality

    P[ ‖θ̂ − θ∗‖∞ ≥ cγ ‖ (I − γP)^{−1} |(P̂ − P)θ∗| ‖∞ + c(1 − γ)^{−1} ‖r̂ − r‖∞ ] ≤ δ/2,   (21)
from which the theorem follows by application of a Bernstein bound to the first term and Hoeffding
bound to the second, in a similar fashion to the inequalities (16). We therefore dedicate the rest of
the proof to establishing inequality (21).

5.2.1 Proving the bound (21)


We have

    Δ̂ = θ̂ − θ∗ = γP̂θ̂ − γPθ∗ + (r̂ − r) = γ(P̂ − P)θ̂ + γPΔ̂ + (r̂ − r),

which implies that

    Δ̂ − (I − γP)^{−1}(r̂ − r) = γ(I − γP)^{−1}(P̂ − P)θ̂ = γ(I − γP)^{−1}(P̂ − P)Δ̂ + γ(I − γP)^{−1}(P̂ − P)θ∗.   (22)

Since all entries of (I − γP)^{−1} are non-negative, we have the element-wise inequalities

    |Δ̂| ⪯ γ(I − γP)^{−1}|(P̂ − P)Δ̂| + γ(I − γP)^{−1}|(P̂ − P)θ∗| + (I − γP)^{−1}|r̂ − r|
        ⪯ γ(I − γP)^{−1}|(P̂ − P)Δ̂| + γ(I − γP)^{−1}|(P̂ − P)θ∗| + (1/(1 − γ)) ‖r̂ − r‖∞ · 1.   (23)

The second and third terms are already in terms of the desired population-level functionals in
equation (21). It remains to bound the first term.
Note that the key difficulty here is the fact that the matrix P̂ − P and the error vector Δ̂ are not
independent. As a first attempt to address this dependence, one is tempted to use the fact that
provided N is large enough, each row of P̂ − P has small ℓ1-norm; for instance, see Weissman et
al. [WOS+03] for sharp bounds of this type. In particular, this would allow us to work with the
entry-wise bounds

    |(P̂ − P)Δ̂| ⪯ ‖P̂ − P‖_{1,∞} ‖Δ̂‖∞ · 1 ≲ C √(D/N) ‖Δ̂‖∞ · 1,

where the final relation hides logarithmic factors in the pair (D, δ). Proceeding in this fashion,
we would then bound each entry in the first term of equation (23) by γ(1 − γ)^{−1} √(D/N) ‖Δ̂‖∞; then
choosing N large enough such that γ(1 − γ)^{−1} √(D/N) ≤ 1/2 suffices to establish bound (21). However,
this requires a sample size N ≳ γ² D/(1 − γ)², while we wish to obtain the bound (21) with the sample
size N ≳ γ²/(1 − γ)². This requires a more delicate analysis.
Our analysis instead proceeds entry-by-entry, and uses a leave-one-out sequence to carefully
decouple the dependence between P̂ − P and Δ̂. Let us introduce some notation to make this
precise. For each i ∈ [D], recall that we used p̂_i and p_i to denote row i of the matrices P̂ and P,
respectively. Let P̂^(i) denote the i-th leave-one-out transition matrix, which is identical to P̂ except
with row i replaced by the population vector p_i. Let θ̂^(i) := (I − γP̂^(i))^{−1} r be the value function
estimate based on P̂^(i) and the true reward vector r, and denote the associated difference vector
by Δ̂^(i) := θ̂^(i) − θ∗.
Now note that we have

    [ (P̂ − P)Δ̂ ]_i = ⟨p̂_i − p_i, Δ̂⟩ = ⟨p̂_i − p_i, Δ̂^(i)⟩ + ⟨p̂_i − p_i, θ̂ − θ̂^(i)⟩.

This decomposition is helpful because, now, the vectors pbi − pi and ∆ b (i) are independent by con-
struction, so that standard tail bounds can be used on the first term. For the second term, we
use the fact that θb ≈ θb(i) , since the latter is obtained by replacing just one row of the estimated
transition matrix. Formally, this closeness will be argued by using the matrix inversion formula.
We collect these two results in the following lemma.
Lemma 2. Suppose that the sample size is lower bounded as N ≥ c0 γ 2 log(8D/δ)
(1−γ)2
. Then with proba-
δ
bility at least 1 − 2D and for each i ∈ [D], we have
( r )
log(8D/δ)
pi − pi , ∆
γ|hb b (i) i| ≤ c γk∆k
b ∞ pi − pi , θ∗ i| + kr − rbk∞
+ γ |hb and (24a)
N
( r )
log(8D/δ)
pi − pi , θb(i) − θi|
γ|hb b ≤ c γk∆k b ∞ pi − pi , θ∗ i| + kr − rbk∞ .
+ γ |hb (24b)
N
With this lemma in hand, let us complete the proof. Combining the bounds of Lemma 2 with
a union bound over all D entries yields the elementwise inequality
( r )


log(8D/δ)
γ (P − P)∆b  cγ (P
b − P)θ + c γk∆k + kb
r − rk∞ 1
b b ∞
N

20
with probability at least 1 − δ/2. Since the entries of (I − γP)−1 are non-negative, we can multiply
both sides of this inequality by it, thereby obtaining
( r )
c log(8D/δ)
γ(I − γP)−1 (P − P)∆ b − P)θ∗ +
b  cγ(I − γP)−1 (P γk∆k + kb
r − rk∞ 1.
b b ∞
1−γ N

Returning to the upper bound (23), we have shown that


r
k∆k
b ∞ log(8D/δ) c
k∆k∞ ≤ cγ + c0 γ (I − γP)−1 |(P
b − P)θ∗ |
+ kr − rbk∞ .
b
1−γ N ∞ 1−γ

Under the assumed lower bound on the sample size N ≥ c0 γ 2 log(8D/δ)


(1−γ)2
, this inequality implies that

b ∞ ≤ c0 γ −1 b

∗ c
k∆k (I − γP) |( P − P)θ | + kr − rbk∞ ,
1−γ

as claimed (21).
We now proceed to a proof of Lemma 2, which uses the following structural lemma relating the
b (i) and ∆.
quantities ∆ b

Lemma 3. Suppose that the sample size is lower bounded as N ≥ c0 γ 2 log(8D/δ)


(1−γ)2
. Then with proba-
δ
bility at least 1 − 4D and for each i ∈ [D], we have
c n o
b (i) k∞ ≤ ck∆k
k∆ b ∞+ pi − pi , θ∗ i| + kb
γ |hb r − rk∞ . (25)
1−γ

This lemma is proved in Section 5.2.3 to follow.

5.2.2 Proof of Lemma 2


We prove the two bounds in turn.

Proof of inequality (24a): Note that pbi − pi and ∆ b (i) are independent by construction, so that
the Hoeffding inequality yields
r
|hb
pi − pi , ∆ b (i) k∞ log(8D/δ)
b (i) i| ≤ ck∆ (26)
N
with probability at least 1 − δ/(4D).
Using this in conjunction with inequality (25) from Lemma 3 yields the bound
r r
log(8D/δ) cγ log(8D/δ) n o
γ|hb b (i) i| ≤ cγk∆k
pi − pi , ∆ b ∞ + pi − pi , θ∗ i| + kb
γ |hb r − rk∞
N 1−γ N
r
(i)
b ∞ log(8D/δ) + cγ |hb
≤ cγk∆k pi − pi , θ∗ i| + ckb
r − rk∞ ,
N
γ 2
where in step (i), we have used the lower bound on the sample size N ≥ c0 (1−γ)2 log(8D/δ).

21
Proof of inequality (24b): The proof of this claim is more involved. Using the relation (22)
(with suitable modifications of terms), we have
b −1 (P
θb(i) − θb = γ(I − γ P) b (i) − P) b −1 (r − rb)
b θb(i) + (I − γ P)
 
b −1 ei hb
= −γ(I − γ P) b −1 (r − rb).
pi − pi , θb(i) i + (I − γ P) (27)

Moreover, the Woodbury matrix identity [HJ85] yields


 −1  −1 b (i) )−1 ei (b
(I − γ P pi − pi )T (I − γ P b (i) )−1
M : = I − γP
b b (i)
− I − γP = −γ .
1 − γ(b pi − pi )T (I − γ P b (i) )−1 ei

Consequently,
 
pi − pi , θb(i) − θi
hb b = −γ(b pi − pi )> (I − γ P)
b −1 ei hb pi − pi , θb(i) i + (b pi − pi )> (I − γ P)
b −1 (r − rb)
 
= −γ(bpi − pi )> (I − γ P
b (i) )−1 ei hb pi − pi , θb(i) i + (bpi − pi )> (I − γ P
b (i) )−1 (r − rb)
 
− γ(bpi − pi )> Mei hb pi − pi , θb(i) i + (b pi − pi )> M(r − rb)
  2Z 2 − Z 1 − 2Zi
i
= hbpi − pi , θb(i) i · i
+ Ti · , (28)
1 − Zi 1 − Zi
where we have defined, for convenience, the random variables

pi − pi )> (I − γ P
Zi : = γ(b b (i) )−1 ei and pi − pi )> (I − γ P
Ti : = (b b (i) )−1 (r − rb).

b (i) )−1 (r − rb), applying the Hoeffding bound yields


Since pbi − pi is independent of the vector (I − γ P
the inequality
r
c log(8D/δ)
|Ti | ≤ kr − rbk∞
1−γ N

with probability exceeding 1 − δ/(4D).


b (i) )−1 ei and
On the other hand, exploiting independence between the vectors pbi − pi and (I − γ P
applying the Hoeffding bound, we also have
r
cγ log(8D/δ)
|Zi | ≤
1−γ N
γ 2
with probability least 1 − δ/(4D). Taking N ≥ c0 (1−γ) 2 log(8D/δ) for a sufficiently large constant
0
c ensures that γ|Ti | ≤ kr − rbk∞ and |Zi | ≤ 1/4, so that with probability exceeding 1 − δ/(2D),
inequality (28) yields
n o
pi − pi , θb(i) − θi|
γ|hb b ≤ c γ|hb pi − pi , θb(i) i| + kr − rbk∞
n o
≤ c γ|hb
pi − pi , ∆b (i) i| + γ|hbpi − pi , θ∗ i| + kr − rbk∞ .

Finally, applying part (a) of Lemma 2 completes the proof.

22
5.2.3 Proof of Lemma 3
b (i) , and the explicit bound (26). We have
Recall our leave-one-out matrix P
r

∗ log(8D/δ)
hb (i)
pi − pi , θ i ≤ hb (i)
pi − pi , ∆ i + |hb (i)
pi − pi , θ i| ≤ ck∆ k∞ pi − pi , θ∗ i| (29)
+ |hb
b b b
N
with probability at least 1 − δ/(4D). Substituting inequality (29) into the bound (27), we find that
( r )
c log(8D/δ)
kθb(i) − θk
b∞≤ b (i) k∞ ·
γk∆ pi − pi , θ∗ i| + kr − rbk∞ .
+ γ |hb (30)
1−γ N

Finally, the triangle inequality yields

b (i) k∞ ≤ k∆k
k∆ b ∞ + kθb(i) − θk
b∞
( r )
c log(8D/δ)
≤ k∆k
b ∞+ b (i) k∞ ·
γk∆ pi − pi , θ∗ i| + kr − rbk∞ .
+ γ |hb
1−γ N

For N ≥ c0 γ 2 log(8D/δ)
(1−γ)2
with c0 sufficiently large, we have

c n o
b (i) k∞ ≤ ck∆k
k∆ b ∞+ pi − pi , θ∗ i| + kb
γ |hb r − rk∞
1−γ
δ
with probability at least 1 − 4D , which completes the proof of Lemma 3.
Since Corollary 1 follows from Theorem 1, we prove it first before moving to a proof of Theorem 2
in Section 5.4.

5.3 Proof of Corollary 1


b −1 k1,∞ ≤
In order to prove part (a), consider inequality (14) and further use the fact that k(I − γ P) 1
1−γ
to obtain the element-wise bound

|θb − θ∗ | 
γ b − P)θ∗ k∞ 1 + kb
k(P
r − rk∞
· 1.
1−γ 1−γ

Applying Bernstein’s bound to the first term and Hoeffding’s bound to the second completes the
proof.
In order to prove part (b) of the corollary, we apply Lemma 7 of Azar et al. [AMK13]—in
particular, equation (17) of that paper. Tailored to this setting, their result leads to the point-wise
bound
rmax
k(I − γP)−1 σ(θ∗ )k∞ ≤ c .
(1 − γ)3/2

We also have the bound


2rmax
kθ∗ kspan ≤ 2kθ∗ k∞ = 2k(I − γP)−1 rk∞ ≤ ,
1−γ

23
so that combining the pieces and applying Theorem 1(b), we obtain
(r   )
c log(8D/δ) rmax log(8D/δ) rmax
kθb − θ∗ k∞ ≤ γ + kρ(r)k∞ + γ · .
(1 − γ) N (1 − γ)1/2 N 1−γ

Finally, when N ≥ c1 log(8D/δ)


1−γ for a sufficiently large constant c1 , we have
r
log(8D/δ) rmax log(8D/δ) rmax
≤c ,
N 1−γ N (1 − γ)1/2

thereby establishing the claim.

5.4 Proof of Theorem 2


For all of our lower bounds, we assume that the reward distribution takes the Gaussian form

Dr ( · | j) = N (rj , %2 ) (31)

for each state j. Note that this reward distribution satisfies kρ(r)k∞ = % by construction.
Let us begin with a short overview of our proof, which proceeds in two steps. First, we suppose
that the transition matrix P is known exactly, and the hardness of the estimation problem is due
to noisy observations of the reward function. In particular, letting MI (rmax , %) denote the class of
all MRPs with the specific reward observation model (31), and for which the transition matrix is
the identity matrix I and the rewards are uniformly bounded as krk∞ ≤ rmax , we show that
( r )
∗ % log(D) rmax
inf sup Ekθb − θ k∞ ≥ c · ∧ . (32)
θb θ∗ ∈MI (rmax ,%) 1−γ N 1−γ

Note that for each pair of positive scalars (ϑ, rmax ) we have the inclusions

MI (rmax , %) ⊆ Mvar (ϑ, %) and MI (rmax , %) ⊆ Mrew (rmax , %),

and so that the lower bound (32) carries over to the classes Mvar (ϑ, %) and Mrew (rmax , %).
Next, we suppose that the population reward function r is known exactly (% = 0), and the
hardness of the estimation problem is only due to uncertainty in the transitions. Under this setting,
we prove the lower bounds
r
ϑ log(D/2)
inf sup Ekθb − θ∗ k∞ ≥ c · , and (33a)
θb θ∗ ∈Mvar (ϑ,0) 1−γ N
r
rmax log(D/2)
inf sup Ekθb − θ∗ k∞ ≥ c 3/2
· . (33b)
θb θ∗ ∈Mrew (rmax ,0) (1 − γ) N

Since Mvar (ϑ, 0) ⊂ Mvar (ϑ, %) for any % > 0, these lower bounds also carry over to the more general
setting. The minimax lower bounds of Theorem 2 are obtained by taking the maximum of the
bounds (32) and (33). Let us now establish the two previously claimed bounds.

24
5.4.1 Proof of claim (32)
For some positive scalar α to be chosen shortly, consider D distinct reward vectors {r(1) , . . . , r(D) },
where the vector r(i) ∈ RD has entries
(
(i) α if i = j
rj : = for all j ∈ [D].
0 otherwise,

Denote by R(i) the MRP with reward function r(i) ; and transition matrix I. Thus, the i-th value
function is given by the vector (θ∗ )(i) : = 1−γ
1
r(i) .
By construction, we have k(θ∗ )(i) − (θ∗ )(j) k∞ = α/(1 − γ) for each pair of distinct indices (i, j).
Furthermore, the KL divergence between Gaussians of variance %2 centered at r(i) and r(j) is given
by
  kr(i) − r(j) k2 2α2
DKL N (r(i) , %2 I) k N (r(j) , %2 I) = 2
= .
%2 %2

Thus, applying the local packing version of Fano’s method (§15.3.3, [Wai19a]), we have
2
2 α%2 N + log 2
!
∗ α
inf sup Ekθb − θ k∞ ≥c 1− .
θb θ∗ ∈{M(i) }i∈[D] 1−γ log D
q
Setting α = % log D
6N ∧ rmax yields the claimed lower bound.

5.4.2 Proof of claim (33)


This lower bound is based on a modification of constructions used by Lattimore and Hutter [LH14]
and Azar et al. [AMK13]. Our proof, however, is tailored to the generative observation model.
Our proof is structured as follows. First, we construct a family of “hard” MRPs and prove a
minimax lower bound as a function of parameters used to define this family. Constructing this
family of hard instances requires us to first define a basic building block: a two-state MRP that
was illustrated in Figure 2(a). After obtaining this general lower bound, we then set the scalars
that parameterize the hard class MRP appropriately to obtain the two claimed bounds.
We now describe the two-state MRP in more detail. For a pair of parameters (p, τ ), each in the
unit interval [0, 1], and a positive scalar ν, consider the two-state Markov reward process R0 (p, ν, τ ),
with transition matrix and reward vector given by
   
p 1−p ν
P0 = and r0 = ,
0 1 ν·τ

respectively. See Figure 2 for an illustration of this MRP.


A straightforward calculation yields that it has value function and corresponding standard
deviation vector given by
" # " √ #
1−γ+γτ (1−p) (1−τ ) p(1−p)

θ (p, ν, τ ) = ν (1−γp)(1−γ)
τ and σ(θ∗ ) = ν 1−γp , (34)
1−γ 0

25
respectively, where we have used the shorthand θ∗ ≡ θ∗ (p, ν, τ ). We also have kθ∗ kspan = ν(1−τ )
1−γp ;
the two scalars (ν, τ ) allow us to control the quantities kσ(θ∗ )k∞ and kθ∗ kspan . Index the states of
this MRP by the set {0, 1}, and consider now a sample drawn from this MRP under the generative
model. We see a pair of states drawn according to the respective rows of the transition matrix
P0 ; the first state is drawn according to the Bernoulli distribution Ber(p), and the second state is
deterministic and equal to 1. For convenience, we use P(p) = (Ber(p), 1) to denote the distribution
of this pair of states.
Our hard class of instances is based in part on the difficulty of distinguishing two such MRPs that
are close in a specific sense. Let us make this intuition precise. For two scalar values 0 ≤ p2 ≤ p1 ≤ 1,
some algebra yields the relation
(p1 − p2 )(1 − τ )
kθ∗ (p1 , ν, τ ) − θ∗ (p2 , ν, τ )k∞ = ν · . (35)
(1 − γp1 )(1 − γp2 )
In the sequel, we work with the choices
r
4γ − 1 1 p1 (1 − p1 )
p1 = and p2 = p1 − log(D/2),
3γ 8 N
1 
which, under the assumed lower  bound on the sample size N , are both scalars in the range 2, 1
for all discount factors γ ∈ 12 , 1 . Moreover, it is worth noting the relations


1−γ 1−γ 1−γ


1 − p1 = , c1 ≤ 1 − p2 ≤ c2
3γ 3γ 3γ
4
1 − γp1 = (1 − γ), and c1 (1 − γ) ≤ 1 − γp2 ≤ c2 (1 − γ), (36)
3

where the inequalities on the right hold provided N ≥ 1−γ log(D/2) for a sufficiently large con-
stant c. Here the pair of constants (c1 , c2 ) are universal, depend only on c, and may change from
line to line.
We also require the following lemma, proved in Section 5.4.3 to follow, which provides a useful
bound on the KL divergence between P(p1 ) and P(p2 ).
Lemma 4. For each pair p, q ∈ [1/2, 1), we have

(p − q)2
DKL (P(p)kP(q)) ≤ .
(p ∨ q)(1 − (p ∨ q))
We are now in a position to describe the hard family of MRPs over which we prove a gen-
eral lower bound. Suppose that D is even for convenience, and consider a set of D/2 “master”
MRPs M̄ : = {R1 , . . . , RD/2 } each on D states10 constructed as follows. Decompose each master
MRP into D/2 sub-MRPs of two states each; index the k-th sub-MRP in the j-th master MRP
by Rj,k . For each pair j, k ∈ [D/2], set
(
R0 (p1 , ν, τ ) if j 6= k
Rj,k =
R0 (p2 , ν, τ ) otherwise.
10
Note that this step is only required in order to “tensorize” the construction in order to obtain the optimal
dependence on the dimension. If, instead of the `∞ error, one was interested in estimating the value function at a
fixed state of the MRP, then this tensorization is no longer needed.

26
Let θj∗ denote the value function corresponding to MRP Rj , and let PN j denote the distribution of
state transitions observed from the MRP Rj under the generative model. Also note that for each
i ∈ [D/2], we have
p
∗ (1 − τ ) p1 (1 − p1 )
kσ(θi )k∞ = ν . (37)
(1 − γp1 )

Lower bounding the minimax risk over this class: We again use the local packing form of
Fano’s method (§15.3.3, [Wai19a]) to establish a lower bound. Choose some index J uniformly at
random from the set [D/2], and suppose that we draw N i.i.d. samples Y N : = (Y1 , . . . , YN ) from
the MRP RJ under the generative model. Here each Yi ∈ X D represents a random set of D states,
and the goal of the estimator is to identify the random index J and, consequently, to estimate the
value function θJ∗ . Let us now lower bound the expected error incurred in this (D/2)-ary hypothesis
testing problem. Fano’s inequality yields the bound

I(J; Y N ) + log 2
 
∗ 1 ∗ ∗
inf sup Ekθ − θ k∞ ≥ min kθj − θk k∞ 1 −
b , (38)
θb θ∗ ∈M̄ 2 j6=k log(D/2)

where I(J; Y N ) denotes the mutual information between J and Y N .


Let us now bound the two terms that appear in inequality (38). By equation (35), we have
(p1 − p2 )(1 − τ )
kθj∗ − θk∗ k∞ = ν · for all 1 ≤ j 6= k ≤ D/2.
(1 − γp1 )(1 − γp2 )
Furthermore, since the samples Y1 , . . . , YN are i.i.d., the chain rule of mutual information yields
1
I(J; Y N ) = I(J; Y1 ) ≤ max DKL (Pj kPk )
N j6=k
(i)
= DKL (P(p1 )kP(p2 )) + DKL (P(p2 )kP(p1 ))
(ii) (p1 − p2 )2
≤2 ,
p1 (1 − p1 )
where step (i) is a consequence of the construction, which ensures that the distributions Pj and Pk
coincide on all but the j-th and k-th sub-MRPs. On the other hand, step (ii) follows from Lemma 4,
and the fact that p2 ≤ p1 .
Putting together the pieces, we now have
(p1 −p2 )2
 
ν (p1 − p2 )(1 − τ )  2N p1 (1−p1 ) + log 2
inf sup Ekθb − θ∗ k∞ ≥ · 1− .
θb θ∗ ∈M̄ 2 (1 − γp1 )(1 − γp2 ) log(D/2)
q
p1 (1−p1 )
Recall the choice p1 − p2 = 81 N log(D/2). For D ≥ 8, this ensures, for a small enough
positive constant c, the bound
p r
∗ (1 − τ ) p1 (1 − p1 ) log(D/2) 1
inf sup Ekθb − θ k∞ ≥ cν · . (39)
θb θ∗ ∈M̄ (1 − γp1 ) N 1 − γp2

With the relation (39) at hand, we now turn to proving the two sub-claims in equation (33).

27
Proof of claim (33a): Recall equation (37); for i ∈ [D/2], we have
p
(1 − τ ) p1 (1 − p1 ) (1 − τ )
kσ(θi∗ )k∞ =ν and kθi∗ kspan = ν .
(1 − γp1 ) (1 − γp1 )

Now for every pair of scalars (ϑ, ζ) satisfying ϑ = ζ 1 − γ, set τ = 1/2 and ν = 2ζ(1 − γp1 ). With
this choice of parameters, we have the inclusion Mvar (ϑ, 0) ∩ Mvfun (ζ, 0) ⊆ M̄, and evaluating the
bound (39) yields
r
∗ log(D/2) 1
inf sup Ekθb − θ k∞ ≥ cϑ
θb θ∗ ∈Mvar (ϑ,0)∩Mvfun (ζ,0) N 1 − γp2
r
(ii) log(D/2) 1
= cϑ ,
N 1−γ

where in step (ii), we have used inequality (36). The same lower bound clearly also extends to the
set Mvar (ϑ, 0) ∩ Mvfun (ζ, 0) for ζ ≥ ϑ(1 − γ)−1/2 ; this establishes part (a) of the theorem.

Proof of claim (33b): Given a value rmax , set τ = 0 and ν = rmax and note that the rewards of
all the MRPs in the set M̄ satisfy krk∞ ≤ ν. Hence, we have Mrew (rmax , 0) ⊆ M̄ for this choice of
parameters. Using inequality (39) and recalling the bounds (36) once again, we have
p r
p1 (1 − p1 ) log(D/2) 1
inf sup Ekθb − θ∗ k∞ ≥ crmax ·
θb θ∗ ∈Mrew (rmax ,0) (1 − γp1 ) N 1 − γp2
r
rmax log(D/2)
≥c 3/2
.
(1 − γ) N

5.4.3 Proof of Lemma 4

By construction, the second state of the Markov chain is absorbing, so it suffices to consider the KL
divergence between the first components of the distributions P(p) and P(q). These are Bernoulli
random variables Ber(p) and Ber(q), and the following calculation bounds their KL divergence:

p 1−p
DKL (Ber(p)kBer(q)) = p log + (1 − p) log
q 1−q
(ii) p−q q−p
≤ p· + (1 − p) ·
q 1−q
(p − q) 2
= ,
q(1 − q)

where step (ii) uses the inequality log(1 + x) ≤ x, which is valid for all x > −1. A similar inequality
holds with the roles of p and q reversed, and the denominator of the expression is lower for the
larger value p ∨ q. This completes the proof.

28
5.5 Proof of Theorem 3
Note that the median-of-means operator is applied elementwise; denote the i-th such operator
by M ci . Let Mc − P denote the elementwise difference of operator M c and the linear operator P;
its i-th component is given by the operator Mci (·) − hpi , ·i.
We require two technical lemmas in the proof. The power of the median-of-means device is
clarified by the first lemma, which is an adaptation of classical results (see, e.g., [NY83, JVV86]).

Lemma 5. Suppose that K = 8 log(4D/δ) and m = bN/Kc. Then there is a universal constant c
such that for each index i ∈ [D] and each fixed vector θ ∈ RD , we have
( r )
log(8D/δ) δ
Pr |(M ci − pi )(θ)| ≥ c σi (θ) ≤ .
N 4D

Comparing this lemma to the Bernstein bound (cf. equation (16b)), we see that we no longer
pay in the span semi-norm kθ∗ kspan , and this is what enables us to establish the solely variance-
dependent bound (10).
We also require the following lemma that guarantees that the median-of-means Bellman operator
is contractive.

Lemma 6. The median-of-means operator is 1-Lipschitz in the `∞ -norm, and satisfies

|M(θ
c 1 ) − M(θ
c 2 )| ≤ kθ1 − θ2 k∞ for all vectors θ1 , θ2 ∈ RD .

Consequently, the empirical operator TbNMoM is γ-contractive in `∞ -norm and satisfies

|TbNMoM (θ1 ) − TbNMoM (θ2 )| ≤ γkθ1 − θ2 k∞ for all pairs of value functions (θ1 , θ2 ).

See Section 5.5.1 for the proof of Lemma 6.

We are now in a position to establish the theorem, where we now use the shorthand θb ≡ θbMoM for
convenience. Note that the vectors θ∗ and θb satisfy the fixed point relations

θ∗ = r + γPθ∗ , and θb = rb + γ M(
c θ),
b

b = θb − θ∗ satisfies the relation


respectively. Taking differences, the error vector ∆

θb − θ∗ = γ(M(
c∆ b + θ∗ ) − Pθ∗ ) + rb − r
= γ(M(
c∆ b + θ∗ ) − M(θ
c ∗ )) + γ(M c − P)(θ∗ ) + (b
r − r).

Taking `∞ -norms on both sides and using the triangle inequality, we have

k∆k c ∗ + ∆)
b ∞ ≤ γkM(θ c ∗ )k∞ + γ|(M
b − M(θ c − P)(θ∗ )| + kb
r − rk∞
(i)
≤ γk∆k c − P)(θ∗ )| + kb
b ∞ + γ|(M r − rk∞ ,

where step (i) is a result of Lemma 6. Finally, applying Lemma 5 in conjunction with the Hoeffding
inequality and a union bound over all D indices completes the proof.

29
5.5.1 Proof of Lemma 6
The second claim follows directly from the first by noting that

|TbNMoM (θ1 ) − TbNMoM (θ2 )| = γ|M(θ


c 1 ) − M(θ
c 2 )|.

In order to prove the first claim, recall that for each θ ∈ RD , we have M(θ)
c = med(b
µ1 (θ), . . . , µ
bK (θ)),
where the median—defined as the bK/2c-th order statistic—is taken entry-wise. By definition, for
each i ∈ [K], we have
 

1 X
kb
µi (θ1 ) − µ
bi (θ2 )k∞ = 
m Zk  (θ1 − θ2 )

k∈Di


1 X
≤ m Zk
kθ1 − θ2 k∞
k∈Di
1,∞
(i)
= kθ1 − θ2 k∞ ,
1 P
where step (i) is a result of the fact that m k∈Di Zk is a row stochastic matrix with non-negative
entries. Finally, we have the entry-wise bound

|M(θ
c 1 ) − M(θ
c 2 )| = | med(b bK (θ1 )) − med(b
µ1 (θ1 ), . . . , µ µ1 (θ2 ), . . . , µ
bK (θ2 ))|
(ii)
 kθ1 − θ2 k∞ · 1,

where step (ii) follows from our definition of the median as the bK/2c-th order statistic, and
Lemma 7 to follow. This completes the proof of Lemma 6.

Lemma 7. For each pair of vectors (u, v) of dimension D and each index i ∈ [D], we have

|u(i) − v(i) | ≤ ku − vk∞ .

Proof. Assume without loss of generality that the entries of u are sorted in increasing order (so
that u1 ≤ u2 ≤ . . . ≤ uD ), and let w denote a vector containing the entries of v sorted in increasing
order. We then have
(i)
|u(i) − v(i) | = |ui − wi | ≤ ku − wk∞ ≤ ku − vk∞ ,

where step (i) follows from the rearrangement inequality applied to the `∞ -norm [Vin90].

6 Discussion
Our work investigates the local minimax complexity of value function estimation in Markov reward
processes. Our upper bounds are instance-dependent, and we also provide minimax lower bounds
that hold over natural subsets of the parameter space. The plug-in approach is shown to be optimal
over the class of MRPs with bounded rewards, and a variant based on the median-of-means device
achieves optimality over the class of MRPs having value functions with bounded variance.

30
Our results also leave a few interesting technical questions unresolved. Let us start with two
inter-related questions: Is Corollary 1(a) sharp, say up to a logarithmic factor in the dimension? Is
the median-of-means approach minimax-optimal over the class of MRPs having bounded rewards?
We conjecture that both of these questions can be answered in the affirmative, but there are
technical challenges to overcome. For instance, while the median-of-means device is crucial to
removing the span-norm dependence in Corollary 1(a), it leads to a non-linear update rule that
needs to be much more carefully handled in order to ensure minimax optimality.
A second set of technical questions concerns the definition of “locality” in our bounds. Are
our results also sharp under alternative local minimax parameterizations (say in terms of the
functional k(I − γP)−1 σ(θ∗ )k∞ )? Is there a more fine-grained lower bound analysis that shows the
(sub)-optimality of these approaches, and are there better adaptive procedures for this problem?
The literature on estimating functionals of discrete distributions [JVHW15] shows that additional
refinements over the plug-in approach are usually beneficial; is that also the case here? There is also
the related question of whether a minimax lower bound can be proved over a local neighborhood
of every point θ∗ . We remark that guarantees of this flavor exist in a variety of related problems in
both the asymptotic and non-asymptotic settings [vdV00, CL04, ZCDL16]. Indeed, in a follow-up
paper [KPR+ 20] with a superset of the current authors, we have shown such a local lower bound
for this problem, which is achieved via stochastic approximation coupled with a variance reduction
device.
In a complementary direction, another interesting question is to ask how function approximation
affects these bounds. Our techniques should be useful in answering some of these questions, and
also more broadly in proving analogous guarantees in the more challenging policy optimization
setting.
Finally, there is the question of removing our assumption on the generative model: How does
the plug-in estimator behave when it is computed on a sampled trajectory of the system? A classical
solution is the blocking method of simulating the generative model from such samples [Yu94]: given
a sampled trajectory, chop it into pieces of length (roughly) equal to the mixing time of the Markov
chain, and to treat the respective first sample from each of these pieces as (approximately) inde-
pendent. But clearly, this approach is somewhat wasteful, and there have been recent refinements
in related problems when the mixing time can become arbitrarily large [SMT+ 18]. It would be
interesting to explore these approaches and derive instance-dependent guarantees in the L2µ -norm,
where µ is the stationary distribution of the Markov chain.

Acknowledgements
AP was supported in part by a research fellowship from the Simons Institute for the Theory of
Computing. MJW and AP were partially supported by National Science Foundation Grant DMS-
1612948 and Office of Naval Research Grant ONR-N00014-18-1-2640.

A Dependence of plug-in error on span semi-norm


In this section, we state and prove a proposition that provides a family of MRPs in which the
`∞ -error of the plug-in estimator can be completely characterized by the span semi-norm of the
optimal value function.

31
Proposition 1. Suppose that the rewards are observed noiselessly, with ρ(r) = 0. There is a pair
of universal positive constants (c1 , c2 ) such that for any triple of positive scalars (ζ, N, D), there is
a D-state MRP for which
kσ(θ∗ )k∞ 3 ζ
kθ∗ k∞ = ζ and ≤√ · , (40)
N D N
and for which the error of the plug-in estimator satisfies

ζ (a)h i (b) ζ log(1 + D/3) 


≤ E kθbplug − θ∗ k∞ ≤ c2 γ (log log(1 + D/3))−1 ∧ 1 .

c1 γ (41)
N N
A few comments are in order. First, note that equation (40) guarantees that we have
1 1 1
√ · k(I − γP)−1 σ(θ∗ )k∞ . √ kθ∗ kspan
N D (1 − γ)N
for large values of the dimension D, so that the first term in the guarantee (5b) is dominated by
1
the second. In particular, suppose that D  (1−γ) 2 ; then we have

1 1
√ · k(I − γP)−1 σ(θ∗ )k∞  kθ∗ kspan .
N N
In other words, if our analysis was loose in that the error of the plug-in estimator depended only
on the functional √1N · k(I − γP)−1 σ(θ∗ )k∞ , then it would be impossible to prove a lower bound
kθ∗ k
span
that involves the quantity N . On the other hand, equation (41) shows that this such a lower
kθ∗ k
bound can indeed be proved: the plug-in error is characterized precisely by the quantity γ Nspan
up to a logarithmic factor in the dimension D.
Second, note that while equation (41) shows that the plug-in error must have some span semi-
norm dependence, it falls short of showing the stronger lower bound
ζ h i
c1 ≤ E kθbplug − θ∗ k∞ , (42)
(1 − γ)N

which would show, for instance, that Corollary 1(a) is sharp up to a logarithmic factor. We
conjecture that there is an MRP for which the bound (42) holds.
Finally, it is worth commenting on the logarithmic factor that appears in the upper bound
of equation (41). Note that for sufficiently large D, the logarithmic factor is proportional to
log D/ log log D. This is consequence of applying Bennett’s inequality instead of Bernstein’s in-
equality, and we conjecture that the same factor ought to replace the factor log D factor multiplying
the span semi-norm in the upper bound (5b).

A.1 Proof of Proposition 1


In order to prove Proposition 1, it suffices to construct an MRP satisfying condition (40) and
compute its plug-in estimator in closed form. With this goal in mind, suppose that for simplicity
that D is divisible by three, and consider D/3 copies of the 3-state MRP from Figure 3(a). By
µ
construction, we have kσ(θ∗ )k∞ = µ q(1 − q), and kθ∗ kspan = (1−γ) . Setting q = N10D , we see that
p
µ
condition (40) is immediately satisfied with ζ = (1−γ) .

32
It remains to verify the claim (41). Note that the plug-in estimator for this MRP can be
computed in closed form. In particular, it is straightforward to verify that for each state i having
reward µ/2, we have

N (1 − γ)  b 
d
θplug (i) − θi∗ = Bin(N, q) − N q, (43)
γµ

where we have used the notation θbplug (i) to denote the i-th entry of the vector θbplug . Furthermore,
these D/3 random variables are independent. Thus, the (scaled) `∞ -error of the plug-in estimator is
equal to the maximum absolute deviation in a collection of independent binomial random variables.

Proof of inequality (41), part (a): The following technical lemma provides a lower bound on
the deviation of binomials, and its proof is postponed to the end of this section.
1

Lemma 8. Let X1 , . . . , Xk denote independent random variables with distribution Bin n, 3kn . Let
Yj = Xj − E[Xj ] for each 1 ≤ j ≤ k. Then, we have
 
4
E max |Yj | ≥ .
1≤j≤k 9

Applying Lemma 8 with k = D/3 in conjunction with the characterization (43), and substituting
our choices of the pair (µ, q) yields
h i 4 ζγ
E kθbplug − θ∗ k∞ ≥ · .
9 N

Proof of inequality (41), part (b): Corollary 3.1(ii) and Lemma 3.3 of Wellner [Wel17] yield, to
the best of our knowledge, the sharpest available upper bound on the maximum absolute deviation
of Bin(n, q) random variables in the regime nq(1 − q)  1:


 
log(1 + k)
E max |Yj | ≤ 12 · if log(1 + k) ≥ 5. (44)
1≤j≤k log log(1 + k)

Combining this bound with the Bernstein bound when k is small, and substituting the various
quantities completes the proof.

1
Proof of Lemma 8: Employing the shorthand q = 3kn , we have
   
E max |Yj | ≥ (1 − nq) · Pr max Xj ≥ 1
1≤j≤k 1≤j≤k

= (1 − nq) · (1 − (1 − q)nk )
 s 
 3nk
2 3 1
≥ · 1 − 1− 
3 3nk
2  −1/3
 4
≥ · 1−e ≥ .
3 9

33
B Calculations for the “hard” sub-class
Recall from equation (34) our previous calculation of the value function and standard deviation,
from which we have
p p
∗ p(1 − p) −1 ∗ p(1 − p)
kσ(θ )k∞ = ν(1 − τ ) , k(I − γP) σ(θ )k∞ = ν(1 − τ ) ,
1 − γp (1 − γp)2
4γ−1
and kθ∗ kspan = ν(1 − τ ) 1−γp
1
. Substituting in our choices ν = 1, p = 3γ , and τ = 1 − (1 − γ)α
and simplifying by employing inequality (36), we have
 0.5−α  1.5−α  1−α
∗ 1 −1 ∗ 1 ∗ 1
kσ(θ )k∞ ∼ , k(I − γP) σ(θ )k∞ ∼ , and kθ kspan ∼ ,
1−γ 1−γ 1−γ

for each discount factor γ ≥ 12 . Here, the ∼ notation indicates that the LHS can be sandwiched
between two terms that are proportional to the RHS such that the factors of proportionality are
strictly positive and γ-independent.
For the plug-in estimator, its performance will be determined by the maximum of the two terms
1.5−α 2−α
k(I − γP)−1 σ(θ∗ )k∞ kθ∗ kspan
 
1 1 1 1
√ ∼√ and ∼ .
N N 1−γ (1 − γ)N N 1−γ
1
In the regime N % 1−γ , the first term will be dominant.

References
[AJK19] A. Agarwal, N. Jiang, and S. M. Kakade. Reinforcement learning: Theory and algo-
rithms. Technical Report, CS Department, UW Seattle, 2019.

[AKY20] A. Agarwal, S. M. Kakade, and L. F. Yang. Model-based reinforcement learning with a


generative model is minimax optimal. In Conference on Learning Theory, pages 67–83,
2020.

[AMGK11] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen. Speedy Q-learning. In


Advances in Neural Information Processing Systems, pages 2411–2419, 2011.

[AMK13] M. G. Azar, R. Munos, and H. J. Kappen. Minimax PAC bounds on the sample com-
plexity of reinforcement learning with a generative model. Machine Learning, 91:325–
349, 2013.

[AOM17] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement
learning. In Proceedings of the International Conference on Machine Learning, 2017.

[BB96] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Machine learning, 22(1-3):33–57, 1996.

[BE02] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of machine learning
research, 2(Mar):499–526, 2002.

34
[Ber95a] D. P. Bertsekas. Dynamic programming and stochastic control, volume 1. Athena
Scientific, Belmont, MA, 1995.

[Ber95b] D.P. Bertsekas. Dynamic programming and stochastic control, volume 2. Athena Sci-
entific, Belmont, MA, 1995.

[BGSL08] P. Balakrishna, R. Ganesan, L. Sherry, and B. S. Levy. Estimating taxi-out times with
a reinforcement learning algorithm. In Proceedings of the IEEE/AIAA 27th Digital
Avionics Systems Conference, pages 3–D. IEEE, 2008.

[BM00] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic ap-
proximation and reinforcement learning. SIAM Journal on Control and Optimization,
38(2):447–469, 2000.

[BM11] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic optimization algo-


rithms for machine learning. In Advances in neural information processing systems,
December 2011.

[Bor98] V. S. Borkar. Asynchronous stochastic approximations. SIAM Journal on Control and


Optimization, 36(3):840–851, 1998.

[Boy02] J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine


learning, 49(2-3):233–246, 2002.

[BRS18] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference
learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.

[BT96] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific,


1996.

[CFMW19] Y. Chen, J. Fan, C. Ma, and K. Wang. Spectral method and regularized mle are both
optimal for top-k ranking. The Annals of Statistics, 47(4):2204–2235, 2019.

[CL04] T. T. Cai and M. G. Low. An adaptation theory for nonparametric confidence intervals.
The Annals of statistics, 32(5):1805–1840, 2004.

[CMSS20] Z. Chen, S. T. Maguluri, S. Shakkottai, and K. Shanmugam. Finite-sample anal-


ysis of stochastic approximation using smooth convex envelopes. arXiv preprint
arXiv:2002.00874, 2020.

[DB15] C. Dann and E. Brunskill. Sample complexity of episodic fixed-horizon reinforcement


learning. In Advances in Neural Information Processing Systems, pages 2818–2826,
2015.

[dlPG12] V. de la Pena and E. Giné. Decoupling: from dependence to independence. Springer


Science & Business Media, 2012.

[DM17] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. arXiv preprint
arXiv:1707.03770, 2017.

35
[DMR19] T. T. Doan, S. T. Maguluri, and J. Romberg. Finite-time performance of distributed
temporal difference learning with linear function approximation. arXiv preprint
arXiv:1907.12530, 2019.

[DNP14] C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: A
survey and comparison. The Journal of Machine Learning Research, 15(1):809–883,
2014.

[DSTM18] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Finite sample analyses for TD(0)
with function approximation. In Thirty-Second AAAI Conference on Artificial Intelli-
gence, 2018.

[Dur99] R. Durrett. Essentials of stochastic processes, volume 1. Springer, 1999.

[EDM03] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of machine
learning research, 5:1–25, 2003.

[Fel66] W. Feller. An Introduction to Probability Theory and its Applications: Volume II. John
Wiley and Sons, New York, 1966.

[FMP08] J. Frank, S. Mannor, and D. Precup. Reinforcement learning in the presence of rare
events. In Proceedings of the International conference on Machine learning, pages 336–
343. ACM, 2008.

[Gos09] A. Gosavi. Reinforcement learning: A tutorial survey and recent advances. INFORMS
Journal on Computing, 21(2):178–192, 2009.

[HJ85] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cam-
bridge, 1985.

[JA18] N. Jiang and A. Agarwal. Open problem: The dependence of sample complexity lower
bounds on planning horizon. In Conference On Learning Theory, pages 3395–3398,
2018.

[JJS94] T. Jaakkola, M. I. Jordan, and S. P. Singh. Convergence of stochastic iterative dynamic


programming algorithms. In Advances in neural information processing systems, pages
703–710, 1994.

[JVHW15] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of


discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885,
2015.

[JVV86] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani. Random generation of combinatorial


structures from a uniform distribution. Theoretical Computer Science, 43:169–188,
1986.

[Kak03] S. M. Kakade. On the sample complexity of reinforcement learning. PhD thesis, Gatsby
Computational Neuroscience Unit, 2003.

36
[KPR+ 20] K. Khamaru, A. Pananjady, F. Ruan, M. J. Wainwright, and M. I. Jordan. Is tem-
poral difference learning optimal? An instance-dependent analysis. arXiv preprint
arXiv:2003.07337, 2020.

[KS99] M. Kearns and S. Singh. Finite-sample convergence rates for Q-learning and indirect
algorithms. In Advances in neural information processing systems, 1999.

[LH14] T. Lattimore and M. Hutter. Near-optimal PAC bounds for discounted MDPs. Theo-
retical Computer Science, 558:125–143, 2014.

[LL20] G. Lecué and M. Lerasle. Robust machine learning by median-of-means: theory and
practice. Annals of Statistics, 48(2):906–931, 2020.

[LS18] C. Lakshminarayanan and C. Szepesvári. Linear stochastic approximation: How far


does constant step-size and iterate averaging go? In Proceedings of the International
Conference on Artificial Intelligence and Statistics, pages 1347–1355, 2018.

[MMM14] O.-A. Maillard, T. A. Mann, and S. Mannor. How hard is my MDP? the distribution-
norm to the rescue. In Advances in Neural Information Processing Systems, pages
1835–1843, 2014.

[MWCC18] C. Ma, K. Wang, Y. Chi, and Y. Chen. Implicit regularization in nonconvex statistical
estimation: Gradient descent converges linearly for phase retrieval and matrix com-
pletion. In Proceedings of the International Conference on Machine Learning, pages
3345–3354, 2018.

[NJLS09] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation


approach to stochastic programming. SIAM Jour. Opt., 19(4):1574–1609, 2009.

[NY83] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in


Optimization. John Wiley and Sons, New York, 1983.

[PJ92] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.


SIAM J. Control and Optimization, 30(4):838–855, 1992.

[PPH16] J. Pazis, R. E. Parr, and J. P. How. Improving PAC exploration using the median of
means. In Advances in Neural Information Processing Systems, pages 3898–3906, 2016.

[Put05] M. L. Puterman. Markov decision processes: Discrete stochastic dynamic programming.


Wiley, 2005.

[SB18] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press,


Cambridge, MA, 2nd edition, 2018.

[SJ19] M. Simchowitz and K. G. Jamieson. Non-asymptotic gap-dependent regret bounds for


tabular MDPs. In Advances in Neural Information Processing Systems, pages 1151–
1160, 2019.

[SMT+ 18] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht. Learning without mixing:
Towards a sharp analysis of linear system identification. In Conference on Learning
Theory, pages 439–473, 2018.

37
[SSM07] D. Silver, R. S. Sutton, and M. Müller. Reinforcement learning of local shape in
the game of Go. In Proceedings of the International Joint Conferences on Artificial
Intelligence, volume 7, pages 1053–1058, 2007.

[Sut88] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine


learning, 3(1):9–44, 1988.

[SWW+ 18] A. Sidford, M. Wang, X. Wu, L. Yang, and Y. Ye. Near-optimal time and sample
complexities for solving Markov decision processes with a generative model. In Advances
in Neural Information Processing Systems, 2018.

[SWWY18] A. Sidford, M. Wang, X. Wu, and Y. Ye. Variance reduced value iteration and faster
algorithms for solving Markov decision processes. In Proceedings of the Symposium on
Discrete Algorithms (SODA), 2018.

[SY19] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation
and TD learning. In Conference on Learning Theory, pages 2803–2830, 2019.

[Sze09] C. Szepesvári. Algorithms for reinforcement learning. Morgan-Claypool, 2009.

[Tad04] V. B. Tadic. On the almost sure rate of convergence of linear stochastic approximation
algorithms. IEEE Transactions on Information Theory, 50(2):401–409, 2004.

[TVR97] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-diffference learning with function
approximation. In Advances in neural information processing systems, pages 1075–
1081, 1997.

[UKM+ 08] T. Ueno, M. Kawanabe, T. Mori, S. Maeda, and S. Ishii. A semiparametric statistical
approach to model-free policy evaluation. In Proceedings of the International Confer-
ence on Machine Learning, pages 1072–1079, 2008.

[vdV00] A. W. van der Vaart. Asymptotic statistics, volume 3. Cambridge university press,
2000.

[Vin90] A. Vince. A rearrangement inequality and the permutahedron. Amer. Math. Monthly,
97(4):319–323, 1990.

[Wai19a] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cam-


bridge University Press, Cambridge, UK, 2019.

[Wai19b] M. J. Wainwright. Stochastic approximation with cone-contractive operators: Sharper


`∞ -bounds for Q-learning. arXiv preprint arXiv:1905.06265, 2019.

[Wai19c] M. J. Wainwright. Variance-reduced Q-learning is minimax optimal. arXiv preprint


arXiv:1906.04697, 2019.

[Wel17] J. A. Wellner. The Bennett-Orlicz norm. Sankhya A, 79(2):355–383, 2017.

[WOS+ 03] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. J. Weinberger. Inequalities


for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep,
2003.

38
[WW20] Y. Wei and M. J. Wainwright. The local geometry of testing in ellipses: Tight con-
trol via localized Kolmogorov widths. IEEE Transactions on Information Theory,
66(8):5110–5129, 2020.

[Yu94] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences.
The Annals of Probability, pages 94–116, 1994.

[ZB19] A. Zanette and E. Brunskill. Tighter problem-dependent regret bounds in reinforcement


learning without domain knowledge using value function bounds. In Proceedings of the
International Conference on Machine Learning, volume 97, pages 7304–7312, 2019.

[ZCDL16] Y. Zhu, S. Chatterjee, J. Duchi, and J. Lafferty. Local minimax complexity of stochastic
convex optimization. In Advances in Neural Information Processing Systems, pages
3431–3439, 2016.

[ZKB19] A. Zanette, M. J. Kochenderfer, and E. Brunskill. Almost horizon-free structure-aware


best policy identification with a generative model. In Advances in Neural Information
Processing Systems, pages 5625–5634, 2019.

39

You might also like