
Modeling Bellman-error with Logistic Distribution with Applications in Reinforcement Learning

Outongyi Lv^{a,b}, Bingxin Zhou^{a}, Lin F. Yang^{c}

a Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China
b School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China
c Department of Electrical and Computer Engineering, University of California, Los Angeles, Los Angeles, America

Abstract

In modern Reinforcement Learning (RL) approaches, optimizing the Bellman


error is a critical element across various algorithms, notably in deep Q-Learning
and related methodologies. Traditional approaches predominantly employ the
mean-squared Bellman error (MSELoss) as the standard loss function for neural
network training, often without considering the actual distribution of Bellman
errors. However, while the MSE estimator is statistically optimal under a Gaus-
sian error distribution, it may not adequately capture the essential characteris-
tics of RL applications when this assumption may not hold. In this paper, we
re-examine this foundational aspect by investigating the distribution of Bellman
errors in RL training. Our study reveals that the Bellman approximation error
tends to follow the Logistic distribution, rather than the commonly assumed
Normal distribution. This insight led us to propose the use of the Logistic max-
imum likelihood function (LLoss) as an alternative to MSELoss. We rigorously
tested this hypothesis through extensive numerical experiments in diverse online
and offline RL environments. Our findings show that integrating the Logistic
correction into loss functions of various baseline RL methods consistently yields
superior performance compared to their MSE counterparts. Additionally, we
employed Kolmogorov–Smirnov tests to further substantiate that the Logistic
distribution provides a more accurate fit for approximating Bellman errors. Our
research also makes a novel theoretical contribution by drawing a clear connec-
tion between the distribution of Bellman error and the practice of proportional



reward scaling, a common technique for performance enhancement in RL. This
relationship is explored through a comprehensive distribution-based analysis.
Furthermore, we delve into the sample-accuracy trade-off involved in approxi-
mating the Logistic distribution, utilizing the concept of Bias–Variance decom-
position. The theoretical and empirical insights offered in this study lay a signif-
icant foundation for future research, potentially leading to enhanced methodolo-
gies and understandings in RL, particularly focusing on the distribution-based
aspects of Bellman error optimization.
Keywords: Reinforcement Learning, Logistic Distribution, Reward Scaling,
Bellman Error

1. Introduction

Reinforcement Learning (RL) has emerged as a dynamic and transforma-


tive field within artificial intelligence, aimed at empowering agents to interact
intelligently with their environments to achieve the highest possible cumulative
rewards. This ambition has propelled RL to the forefront of technological ad-
vancements, manifesting in its successful application across a spectrum of areas.
Notably, RL algorithms have demonstrated profound capabilities in mastering
strategic games [1, 2, 3], where they devise complex strategies and adapt to
opponents’ moves with remarkable proficiency. Beyond the gaming domain, RL
has also tackled real-world logistical problems with significant success, such as
optimizing routes for capacitated vehicle routing [4, 5, 6], showcasing its util-
ity in solving problems that require sophisticated planning and decision-making
under uncertainty.
At the core of these advancements is the Bellman equation [7], a princi-
ple that underpins many Q-Learning algorithms and serves as a guidepost for
achieving optimal or near-optimal solutions in RL tasks. The Bellman equation
articulates a recursive relationship, establishing that the value of a state under
an optimal policy is equal to the maximum expected return from that state, con-
sidering immediate rewards and subsequent states’ values. This recursive nature

Figure 1: Evolving distributions of Bellman error (as defined by (14)) at different
epochs in the online LunarLanderContinuous-v2 environment.

is pivotal for understanding the dynamics of decision-making in RL, enabling


agents to evaluate the long-term consequences of their actions systematically.
Among these Q-Learning-based algorithms, the Soft Actor Critic (SAC) method
[8, 9] has utilized the soft Bellman operator to significantly enhance model per-
formance and stability in Online RL. These advancements, alongside other re-
finements [10, 11, 12], represent a significant stride in RL techniques. Similarly,
in offline RL, the discovery of substantial overestimation in Q-value estimations
led to the development of the Conservative Q-Learning (CQL) framework [13],
prompting further advancements in the field [14, 15, 16, 17].
A crucial aspect in optimizing these Q-Learning methods is the minimization
of Bellman error [18], essential for accurately representing the value function of
state-action pairs. In these approaches, the minimization of Bellman error has
relied on the mean-squared Bellman error (MSELoss), which, while being the
statistically optimal estimator under the normal distribution of errors, may not
act optimally under other distributions. Some recent advances, such as [17], explore the possibility of applying a new loss function to the Bellman errors and obtain better performance than MSELoss. In this paper, we attempt to answer a critical question:
“Which distribution more accurately characterizes the Bellman error in diverse RL settings?”
Based on a comprehensive analysis, we discover that the Logistic distribu-
tion, rather than the Normal distribution, more accurately characterizes Bell-
man errors in various RL environments. This insight is both theoretically inno-

vative and practically significant, as it informs the design of more effective RL
algorithms. Our research makes significant contributions in several key areas:

• We identify a characteristic Logistic distribution for Bellman errors, chal-


lenging the traditional belief in a Normally distributed Bellman error. Our
findings are supported by both theoretical proofs and empirical evidence from
numerical experiments.

• By exploring the sampling error of the Logistic distribution using the Bias-
Variance decomposition, we provide practical guidelines for optimal batch
sizing in neural network training, enhancing computational efficiency.

• Through extensive testing in eight online and nine offline RL environments,


we confirm the robustness of the Logistic distribution preference for Bellman
error, corroborated by Kolmogorov–Smirnov tests.

Furthermore, our study of the distribution of Bellman errors leads to an in-


triguing insight into the issue of proportional reward scaling in RL. Empirically,
adjusting the reward magnitude with a scaling factor has been a common strat-
egy to improve training effectiveness [19, 20]. Our research not only confirms the
validity of this practice but also provides a theoretical basis for determining the
optimal bounds of scaling factors, a critical aspect that has been underexplored
in the literature. This additional finding exemplifies how a deeper understand-
ing of Bellman error distribution can have broader implications in the field of
RL.
The structure of this paper is as follows: Section 2 introduces some related
work that is pertinent to our study. Section 3 presents the important fundamental definitions and basic algorithms in the realm of RL. Sections 4-7 constitute our main contribution. Section 4 analyzes the Bellman error under the Logistic distribution with Gumbel and Normal initialization. Section 5 analyses the
natural connection between Logistic distribution and the reward scaling prob-
lem. Section 6 gives the method for sampling from the Logistic distribution.
Section 7 provides an alternative formulation for MSELoss. Section 8 conducts

the numerical experimental analysis and ablation study for our method. Finally,
we summarize this paper in Section 9 and discuss the directions for future work.

2. Related Work

In this section, we will introduce some recent research that is relevant to our work, summarized in two parts. Regarding the Bellman equation and the Bellman error, Extreme Q-Learning (XQL) [17] defines a novel sample-free objective towards optimal soft-value functions in the maximum-entropy RL setting using the Gumbel distribution; it uses the maximum likelihood function of the Gumbel distribution to avoid directly sampling the maximum entropy. Their frameworks mark a significant departure from established practices
and offer exciting prospects for advancements in RL optimization techniques.
For example, Implicit Diffusion Q-learning (IDQL) [21] used the samples from
a diffusion-parameterized behavior policy to attain better results. Inverse Pref-
erence Learning [22] is proposed for learning from offline data without the need
for learning the reward. PROTO [23] is proposed to overcome some limitations
and achieve superior performance. In addition, some researchers [24] expressed skepticism towards MSELoss because it is a non-convex function and employed an improved convex loss function as an alternative. However, they did not fundamentally elucidate the inadequacies of the Normal distribution behind MSELoss. Subsequent papers [25, 26, 27] made improvements in terms of convexity. The authors of [25] express skepticism about the scheme of the Bellman equation, arguing from a convexity standpoint why the Bellman equation may not be an ideal objective function, while [26] and [27] focus on further optimizing and enhancing convex objective functions, aiming to illustrate that direct optimization using MSELoss combined with Bellman error is incorrect. However, none of them explain the inherent issues of the MSELoss from
a distributional perspective. Additionally, recent work [28] has also highlighted
the pessimistic outcomes of MSELoss on Bellman error performance for offline
RL and provided a reasoned explanation from an offline perspective under the

distribution of the relevant dataset.
The issue of reward scaling, which can be regarded as one of the parts
of reward shaping, is also one of the focal points in RL. Many scholars have
conducted numerous studies on this topic. Some researchers [29] gave the scaling
rules and scaling functions within the experience. Alternatively, some scholars
link the scaling problem to sampling complexity, aiming to enhance sample
sampling efficiency by setting scaling functions [30]. In addition to this, some
individuals choose to bypass manual scaling and instead focus on learning the
reward function [31, 32]. While these approaches to some extent shed light on
the issues of reward scaling and the agent performance, they did not explicitly
connect reward scaling with the Bellman error objective from a distributional
perspective and explain the reason for a saturation upper bound during scaling.

3. Preliminaries

This section provides a concise introduction to foundational concepts in RL.


Section 3.1 presents basic definitions and notations of RL. Section 3.2 defines
the target of RL, following a detailed derivation in 3.3 for Q-Learning. Sec-
tion 3.4 outlines the fundamental definitions and key properties of the Logistic
and Gumbel distributions.

3.1. Concepts and Notations

RL maximizes the expected cumulative reward in a Markov decision process defined by a tuple (S, A, P, r, γ), where S and A respectively denote the state and action space. P(s′|s, a) is the state transition probability from state s toward the next state s′ under action a. Here r defines the reward of taking an action a at the current state s: for an arbitrary state s ∈ S, the reward obtained by performing an arbitrary action a ∈ A is denoted r(s, a). γ ∈ (0, 1) is the
discount factor on future rewards. Online and offline RL differ mainly in how the agent interacts with the environment. In online training, the agent is able to interact with the environment and receive immediate feedback, from which it is expected to learn progressively toward the optimal strategy. In contrast, interaction between the agent and the environment is unavailable in offline training, in which case the agent learns from a large offline dataset to recognize intrinsic patterns that are expected to generalize to similar environments. Since generalization is a harder problem in general, the performance of offline RL is typically significantly inferior to its online counterpart.

3.2. Objectives in Reinforcement Learning

The target of RL, as defined by the Actor-Critic (AC) algorithm [33], is to


find the optimal policy π(a|s) that maximizes the cumulative discounted reward
at a fixed horizon T , i.e., the finite-horizon discounted objective:
" T #
X
Eat ∼π(at |st ) γ t r(st , at ) . (1)
t=0

Alternatively, Soft AC (SAC) [8, 9] encompasses soft conditions in future rewards to learn the policy π with a regularization strength ζ that maximizes:
$$\mathbb{E}_{a_t \sim \pi(a_t|s_t)}\left[\sum_{t=0}^{T} \gamma^t \big(r(s_t, a_t) - \zeta \log \pi(a_t|s_t)\big)\right]. \quad (2)$$

A more general version proposed in [17] takes the Kullback-Leibler divergence


(KL divergence) between the policy π and the prior distribution of a reference
distribution µ to augment the reward function in the objective:
" T #
X
t π(at |st )
Eat ∼π(at |st ) γ (r(st , at ) − ζ log ) . (3)
t=0
µ(at |st )

The reference distribution µ(a|s) follows different sampling conventions in dif-
ferent types of RL to fit the behavioral policy [34]. Specifically, in online RL, it
is usually sampled from a uniform distribution, while in offline RL, it is usually
sampled from the empirical distribution of the offline training data.

3.3. (Soft) Bellman Equation

The cumulative discounted reward can be used to formulate the optimal
Bellman iterative equation for Q-learning [35]. It is defined as:

$$Q^*(s, a) = r(s, a) + \gamma \max_{a'} Q^*(s', a'). \quad (4)$$

For conciseness, we derive the equation from (3). The same method can be
directly applied to the other two variants in (1) and (2). All these objectives
rely on the Bellman iteration, which can be inspired from (4), i.e.,

$$Q^{t+1}(s, a) = r(s, a) + \gamma \max_{a'} Q^{t}(s', a'). \quad (5)$$

The analysis in this research is based on (5). However, for completeness, we also introduce other update methods. Consider the general form of the optimal Bellman iterative equation from (3), which reads:
$$Q^{k+1}(s, a) \leftarrow \arg\min_{Q}\left(r(s, a) + \mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a),\,a'\sim\pi}\left[Q(s', a') - \zeta \log\frac{\pi(a'|s')}{\mu(a'|s')}\right] - Q^{k}(s, a)\right)^{2}. \quad (6)$$
The corresponding solution to the Bellman iteration with respect to (3) is then:
$$Q^{k+1}(s, a) = r(s, a) + \mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a),\,a'\sim\pi}\left[Q^{k}(s', a') - \zeta \log\frac{\pi(a'|s')}{\mu(a'|s')}\right]. \quad (7)$$

To take the optimal strategy with the maximum Q^t(s', a') in (5), the corresponding π^* has to satisfy:
$$\pi^*(a'|s') = \arg\max_{\pi}\,\mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a),\,a'\sim\pi}\left[Q^{k}(s', a') - \zeta \log\frac{\pi(a'|s')}{\mu(a'|s')}\right], \quad (8)$$

where $\sum_{a'} \pi^*(a'|s') = 1$. Applying the Lagrange multiplier method [36] (see Appendix A.8), we have:
$$\pi^*(a|s) = \frac{\mu(a|s)\,e^{Q^{k}(s,a)/\zeta}}{\sum_{a}\mu(a|s)\,e^{Q^{k}(s,a)/\zeta}}. \quad (9)$$

Consequently, simplifying (8) by (9) yields:
$$\mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a),\,a'\sim\pi^*}\left[Q^{k}(s', a') - \zeta \log\frac{\pi^*(a'|s')}{\mu(a'|s')}\right] \rightarrow \mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a)}\left[\zeta \log\sum_{a'}\mu(a'|s')\,e^{Q^{k}(s',a')/\zeta}\right]. \quad (10)$$

The $\max_{a'} Q^{k}(s', a')$ in (5) with respect to the optimal policy π^* is:
$$\max_{a'} Q^{k}(s', a') = \mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a)}\left[\zeta \log\sum_{a'}\mu(a'|s')\,e^{Q^{k}(s',a')/\zeta}\right]. \quad (11)$$

While it is challenging to estimate the log sum in (11), the XQL [17] employed
a Gumbel regression-based approach to circumvent the need for sampling esti-
mation.
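To make the soft-value derivation above concrete, the following small numerical sketch evaluates the soft-optimal policy (9) and the log-sum-exp value (11) for a single state. The Q-values, the uniform reference distribution µ, and ζ = 1 are illustrative assumptions, not values from the paper.

import numpy as np

# Minimal sketch of the soft-optimal policy (9) and the soft value (11)
# for one state with a small discrete action set.
zeta = 1.0
q = np.array([1.0, 2.0, 0.5, -1.0])        # Q^k(s', a') for each action a'
mu = np.full_like(q, 1.0 / q.size)         # uniform reference distribution

# Soft-optimal policy (9): softmax of Q / zeta weighted by mu.
unnorm = mu * np.exp(q / zeta)
pi_star = unnorm / unnorm.sum()

# Soft value (11): zeta * log-sum-exp; it approaches max_a Q as zeta -> 0
# and never exceeds it when mu sums to one.
soft_value = zeta * np.log(np.sum(mu * np.exp(q / zeta)))
print(pi_star, soft_value, q.max())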

3.4. Gumbel and Logistic Distribution

Before delving into the subsequent theoretical analysis, we introduce the


probability density function (PDF), cumulative distribution function (CDF),
and expectation of the Gumbel distribution and the Logistic distribution.

Definition 1 (Gumbel Distribution). If a random variable x follows a Gumbel distribution, i.e., x ∼ Gumbel(λ, η) with the location parameter λ and positive scale parameter η, then its expectation is λ + ηv, where v ≃ 0.58 is the Euler–Mascheroni constant. The corresponding PDF and CDF are:
$$\text{PDF}:\; p(x) = \frac{1}{\eta}\exp\left(-\left(\frac{x-\lambda}{\eta} + \exp\left(-\frac{x-\lambda}{\eta}\right)\right)\right), \qquad \text{CDF}:\; P(x) = \exp\left(-\exp\left(-\frac{x-\lambda}{\eta}\right)\right).$$

Definition 2 (Logistic Distribution). If a random variable x follows a Logistic distribution, i.e., x ∼ Logistic(λ, η) with the location parameter λ and positive scale parameter η, then its expectation is λ. The corresponding PDF and CDF are:
$$\text{PDF}:\; p(x) = \frac{1}{\eta}\,\frac{\exp\left(-(x-\lambda)/\eta\right)}{\left(1 + \exp\left(-(x-\lambda)/\eta\right)\right)^{2}}, \qquad \text{CDF}:\; P(x) = \frac{1}{1 + \exp\left(-(x-\lambda)/\eta\right)}.$$

4. Characterization of Bellman Error with Logistic Distribution

In this section, we conducted our analyses under the most basic settings,
which means that we disregarded the impact of the reward distribution from the
dataset, neglected the influence of state transition probabilities, and imposed
a finite action space. Our purpose is to demonstrate that if we acknowledge

the suitability of using the Normal distribution or the Gumbel distribution for
Bellman error under all conditions, then modeling Bellman error with these
distributions should theoretically and experimentally be interpretable in such
a straightforward basic setting. However, we will show that the Bellman error
no longer conforms to either of these distributions, but rather follows a biased
Logistic distribution, as experimentally supported in Figure 1.
The specific structure of this section is outlined as follows. We initiate our
exploration in Section 4.1 by defining the Bellman error with parameterization θ
and analyzing the distribution of the Bellman error under Gumbel initialization.
While the Gumbel initialization is not applied as commonly as Normal initial-
ization in practice, we present the formulation of the Normal approximation
for the Gumbel distribution in Section 4.2. This approximation allows for the
substitution of Gumbel initialization in Section 4.1 with Normal initialization.

4.1. Gumbel Initialization Approximation for Logistic Bellman Error

As mentioned at the beginning of Section 4, we aim to analyze the distri-


200 bution of Bellman error under the most basic scenario. As a foundational basis
for subsequent Q-Learning algorithms, it is crucial to highlight that (4) and (5)
serve as the keystone equation. Consequently, our analysis is conducted based
on (4) and (5). We do not analyze the soft update (7) in this paper.
We now delve into the exact updating process. As per (5), providing the
205 initial values is imperative for initiating the iteration. Thus, we designate the
t-th iteration value associated with the pair (s, a) as Q̂t (s, a) and take Q̂0 (s, a)
with t = 0 as the start. Denote Q∗ (s, a) as the optimal solution for (5), it
should satisfy the optimal Bellman equation introduced in (4):

$$Q^*(s, a) = r(s, a) + \gamma \max_{a'} Q^*(s', a').$$

It is well known that there exists a gap between the iterated values and the true values. We now define the error between Q^*(s, a) and Q̂^t(s, a) as ϵ^t(s, a), which means:
$$\hat{Q}^{t}(s, a) = Q^*(s, a) + \epsilon^{t}(s, a). \quad (12)$$

According to the Bellman iteration in (5), each Q̂t (s, a) can be obtained through
iterative update from the initialization Q̂0 (s, a). We will show in Lemma 4
that the random variable ϵt (s, a) follows a biased Gumbel distribution under
Assumptions 1-3.
While (5) is capable of updating tabular Q-values, complex environments
often employ neural networks to parameterize the Q-function. We thus param-
eterize Q̂ and ϵ by θ. We refine the Q-function as Q̂θ (s, a) for the (s, a) pair
and represent the gap ϵ^t(s, a) in (12) as ϵ^θ(s, a), which revises (12) to:
$$\hat{Q}^{\theta}(s, a) = Q^*(s, a) + \epsilon^{\theta}(s, a). \quad (13)$$

We now define the parameterized Bellman error ε^θ as
$$\varepsilon^{\theta}(s, a) = \left(r(s, a) + \gamma \max_{a'} \hat{Q}^{\theta}(s', a')\right) - \hat{Q}^{\theta}(s, a). \quad (14)$$

Notably, this parameterization focuses solely on εθ (s, a) generated by (13). We


omit potential errors from other parameterization aspects, such as optimizer,
gradient updating methods, or potential errors introduced by the network ar-
chitecture and other unquantifiable errors. As supported by empirical evidence
in Section 7, neglecting these additional errors would not affect the validity of
our theory.
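For concreteness, the following PyTorch-style sketch collects samples of ε^θ(s, a) from a batch of transitions for a discrete-action Q-network. The q_net interface and the batch layout are illustrative assumptions of this sketch, not part of any specific baseline implementation.

import torch

def bellman_errors(q_net, batch, gamma=0.99):
    """Sample the parameterized Bellman error (14) for a batch of transitions.

    Assumes q_net(state) returns Q-values over a discrete action set and
    batch holds tensors (s, a, r, s_next)."""
    s, a, r, s_next = batch
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # \hat{Q}^theta(s, a)
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max(dim=1).values      # r + gamma * max_a' \hat{Q}^theta(s', a')
    return target - q_sa                                          # epsilon^theta(s, a)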
Below we present Lemmas 1-6 associated with ϵt (s, a) and εθ (s, a). The
complete proofs are in Appendix A.1-Appendix A.5. In particular, we will
show in Lemma 4 that the distribution of ϵt (s, a) has a non-zero mean and is
time-dependent. We commence our analysis under these four assumptions:

Assumption 1. The action space A contains a finite number of n elements,


i.e., |A| = n.

We will see the reason for Assumption 1 in the upcoming Lemma 3: an infinite action space does not necessarily guarantee the effectiveness of the max operator for the Gumbel distribution. In fact, this assumption can be considered standard because finite action spaces are quite common in practical problems [37, 38].

Assumption 2. There is an injection mapping T : (s, a) → s′ , such that the
next state s′ is uniquely determined by the current state s and action a.

Assumption 2 is for the convenience of our theoretical analysis, as stochastic state transition probabilities P(s′|s, a) would complicate the analysis of the error distribution. This assumption aligns with practical scenarios, especially those that disregard state transition probabilities [39, 40]. Therefore, we can also consider this assumption standard.

Assumption 3. The initial Q̂^0(s, a) values follow the same Gumbel distribution, independently across (s, a) pairs.

We will show in Lemmas 1 and 3 that the direct way to obtain a true Gumbel distribution, and to preserve its type during iterations, is to assume the initialization is Gumbel distributed. Lemma 1 shows that, with a finite number of samples, obtaining a true Gumbel distribution from other distributions is impossible.
In practice, Gumbel initialization is uncommon; we tend to prefer Normal initialization, so at first sight Assumption 3 is not standard but rather a strict assumption. Although Gumbel initialization does not align with common practice, in Section 4.2 we provide a method to replace the Gumbel initialization with Normal initialization and give the corresponding standard Assumption 3*. Hence, Assumption 3 can also be considered a standard assumption.

Assumption 4. The discount factor γ is very close to 1, which implies that we place greater importance on future rewards, aiding the agent in long-term decision-making. This can be expressed as 1 − γ ≤ κ0, where κ0 ≥ 0 is very close to 0.

We will see in Theorem 1 that Assumption 4 is necessary, as it ensures


the correctness of the Logistic distribution. Assumption 4 is also considered standard because, in the majority of settings, the discount factor γ is set to 0.99 or 0.95 [41, 42] to achieve effective training results.

In summary, Assumptions 1-4 appear to be standard. Therefore, such as-
sumptions are reasonable. Based on the assumptions above, we have the asso-
ciated lemmas.

Lemma 1. [43] For i.i.d. random variables X_1, ..., X_n ∼ f(X), where f(X) has exponential tails, let M_n = max({X_1, ..., X_n}). If there exist two constants a_n, b_n with respect to size n, where a_n > 0, satisfying:
$$\lim_{n\to\infty} P\left(\frac{M_n - b_n}{a_n} \le x\right) = G(x),$$
then G(x) is the CDF of the standard Gumbel distribution, i.e., $G(x) = e^{-e^{-x}}$.

The key idea of Lemma 1 is that obtaining the true Gumbel distribution
using the maximum operator under a finite sample size n is generally impos-
sible. However, we can approximate the Gumbel distribution under Normal
conditions. We will delve into this discussion in Section 4.2. The following
Lemma 2-3 describe the basic properties of the Gumbel distribution.

Lemma 2. If a random variable X ∼ Gumbel(A, B) follows Gumbel distri-


bution with location A and scale B, then X + C ∼ Gumbel(C + A, B) and
DX ∼ Gumbel(DA, DB) with arbitrary constants C ∈ R and D > 0.

Lemma 3. For a set of mutually independent random variables X_i ∼ Gumbel(C_i, β) (1 ≤ i ≤ n), where C_i is a constant related to X_i and β is a positive constant, then $\max_i(X_i) \sim \text{Gumbel}\left(\beta \ln \sum_{i=1}^{n} e^{C_i/\beta},\; \beta\right)$.

The key idea of Lemma 2 indicates that the Gumbel distribution maintains
its distributional type under linear transformations. Lemma 3 demonstrates
that a sequence of independent Gumbel distributions scaled with the same con-
stant maintains their distributional type with the maximum operation.
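The closure property of Lemma 3 is easy to verify numerically. The following Monte Carlo sketch, with arbitrary illustrative locations C_i and scale β, fits a Gumbel distribution to empirical maxima and compares the fitted location with β ln Σ_i e^{C_i/β}.

import numpy as np
from scipy import stats

# Monte Carlo sketch of Lemma 3: the max of independent Gumbel(C_i, beta)
# variables with shared scale beta is Gumbel(beta * log(sum_i exp(C_i / beta)), beta).
rng = np.random.default_rng(0)
beta = 2.0
C = np.array([0.3, -1.0, 1.5, 0.0])

samples = np.max(rng.gumbel(loc=C, scale=beta, size=(200_000, C.size)), axis=1)
loc_hat, scale_hat = stats.gumbel_r.fit(samples)

loc_theory = beta * np.log(np.sum(np.exp(C / beta)))
print(loc_hat, loc_theory)   # fitted location ~ theoretical location
print(scale_hat, beta)       # fitted scale ~ beta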
It is worth noting that, in conjunction with Assumptions 1 and 3, Lemma 3
shows us that γ maxa′ (Q̂0 (s′ , a′ )) ∼ Gumbel(C1 , β1 ) with constants C1 ∈ R, β1 >
0 that are determined by the initialization and are independent from (s, a) pair.
Based on this analysis, we next propose Lemma 4 to establish the relationship
between Q∗ and Q̂t for ϵt (s, a) defined in (12).

Lemma 4. For ϵ^t(s, a) defined in (12), under Assumptions 1-3, we show that:
$$\epsilon^{t}(s, a) \sim \text{Gumbel}\left(C_t(s, a) - \gamma \max_{a'} Q^*(s', a'),\; \beta_t\right),$$
where
$$C_1(s, a) = C_1, \qquad C_2(s, a) = \gamma\left(C_1(s, a) + \beta_1 \ln \sum_{i=1}^{n} e^{\frac{r(s', a_i)}{\beta_1}}\right),$$
and
$$C_t(s, a) = \gamma\left(\beta_{t-1} \ln \sum_{i=1}^{n} e^{\frac{r(s', a_i) + C_{t-1}(s', a_i)}{\beta_{t-1}}}\right) \quad (t \ge 3).$$
For β_t, it always holds that
$$\beta_t = \gamma^{t-1}\beta_1 \quad (t \ge 1).$$

Besides, ϵt (s, a) are independent for arbitrary pairs (s, a).

Remark 1. There is a special case in which the Gumbel distribution follows a simpler expression. For any s_1, s_2, define two sets S_1 = [r(s_1, a_1), r(s_1, a_2), ..., r(s_1, a_n)] and S_2 = [r(s_2, a_1), r(s_2, a_2), ..., r(s_2, a_n)]. If S_1 △ S_2 = ∅, then
$$\epsilon^{t}(s, a) \sim \text{Gumbel}\left(C_t - \gamma \max_{a'} Q^*(s', a'),\; \beta_t\right),$$
with
$$C_t = \gamma\left(C_{t-1} + \beta_{t-1} \ln \sum_{i=1}^{n} e^{\frac{r(s', a_i)}{\beta_{t-1}}}\right) \quad (t \ge 2), \qquad \beta_t = \gamma^{t-1}\beta_1 \quad (t \ge 1).$$

The key idea of Lemma 4 is that, under Assumptions 1-3, ϵ^t(s, a) follows a Gumbel distribution whose location parameter is associated with the (s, a) pair and whose scale parameter is time-dependent. This contradicts the assumption in [17] that E[ϵ^t(s, a)] = 0, and suggests that the independent unbiased assumption on ϵ^t(s, a) in [17] is not adequately justified. Before delving into the new theorem for the Bellman error, we need Lemmas 5-6.

Lemma 5. For random variables X ∼ Gumbel(CX , β) and Y ∼ Gumbel(CY , β),


if X and Y are independent, then (X − Y ) ∼ Logistic(CX − CY , β).

The key idea of Lemma 5 shows that subtracting two Gumbel distributions
with the same scale parameter results in a Logistic distribution. We will see
later that it plays a crucial role in the proof of Theorem 1. It is important
to note that X + Y will no longer follow the Logistic distribution. However,
it can be approximated by Generalized Integer Gamma (GIG) or Generalized
Near-Integer Gamma (GNIG) distributions [44].
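Lemma 5 can likewise be checked by simulation. The sketch below, with illustrative C_X, C_Y, and β, measures the Kolmogorov-Smirnov distance between the empirical difference of two independent Gumbel samples and the predicted Logistic(C_X − C_Y, β).

import numpy as np
from scipy import stats

# Monte Carlo sketch of Lemma 5: the difference of two independent Gumbel
# variables with the same scale beta follows Logistic(C_X - C_Y, beta).
rng = np.random.default_rng(0)
c_x, c_y, beta = 1.0, -0.5, 0.7

x = rng.gumbel(loc=c_x, scale=beta, size=500_000)
y = rng.gumbel(loc=c_y, scale=beta, size=500_000)
diff = x - y

# A small KS statistic indicates a good fit to the predicted Logistic law.
print(stats.kstest(diff, stats.logistic(loc=c_x - c_y, scale=beta).cdf))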

Lemma 6. If X ∼ Gumbel(A, 1), then both E[e^{−X}] and E[Xe^{−X}] are bounded:
$$\mathbb{E}[e^{-X}] < \left(\frac{20}{e^{2}} + 10e^{-e^{1/2}} + \frac{1}{2} - \frac{1}{2e}\right)e^{-A}.$$
When A > 0:
$$\mathbb{E}[Xe^{-X}] < \left(\frac{3}{20} + A\left(\frac{20}{e^{2}} + 10e^{-e^{1/2}} + \frac{1}{2} - \frac{1}{2e}\right)\right)e^{-A}.$$
When A ≤ 0:
$$\mathbb{E}[Xe^{-X}] < \frac{3}{20}\,e^{-A}.$$
Lemma 6 provides the bounds for E[e^{−X}] and E[Xe^{−X}]; these bounds are
also prepared for Theorem 1. Note that the bounds presented in Lemma 6 are
upper bounds and do not represent the supremum.
Next, we present Theorem 1, which defines the Logistic distribution of the Bellman error ε^θ(s, a) (formulated in (14)). We parameterize C_t and β_t in Lemma 4 as C_θ and β_θ, respectively. Theorem 1 is formulated as follows:

Theorem 1 (Logistic distribution of Bellman error). The Bellman error ε^θ(s, a) approximately follows the Logistic distribution under Assumptions 1-4. The degree of approximation can be measured by the upper bound of the KL divergence between
$$X \sim \text{Gumbel}\left(\beta_\theta \ln \sum_{i=1}^{n} e^{\frac{r(s', a_i) + C_\theta(s', a_i)}{\beta_\theta}},\; \beta_\theta\right)$$
and
$$Y \sim \text{Gumbel}\left(\gamma\beta_\theta \ln \sum_{i=1}^{n} e^{\frac{r(s', a_i) + C_\theta(s', a_i)}{\beta_\theta}},\; \gamma\beta_\theta\right).$$
Let $A^* = \ln \sum_{i=1}^{n} e^{\frac{r(s', a_i) + C_\theta(s', a_i)}{\beta_\theta}}$; we have the following conclusions:

1. If A* > 0, then $\mathrm{KL}(Y\|X) < \log\frac{1}{\gamma} + (1-\gamma)\left[A^*\left(\frac{20}{e^{2}} + 10e^{-e^{1/2}} - \frac{1}{2} - \frac{1}{2e}\right) + \frac{3}{20} - v\right]$.

2. If A* ≤ 0, then $\mathrm{KL}(Y\|X) < \log\frac{1}{\gamma} + (1-\gamma)\left[\frac{3}{20} - A^* - v\right]$.

3. The order of the KL divergence error is controlled at $O\left(\log\frac{1}{1-\kappa_0} + \kappa_0 A^*\right)$.

If the upper bound of the KL divergence is sufficiently small, then ε^θ(s, a) follows the Logistic distribution, i.e.,
$$\varepsilon^{\theta}(s, a) \sim \text{Logistic}\left(C_\theta(s, a) - \beta_\theta \ln \sum_{i=1}^{n} e^{\frac{r(s', a_i) + C_\theta(s', a_i)}{\beta_\theta}},\; \beta_\theta\right).$$

Remark 2. For the special case discussed in Remark 1, ε^θ(s, a) satisfies:
$$\varepsilon^{\theta}(s, a) \sim \text{Logistic}\left(-\beta_\theta \ln \sum_{i=1}^{n} e^{\frac{r(s', a_i)}{\beta_\theta}},\; \beta_\theta\right).$$

The proof of Theorem 1 can be found in Appendix A.6.


The key idea of Theorem 1 is that, within a certain range of KL divergence, the Bellman error can be effectively modeled by a Logistic distribution. Given that the discount factor γ is typically set to a value close to 1 (e.g., 0.99), as long as A* is not excessively large, the KL divergence range will be sufficiently small. For instance, in Figure 2 (a), the upper bounds of the KL divergence for different A* demonstrate that small A* values generally correspond to small KL divergence values. It is important to note that the Logistic distribution holds only for small A*. When A* becomes considerably large, the upper bound of the KL divergence increases to an inefficient stage (e.g., A* = 100 results in KL(Y||X) < 13 in Figure 2 (a)), where the impact of the discount factor becomes significant. Hence, employing the Logistic approximation proves advantageous over a Normal approximation for a more effective training trajectory. This advantage is particularly noticeable during the early stages of training. As depicted in Figure 2 (b), formulating the Bellman error with a Logistic distribution in Pendulum-v1 expedites more effective training for the average reward. This phenomenon correlates with the large β_θ value associated with the substantial gaps between different rewards during the early stages, leading to a remarkably small range of KL divergence. We additionally employ

Figure 2: (a) The upper bound of KL divergence with respect to different A∗ ratios
following Theorem 1. (b) Demonstration with Pendulum-v1 training. The average
reward by Implicit-Q-Learning, with the Bellman error modeled by Normal and Lo-
gistic distributions, respectively.

the Deep Q-Learning (DQN)^1 to illustrate the phenomenon observed in our experiments that LLoss can assist the agent in reaching a higher average reward level during the early stages of training, thereby expediting the agent's speed in finding the optimal solution. The results are in Figure 3. We delve further into
the discussion in Section 8.

Figure 3: The DQN training results in two discrete-action environments, LunarLander and CartPole-v1; the Logistic loss function can significantly reduce the training cost in the early stages and greatly accelerate the training process.

When optimizing the Bellman error, its expectation is anticipated to con-

1 https://github.com/hungtuchen/pytorch-dqn

verge toward zero, i.e.,
$$\mathbb{E}\left[\varepsilon^{\theta}(s, a)\right] \to 0. \quad (15)$$

However, Theorem 1 shows that the distribution of ε^θ(s, a) is biased, with an increasing kurtosis over time (for instance, see Figure 1). Establishing a direct approximation for the Bellman error is challenging because its expectation is related to the (s, a) pair, which makes it nearly impractical when updating parameters in neural networks. To simplify this challenge, in Section 7 we treat the Bellman error samples within the same batch as drawn from the same biased Logistic distribution. This approach has also been experimentally validated as feasible. More details will be provided in Section 7.
Another notable analysis is related to Assumption 3. Directly initializing neural networks with the Gumbel distribution is generally impractical, whereas the Normal distribution is more commonly used for initialization. To address this, we next discuss the finite approximation of the Gumbel distribution under Normal initialization.

4.2. Normal Initialization Approximation for Logistic Bellman Error

The theoretical validation of Theorem 1 and Lemma 4 in Section 4.1 is based on Assumption 3 (see Appendix A.3 and Appendix A.6). While this assumption contradicts the prevalent practice in neural networks, which commonly employ Normal initialization, our purpose is to find an approximation that allows us to represent the Gumbel distribution using the Normal distribution. To this end, we use the Normal distribution to approximate the Gumbel distribution under the max-operator in this section.
To commence, we define the exponential family of distributions and relevant notations. For a random variable X following an exponential family distribution with parameter ν, its PDF is given by
$$f(X) = \frac{\nu}{2\Gamma(1/\nu)}\sqrt{\frac{\Gamma(3/\nu)}{\Gamma(1/\nu)}}\;\exp\left(-\left(\frac{\Gamma(3/\nu)}{\Gamma(1/\nu)}\right)^{\nu/2}|X|^{\nu}\right).$$
In particular, define:
$$\theta^{\nu} = \nu - 1, \qquad C^{\nu} = \left(\frac{\Gamma(3/\nu)}{\Gamma(1/\nu)}\right)^{\nu/2}, \qquad D_0^{\nu} = \frac{\nu\,(C^{\nu})^{\frac{1-\nu}{\nu}}}{2\Gamma(1/\nu)},$$
$$\beta_N^{\nu} = \left[\frac{\theta^{\nu}}{\nu C^{\nu}}\,W_0\!\left[\frac{\nu C^{\nu}}{\theta^{\nu}}\,(D_0^{\nu} N)^{\nu/\theta^{\nu}}\right]\right]^{1/\nu},$$
$$D_1^{\nu} = -\left(1 - \frac{1}{\nu}\right)\frac{1}{C^{\nu}}, \qquad D_2^{\nu} = \left(1 - \frac{1}{\nu}\right)\left(2 - \frac{1}{\nu}\right)\frac{1}{(C^{\nu})^{2}}.$$

where W_0[·] is the real part of the Lambert W-function, and Γ(·) denotes the Gamma function. For Γ(n + 1/2), it is defined:
$$\Gamma\!\left(n + \frac{1}{2}\right) = \frac{(2n)!}{n!\,4^{n}}\sqrt{\pi}.$$

Notably, once ν is given, these parameters can be computed directly. Since we


are interested in the normal distribution, we take ν = 2 to reach the case that
f (X) follows the standard Normal distribution, actually:

$$f(X)\big|_{\nu=2} = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{X^{2}}{2}}.$$

Based on the definitions above, we now present a numerical method in Lemma 7


that approximates a Gumbel distribution by the Normal distribution.

Lemma 7. [45] Suppose X_1, X_2, X_3, ..., X_N are i.i.d. variables from a Normal distribution Normal(0, 1). Define f_N(X) as the PDF of X = max_i(X_i) and g(X) as the PDF of a standard Gumbel(0, 1); then $f_N(X) \simeq \frac{1}{a_N} g\!\left(\frac{X - b_N}{a_N}\right)$, where
$$a_N \simeq \frac{1}{2C^{\nu}\beta_N^{\nu}}\left[1 - \frac{\theta^{\nu}}{2C^{\nu}(\beta_N^{\nu})^{2}} + \frac{(\theta^{\nu})^{2} - 6C^{\nu}D_1^{\nu}}{4(C^{\nu})^{2}(\beta_N^{\nu})^{4}} - \frac{2(\theta^{\nu})^{3} - 32\theta^{\nu}C^{\nu}D_1^{\nu} - 20(C^{\nu})^{2}\left((D_1^{\nu})^{2} - 2D_2^{\nu}\right)}{16(C^{\nu})^{3}(\beta_N^{\nu})^{6}}\right],$$
$$b_N \simeq \beta_N^{\nu}\left[1 + \frac{D_1^{\nu}}{2C^{\nu}(\beta_N^{\nu})^{4}} - \frac{2\theta^{\nu}D_1^{\nu} + 2C^{\nu}\left((D_1^{\nu})^{2} - 2D_2^{\nu}\right)}{8(C^{\nu})^{3}(\beta_N^{\nu})^{6}}\right].$$
Lemma 7 shows us how to approximate the Gumbel distribution with a finite


set of samples from the Normal distribution. It allows us to relax Assumption 3
to Assumption 3∗ for Normal initialization.
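The qualitative content of Lemma 7 can be illustrated without evaluating the a_N, b_N expansion: the sketch below simply fits a Gumbel distribution to maxima of N standard Normal samples by maximum likelihood and reports the KS statistic; the sample size N = 256 is an arbitrary illustrative choice.

import numpy as np
from scipy import stats

# Empirical sketch of the idea behind Lemma 7: the maximum of N i.i.d.
# standard Normal samples is well approximated by a Gumbel distribution
# after an affine normalization (here fitted rather than computed from a_N, b_N).
rng = np.random.default_rng(0)
N = 256
maxima = rng.standard_normal((100_000, N)).max(axis=1)

b_hat, a_hat = stats.gumbel_r.fit(maxima)   # fitted location and scale
ks = stats.kstest(maxima, stats.gumbel_r(loc=b_hat, scale=a_hat).cdf)
print(b_hat, a_hat, ks.statistic)           # small KS statistic -> good Gumbel fit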

Assumption 3*. The initial Q̂^0(s, a) values follow a standard Normal distribution, independently across (s, a) pairs.

In this way, Assumption 3 becomes standard. Then with the previous As-
sumptions 1, 2, and 4, we can extend Lemma 4 and Theorem 1 to Normal
initialization. To establish an intuitive understanding of the revised Theorem 1
and Lemma 4 under Normal initialization, we now present a toy example with
a finite state space.

Example 1. Consider a scenario with five states {s_i}_{i=1}^{5} and an action space of 5000 actions {a_j}_{j=1}^{5000}. At each state s_i, taking any action a_j results in a transition from s_i to s_{i+1} with a constant reward r(s_i, a_j) ≡ 1.

Example 1 presents a finite state space that can be stored in Q ∈ R^{5×5000}. Then the Bellman equation can be uniquely optimized by
$$Q^*(s_i, :) = \sum_{k=0}^{5-i} 0.99^{k},$$

where Q(si , :) denotes the i-th row of Q with respect to the i-th state.
Figure 4 verifies Lemma 4 and Theorem 1 with this toy example by visual-
izing the first 4 iterations of the Bellman errors. We use Normal(0, 1) random
initialization for each element in the Q table and employ (5) for iterating. Define
$$\epsilon^{t}(s_1, :) = \hat{Q}^{t}(s_1, :) - Q^*(s_1, :), \quad \text{and} \quad \varepsilon^{t}(s_1, :) = \left(r(s_1, :) + \gamma\max_{a'} \hat{Q}^{t}(s_2, a')\right) - \hat{Q}^{t}(s_1, :),$$

where ϵt (s1 , :) (row 1) follows a Gumbel distribution, and εt (s1 , :) (row 2) follows
a Logistic distribution. While this toy example makes many simplifications
for illustrative purposes only, we will provide further validation in Section 8
on complex real-world environments, in which cases obtaining the optimal Q∗
values is generally impossible.
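A minimal simulation of Example 1 is sketched below. Here the error samples are collected across independent random initializations of the Q table, which is an assumption of this sketch; the exact protocol behind Figure 4 may differ.

import numpy as np

# Sketch of Example 1: 5 states, 5000 actions, reward 1 everywhere, each
# action moves s_i to s_{i+1} (0-based indices below). The last state is
# treated as terminal (no bootstrap), consistent with Q*(s_5) = 1.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, t_iters, runs = 5, 5000, 0.99, 2, 10_000

q_star_s1 = sum(gamma**k for k in range(n_states))   # Q*(s_1, a), identical for all a

eps, bellman = [], []
for _ in range(runs):
    q = rng.standard_normal((n_states, n_actions))   # Normal(0, 1) initialization
    for _ in range(t_iters):
        nxt = np.append(q[1:].max(axis=1), 0.0)      # max over next state's row
        q = np.tile(1.0 + gamma * nxt[:, None], (1, n_actions))   # Bellman iteration (5)
    eps.append(q[0, 0] - q_star_s1)                              # epsilon^t(s_1, a)
    bellman.append((1.0 + gamma * q[1].max()) - q[0, 0])         # varepsilon^t(s_1, a)
# Histograms of eps and bellman are roughly Gumbel- and Logistic-shaped,
# matching the qualitative picture in Figure 4.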

5. The Proportional Reward Scaling Phenomenon and Bellman Error

The proportional reward scaling [9] problem describes a saturation phe-


nomenon preventing the reward scaling factor from expanding infinitely in RL.

Figure 4: The distribution of ϵ(s, a) (row 1, in purple) and ε(s, a) (row 2, in blue)
in the first four iterations with a randomly initialized Q table. The former roughly
follow Gumbel distributions, and the latter follow Logistic distributions.

Gao et al. [46] explained this problem by policy gradient. This section demon-
strates a rational explanation for the reward scaling problem from the standpoint
of the distribution of Bellman errors. We will establish a natural connection be-
tween reward scaling and the expectation of the Logistic distribution.
We start by connecting the proportional reward scaling problem and the
Bellman error. Following Remark 2, we facilitate the analysis with the special
case of Theorem 1, where
$$\varepsilon^{\theta}(s, a) \sim \text{Logistic}\left(-\beta_\theta \ln \sum_{i=1}^{n} e^{\frac{r(s', a_i)}{\beta_\theta}},\; \beta_\theta\right).$$

We have shown in Section 4 that this distribution is biased, which poses a


challenge for approximation with a neural network. Alternatively, we seek to
transform it to support a nearly unbiased approximation. We explore whether
we can alleviate the bias in the Logistic distribution through reward scaling.
Specifically, we present the following theorem to guide scaling.

Theorem 2 (Positive Scaling upper bounds under Remark 2). Denote by r^+ and r^− the positive and negative rewards with r > 0 and r < 0, respectively. With i_1 + i_2 + i_3 = n, assume that:
$$\sum_{i=1}^{n} e^{\frac{r(s', a_i)}{\beta_\theta}} = \sum_{i=1}^{i_1} e^{\frac{r^+(s', a_i)}{\beta_\theta}} + \sum_{i=1}^{i_2} e^{\frac{r^-(s', a_i)}{\beta_\theta}} + i_3.$$
If it satisfies:

1. $i_1 \neq 0$,

2. $\sum_{i=1}^{i_1} e^{\frac{r^+(s', a_i)}{\beta_\theta}}\, r^+(s', a_i) + \sum_{i=1}^{i_2} e^{\frac{r^-(s', a_i)}{\beta_\theta}}\, r^-(s', a_i) < 0$,

then there exists an optimal scaling ratio φ* > 1, such that for any scaling ratio φ that can effectively reduce the expectation of the Bellman error, it must satisfy 1 ≤ φ ≤ φ*.

The proof of Theorem 2 can be found in Appendix A.7.


Theorem 2 offers valuable insights into reward scaling bounds and associated
phenomena. Specifically, under the conditions outlined in Theorem 2, appropri-
ate scaling can rectify the biased expectation of the Logistic distribution toward zero, thereby enhancing training performance significantly. However, the existence of the upper bound means that excessive scaling would revert the corrected expectation back to its previous state or a worse one. We provide an example
to further demonstrate this scenario.

Example 2. Consider different scaling factors φ in training the BipedalWalker-v3 environment. Figure 5 explores the expectation of the sampling error ε^θ(s, a). Let {a_j}_{j=1}^{5000} provide a sufficiently large action space for sampling. We find that there are always positive rewards in these 5000 samples, which satisfies the first condition in Theorem 2 that i_1 ≠ 0. Meanwhile, for β_θ = (0.5, 1, 2, 3),
$$\sum_{i=1}^{i_1} e^{\frac{r^+(s', a_i)}{\beta_\theta}}\, r^+(s', a_i) + \sum_{i=1}^{i_2} e^{\frac{r^-(s', a_i)}{\beta_\theta}}\, r^-(s', a_i) \approx (-88, -105, -109, -117).$$
In other words, the second condition in Theorem 2 holds for all the βs under investigation. Furthermore, the optimal φ* suggested in Theorem 2 can be observed from the figure. In each of the 4 scenarios in Figure 5, E[ε^θ(s, a)] reaches its upper bound halfway through increasing the scaling factor φ. Based on the empirical observation, we conclude that when the error variance is considerably small, a scaling ratio of 10 ∼ 50 is recommended. Note that this observation is consistent with the experimental results in [9].

Example 2 explains the existence of the upper bound on the scaling ratio and offers a distributional perspective on enhancing model performance during training.
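The mechanism behind Theorem 2 and Example 2 can be reproduced in a few lines. Under Remark 2, the bias of the Bellman error is −β_θ ln Σ_i e^{r_i/β_θ}; the sketch below scales an illustrative (mostly negative) reward sample by φ, holding β fixed, and locates the φ* beyond which further scaling worsens the bias again. The reward sample is synthetic, not drawn from BipedalWalker-v3.

import numpy as np
from scipy.special import logsumexp

# Sketch of the reward-scaling effect on the Logistic bias from Remark 2.
rng = np.random.default_rng(0)
beta = 1.0
rewards = rng.normal(loc=-0.2, scale=0.1, size=5000)  # mostly negative, a few positive

def bias(phi):
    """Expectation of the biased Logistic Bellman error after scaling r -> phi * r."""
    return -beta * logsumexp(phi * rewards / beta)

phis = np.linspace(1.0, 100.0, 500)
biases = np.array([bias(p) for p in phis])
phi_star = phis[np.argmax(biases)]
print(phi_star)  # beyond phi_star, scaling pushes the expectation away from zero again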

Figure 5: The change of E[εθ (s, a)] by assigning different βs. An optimal scaling ratio
φ∗ exists in all the scenarios.

6. Sampling strategy of Bellman Error with Logistic Distribution

This section delves into the considerations of batch size in neural networks,
building upon the theorem established in Section 4 that the Bellman error con-
forms to a Logistic distribution. As the direct application of tabular Q-Learning
proves inadequate for complex environments, the extension of established the-
orems, such as training a neural network, becomes crucial for gaining practical
significance. In this context, we explore the empirical choice of the batch size N
used in sampling Bellman errors for parameter updates. Our goal is to regulate
the error bound while maintaining computational efficiency. We employ the
Bias-Variance decomposition to analyze the sampling distribution and substan-
tiate the identification of a suitable N ∗ .
Firstly, we outline the problem we are addressing in this section.

Problem 1. Assume that we have a Logistic distribution denoted as Logistic(A, B).


We aim to draw points from this distribution to represent it. The more sampling
points we have, the more representative they are, and vice versa. The represen-
tativeness is measured using the sampling error Se in (16). Our objective is to
determine an appropriate batch size N ∗ that fits the Logistic distribution with
Se ≈ 1 × 10−6 .

To address this issue, we first define the empirical distribution function for sampling. For {x_1, x_2, ..., x_N} sampled from Logistic(A, B), the associated empirical distribution function for this sequence is
$$\hat{F}_N^{(x_1, x_2, ..., x_N)}(t) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_{x_i \le t}.$$

Figure 6: The differences between F(t) (true CDF of Logistic(0, 1)) and F̄(t) (expected empirical CDF) with varying sample sizes N.

Following Definition 2, we denote F (t), f (t) as the CDF and PDF of the Logistic(A, B)
(A replaces λ, and B replaces η). The sampling error Se reads:
$$S_e = \mathbb{E}_t\!\left[\mathbb{E}_{(x_1, x_2, ..., x_N)}\!\left[\left(\hat{F}_N^{(x_1, x_2, ..., x_N)}(t) - F(t)\right)^{2}\right]\right]. \quad (16)$$

We next define the Bias-Variance decomposition for (16) in Lemma 8.

Lemma 8. The sampling error S_e in (16) can be decomposed into Bias and Variance terms. If we define:
$$\bar{F}(t) = \mathbb{E}_{(x_1, x_2, ..., x_N)}\!\left[\hat{F}_N^{(x_1, x_2, ..., x_N)}(t)\right],$$
then
$$S_e = \mathbb{E}_t\left[\text{Variance}(t) + \text{Bias}(t)\right] = \text{Variance} + \text{Bias},$$
where
$$\text{Variance}(t) = \mathbb{E}_{(x_1, ..., x_N)}\!\left[\left(\hat{F}_N^{(x_1, ..., x_N)}(t)\right)^{2}\right] - \mathbb{E}^{2}_{(x_1, ..., x_N)}\!\left[\hat{F}_N^{(x_1, ..., x_N)}(t)\right], \qquad \text{Bias}(t) = \left(\bar{F}(t) - F(t)\right)^{2}.$$

We provide a detailed derivation of Lemma 8 in Appendix A.10.


The key to estimating the sampling error Se lies in calculating F (t). For
N = 1, we show that:

F (t) = Ex1 [F̂Nx1 (t)] = Ex1 [1x1 ≤t ] = 1E[x1 ]≤t = 1A≤t .

For N ≥ 2, considering order statistics becomes essential, since
$$\bar{F}(t) = \mathbb{E}_{(x_1, ..., x_N)}\!\left[\hat{F}_N^{(x_1, ..., x_N)}(t)\right] = \mathbb{E}_{(x_1, ..., x_N)}\!\left[\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_{x_i\le t}\right] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_{\mathbb{E}[x_{(i)}]\le t}.$$

Here x_{(i)} denotes the i-th order statistic. To find E[x_{(i)}] for each x_{(i)}, we perform piecewise segmentation [47] on the PDF of each x_{(i)}, which reads
$$f_{x_{(i)}}(t) = \frac{N!}{(i-1)!(N-i)!}\,(F(t))^{i-1}(1 - F(t))^{N-i} f(t).$$

Theorem 3 reveals the method for computing the expectation of order statistics
under the Logistic distribution.

Theorem 3 (The expectation of order statistics for the Logistic distribution).
$$\mathbb{E}[x_{(i)}] = B\left[\sum_{k=1}^{i-1}\frac{1}{k} - \sum_{k=1}^{N-i}\frac{1}{k}\right] + A.$$

The proof of Theorem 3 can be found in Appendix A.11.


With Theorem 3, we can directly calculate each E[x_{(i)}] analytically. Figure 6 compares F̄(t) with F(t) under Logistic(0, 1) with sample sizes N = (2, 4, 8, 16), with F̄(t) calculated by following Theorem 3. We show that increasing N leads to a more accurate estimation.
Note that the empirical distribution is a type of step function, i.e.,
$$\mathbb{E}_{(x_1, ..., x_N)}\!\left[\left(\hat{F}_N^{(x_1, ..., x_N)}(t)\right)^{2}\right] = \mathbb{E}_{(x_1, ..., x_N)}\!\left[\left(\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_{x_i\le t}\right)^{2}\right] = \sum_{i=1}^{N-1}\left(\frac{i}{N}\right)^{2}\mathbb{1}_{\left(\mathbb{E}[x_{(i)}]\le t\le \mathbb{E}[x_{(i+1)}]\right)} + \mathbb{1}_{t>\mathbb{E}[x_{(N)}]},$$
$$\mathbb{E}^{2}_{(x_1, ..., x_N)}\!\left[\hat{F}_N^{(x_1, ..., x_N)}(t)\right] = \left(\sum_{i=1}^{N}\frac{1}{N}\mathbb{1}_{\mathbb{E}[x_{(i)}]\le t}\right)^{2} = \mathbb{E}_{(x_1, ..., x_N)}\!\left[\left(\hat{F}_N^{(x_1, ..., x_N)}(t)\right)^{2}\right].$$

Consequently, the Variance of this empirical distribution is zero, leading to the


estimation of the bias term as the sole concern in approaching S_e. Suppose t is uniformly sampled from the range $\left[\mathbb{E}[x_{(1)}], \mathbb{E}[x_{(N)}]\right]$; then
$$S_e = \mathbb{E}_t[\text{Bias}(t)] = \frac{1}{\mathbb{E}[x_{(N)}] - \mathbb{E}[x_{(1)}]}\left[\sum_{i=1}^{N-1}\int_{\mathbb{E}[x_{(i)}]}^{\mathbb{E}[x_{(i+1)}]}\left(F(t) - \frac{i}{N}\right)^{2} dt\right], \quad (17)$$

where E[x_{(i)}] follows the definition in Theorem 3 with a fixed size N. We can
then obtain the upper and lower limits of the integral in (17), leading to a direct

Table 1: The relationship between the sample size N and sampling error Se using (17).

N    | 2      | 4      | 8      | 16     | 32     | 64     | 128    | 256
S_e  | 2×10⁻² | 4×10⁻³ | 1×10⁻³ | 3×10⁻⁴ | 8×10⁻⁵ | 2×10⁻⁵ | 5×10⁻⁶ | 1×10⁻⁶

computation of E[x(i) ] in Se 2 . Table 1 reports the associated Se with respect to


varying N s, where the sampling error keeps reducing with an increased sample
size N . Notably, the errors are related to N and are independent of both
parameters A and B. Moreover, a moderate batch size of 256 is sufficient for
achieving a small Se ≈ 1 × 10−6 .
The results verify that assuming a Logistic distribution for the Bellman
error facilitates the determination of an appropriate batch size for training in
RL. Rather than being driven by performance metrics, the batch size can be
selected based on the precision requirements.
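The entries of Table 1 can be recomputed directly from Theorem 3 and (17); the sketch below uses numerical quadrature in place of the symbolic integration mentioned in the footnote.

import numpy as np
from scipy import integrate
from scipy.stats import logistic

# Sketch reproducing Table 1: order-statistic means via Theorem 3 and the
# sampling error S_e via (17). A and B are the Logistic location and scale
# (the result is independent of them).
A, B = 0.0, 1.0

def order_stat_mean(i, N):
    """E[x_(i)] = B * (sum_{k=1}^{i-1} 1/k - sum_{k=1}^{N-i} 1/k) + A (Theorem 3)."""
    h = lambda m: sum(1.0 / k for k in range(1, m + 1))
    return B * (h(i - 1) - h(N - i)) + A

def sampling_error(N):
    """S_e from (17): integrate (F(t) - i/N)^2 between consecutive E[x_(i)]."""
    m = [order_stat_mean(i, N) for i in range(1, N + 1)]
    F = logistic(loc=A, scale=B).cdf
    total = sum(
        integrate.quad(lambda t, i=i: (F(t) - i / N) ** 2, m[i - 1], m[i])[0]
        for i in range(1, N)
    )
    return total / (m[-1] - m[0])

for N in (2, 4, 8, 16, 32, 64, 128, 256):
    print(N, sampling_error(N))   # matches the magnitudes reported in Table 1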

7. Logistic Likelihood Q-Learning

According to Section 4-Section 6. We have justified the rationality of mod-


eling the Bellman error with the Logistic distribution instead of the Normal
distribution. Based on this, we propose training a neural network with the Lo-
gistic maximum likelihood loss function in place of the conventional MSE-based
loss function.
As mentioned at the beginning of Section 4, we need to verify that the sam-
ples from the same training batch are independently and identically distributed
(i.i.d). Clearly, independence has been established by our theory, but they are
distributed in distinct locations, which complicates batch-wise network updates.
To facilitate the definition of the loss function, we consider the samples extracted
from a batch to be i.i.d.

2 We employ symbolic integration with the built-in ‘int’ function in MATLAB. Instead
of using numerical integration techniques, we leverage the indefinite integral to achieve a direct
numerical result.

As the typical choice in deep RL networks for Q-updating, MSELoss is
based on the assumption that the estimation error follows a Normal distri-
bution Normal(0, σ). MSELoss is derived from the maximum likelihood estimation function: if we sample n values from the Bellman error and treat them as ε_i (i = 1, 2, ..., n), then we have the following log-likelihood function for the Normal distribution:
$$\log\left[\prod_{i=1}^{n} p(\varepsilon_i)\right] = -n\log(\sqrt{2\pi}\sigma) - \sum_{i=1}^{n}\frac{\varepsilon_i^{2}}{2\sigma^{2}} \propto -\sum_{i=1}^{n}\frac{1}{2}(\varepsilon_i)^{2}. \quad (18)$$

In Section 4.1, we have deduced that the Bellman error should follow a
biased Logistic distribution. Estimating the expectation for this distribution is
not straightforward for a neural network. We assume the Bellman error follows
Logistic(µ, σ) and derive the associated likelihood function as a replacement for
MSELoss.
We start from the PDF of εi ∼ Logistic(µ, σ), which reads:
$$p(\varepsilon_i) = \frac{1}{\sigma}\,\frac{e^{\frac{-\varepsilon_i+\mu}{\sigma}}}{\left(1 + e^{\frac{-\varepsilon_i+\mu}{\sigma}}\right)^{2}}. \quad (19)$$

By employing the log-likelihood function, we have
$$\log\left[\prod_{i=1}^{n} p(\varepsilon_i)\right] = -n\log(\sigma) + \sum_{i=1}^{n}\left[-\frac{(\varepsilon_i - \mu)}{\sigma} - 2\log\left(1 + e^{\frac{-\varepsilon_i+\mu}{\sigma}}\right)\right]. \quad (20)$$

This returns the Logistic Loss function (LLoss), i.e.,
$$\text{LLoss}(\mu, \sigma) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{\varepsilon_i - \mu}{\sigma} + 2\log\left(1 + e^{\frac{-\varepsilon_i+\mu}{\sigma}}\right)\right]. \quad (21)$$
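A minimal PyTorch sketch of (21) is given below; softplus(−z) = log(1 + e^{−z}) provides a numerically stable form of the second term, and µ = 0 corresponds to the setting used in the experiments. The function name and signature are illustrative assumptions.

import torch
import torch.nn.functional as F

def lloss(errors: torch.Tensor, mu: float = 0.0, sigma: float = 1.0) -> torch.Tensor:
    """Logistic maximum-likelihood loss (21) over a batch of Bellman errors."""
    z = (errors - mu) / sigma
    return (z + 2.0 * F.softplus(-z)).mean()

# Example: for small errors the loss is close to ln 4 (cf. Theorem 4).
print(lloss(torch.tensor([0.01, -0.02, 0.005])))

In the Q-network update of Algorithm 1, this loss would simply take the place of the MSE term, applied to the Bellman-error samples of the current batch (e.g., those collected as in the earlier sketch).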

We have demonstrated in Figure 1 (also see Appendix B.1 for additional vi-
sualizations) that the distribution of Bellman error evolves along training steps
and exhibits a stronger fit to the logistic distribution. Figure 7 further com-
pares the closeness of empirical Bellman error to Logistic, Normal, and Gum-
bel distributions. In all four environments, the Logistic distribution performs
better in fitting the empirical Bellman error. In addition to these visualized
comparisons, numerical evaluations will be provided later in Table 7-8 with
Kolmogorov–Smirnov (KS) statistic magnitudes [48].
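For reference, the KS comparison reported in Tables 7-8 can be reproduced along the following lines; the Bellman-error sample here is a synthetic placeholder, whereas in practice it would be collected during training.

import numpy as np
from scipy import stats

# Fit Logistic, Normal, and Gumbel distributions to a sample of Bellman
# errors and compare the KS statistics (smaller = better fit).
rng = np.random.default_rng(0)
errors = rng.logistic(loc=-0.3, scale=0.8, size=10_000)

for name, dist in [("logistic", stats.logistic),
                   ("normal", stats.norm),
                   ("gumbel", stats.gumbel_r)]:
    params = dist.fit(errors)
    stat = stats.kstest(errors, dist(*params).cdf).statistic
    print(f"{name}: KS statistic = {stat:.4f}")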

Figure 7: The distribution of Bellman error. For the two online and two offline environments, the Bellman errors fit better to the Logistic distribution than to the Gumbel and Normal distributions. More details are provided in Tables 7-8.

We present the updating method for the Q network under LLoss for Deep-Q-Network (DQN) [1] and SAC in Algorithm 1; we omit the algorithm's main
body and solely present the improved sections. The omitted portions of the
algorithm are identical to the original DQN and SAC.
It is noteworthy that MSELoss and LLoss are strongly correlated when we
take µ = 0, with LLoss serving as a corrective function for MSELoss. The
following Theorem 4 reveals the relationship between MSELoss and LLoss when
ε is sufficiently small. In Section 8, we observe that LLoss with µ = 0 already surpasses the performance of MSELoss.

Theorem 4 (Relationship between LLoss and MSELoss). The MSELoss can be used as an approximate estimation of LLoss(0, σ) when ε is sufficiently small, i.e.,
$$\text{LLoss}(0, \sigma) = \ln 4 + \frac{1}{2}\,\text{MSELoss} + o(\varepsilon^{3}),$$
where o(ε³) denotes a third-order infinitesimal of ε when ε is sufficiently small.
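A quick numerical sanity check of this relationship is sketched below, assuming σ = 1 and taking MSELoss per the convention of (18), i.e. the mean of ε²/2.

import numpy as np

# For small Bellman errors, LLoss(0, 1) is approximately ln 4 + 0.5 * MSELoss.
rng = np.random.default_rng(0)
eps = 0.01 * rng.standard_normal(100_000)    # small errors

lloss = np.mean(eps + 2.0 * np.log1p(np.exp(-eps)))
mse = np.mean(0.5 * eps**2)
print(lloss, np.log(4) + 0.5 * mse)          # the two values agree to high precision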

Theorem 4 demonstrates that when µ = 0, the MSELoss can be regarded as an approximate form of LLoss when higher-order terms are neglected. The proof is in Appendix A.9.
Algorithm 1 The updating method for the Q network in DQN and SAC.
Initialization: Q_θ with random weights, V_φ with random weights (for SAC);
Initialization: Time step T, total episodes M, learning rate lr;
Initialization: Location parameter µ, scale parameter σ, scaling factor h;
for episode ← 1 to M do
    for t ← 1 to T do
        ... (these steps are the same as in DQN/SAC)
        Use (14) to calculate each ε_i (for DQN);
        Calculate ε_i = r(s_i, a_i) + γV_φ(s'_i) − Q_θ(s_i, a_i) (for SAC);
        Update θ ← θ − lr ∇_θ LLoss(µ, σ, θ);
        ... (these steps are the same as in DQN/SAC)
    end
    σ ← σ × h^{episode+1}
end


Summary. Through the analysis in Sections 4-7, we concluded that it is more


theoretically sound to model the Bellman error with the Logistic distribution
in comparison to the Normal or Gumbel distribution. Meanwhile, modeling the Bellman error with the Logistic distribution naturally addresses the proportional reward scaling problem, a relationship that cannot be revealed by assuming a Normal distribution. In the next section, we validate the
effectiveness of our theory through experiments.

8. Experiment

This section conducts empirical evaluations on widely assessed online and


offline environments for validating the effectiveness of adopting LLoss in prac-
tice. In Section 8.1, we elucidate the experimental setups. In Section 8.2, we
analyze our model performance on 17 online and offline environments (8 online

Table 2: Hyperparameter settings for online training.

Environment                  | SAC (σ) | CQL (σ) | TAU   | Scaling | max. Step
LunarLanderContinuous-v2     | 10      | 10      | 0.005 | 1       | 200
HalfCheetah-v2               | 10      | 10      | 0.005 | 1       | 200
Hopper-v4                    | 10      | 10      | 0.005 | 1       | 200
Walker2d-v2                  | 10      | 10      | 0.005 | 1       | 200
HumanoidStandup-v4           | 90      | 60      | 0.005 | 1       | 100
InvertedPendulum-v4          | 10      | 10      | 0.005 | 1       | 1000
InvertedDoublePendulum-v2    | 10      | 10      | 0.005 | 1       | 1000
BipedalWalker-v3             | 20      | 20      | 0.005 | 50      | 200

Table 3: Hyperparameter settings for offline training.

Environment                    | IQL (σ) | Eval steps | Train steps | Expl. steps | max. Step
hopper-medium-v2               | 10      | 100        | 100         | 100         | 100
walker2d-medium-v2             | 3       | 100        | 100         | 100         | 100
halfcheetah-medium-v2          | 3       | 100        | 100         | 100         | 100
hopper-medium-replay-v2        | 3       | 100        | 100         | 100         | 100
walker2d-medium-replay-v2      | 20      | 100        | 100         | 100         | 100
halfcheetah-medium-replay-v2   | 20      | 100        | 100         | 100         | 100
hopper-medium-expert-v2        | 10      | 100        | 100         | 100         | 100
halfcheetah-medium-expert-v2   | 3       | 100        | 100         | 100         | 100
walker2d-medium-expert-v2      | 5       | 100        | 100         | 100         | 100

and 9 offline). In Section 8.3, we perform additional investigations on our pro-


posed method, including the Kolmogorov-Smirnov (KS) test on the distribution
of Bellman error and other ablation studies.

8.1. Experiment Protocol

First, we introduce our basic experimental environment. We conducted our


experiments using gym (ver.0.23.1), mujoco (ver.2.3.7), and D4RL (ver.1.1).
For online RL tasks, training was carried out over 160,000 iterations across 8
575 gym environments. Due to the training simplicity in offline RL, models trained
for offline tasks underwent up to 500 iterations across 9 D4RL environments
[49]. In both online and offline scenarios, training is stopped after 50 non-
improving steps.
Next, we provide the configurations. Following the analysis in Section 6, we

Figure 8: The average reward of SAC, LSAC, and XQL in online training.

Figure 9: The average reward of CQL, LCQL in online training.

set the batch size to 256, µ = 0, and scaling factor h = 0.999 for both online and
offline RL. The decision to set µ as 0 is grounded in our experimental findings,
where the utilization of an LLoss model with µ = 0 demonstrated a significantly
superior performance compared to models using MSELoss. Tables 2 and 3 report
the details of initializations for both online and offline training, respectively. For
online RL, we validate the improvement of employing LLoss on SAC [9] and CQL
[13]. Consequently, we specify the associated σ initialization. For unspecified
settings, we adhere to the default setup in [9]. Similarly, in Table 3, we report
the σ initialization for IQL [16]. One point that needs special emphasis is that the “Expl. steps” in Table 3 specify the number of task-agnostic environment steps for the agent. All the programs are sourced from rlkit^3.

3 https://github.com/rail-berkeley/rlkit

8.2. Results Analysis

Online RL. We made improvements based on the official implementation of


SAC 4 and CQL 5 . We replaced MSELoss with LLoss to observe the performance
improvement. The enhanced methods with LLoss are referred to as LSAC and
LCQL, respectively. During training, we fixed the learning rate to 3×10⁻⁴, the
discount factor γ to 0.99. To guarantee that the performance enhancement is
completely attributed to the modification on the loss function, all other initial-
izations are kept identical for MSELoss and LLoss variants. In the comparison
with XQL, we fine-tuned XQL based on the β range proposed by its authors.
The purpose of comparing with XQL is to assess the correctness of Gumbel
distribution versus Logistic distribution. The purpose of comparing with SAC
is to evaluate the correctness of Normal distribution versus Logistic distribu-
tion. In our setting, most of the environments were run with a maximum of 200 steps. We trained each environment for 160,000 iterations. Based on the rolling epoch timeline, we plotted and stored the average reward every 2000 epochs. It can be observed from Figure 8 and Figure 9 that, compared to
MSELoss, LLoss demonstrates more prominent performance in both SAC and
CQL. The detailed results for the above figures are available in Table 5. The
enhancement in Table 5 is calculated as follows:
For the i-th environment, denote R^i_{Model} and R^i_{LModel} as the average reward obtained by the baseline Model (SAC, CQL) and our variant (LSAC, LCQL), respectively. The online enhancement in Table 5 is defined as:
$$\text{Enhancement}_{\text{online}}(i) = \frac{R^i_{\text{LModel}} - R^i_{\text{Model}}}{R^i_{\text{Model}}}.$$

Analysis of Maximum Rewards in Online RL. As mentioned in Sec-


tion 4.1, one significant advantage of the Logistic distribution is its ability to
expedite training in the early stages, implying a faster training rate compared

4 https://github.com/haarnoja/sac
5 https://github.com/aviralkumar2907/CQL

Figure 10: The average reward of IQL and LIQL in offline training.

to the Normal distribution. We therefore analyze when these environments under online training reach their maximum reward values and what those maximum reward values are, as shown in Table 4. From the results, we can see that LLoss can accelerate the attainment of the maximum reward to some extent and achieve a better maximum reward.

Offline RL. As SAC is not suitable for offline training, we conducted improved
experiments based on the IQL components. We set the maximum iteration count
as 500 and incorporated a variance threshold of 5 to determine convergence for
50 epochs. The reason we use so few epochs is that we greatly reduce the
difficulty of each task. In other words, we set a maximum step size for the agent
instead of letting it run to the optimal solution. The method of controlling
variables is the same as in the online setting. Our algorithm is referred to
625 as LIQL. Due to some dimensional discrepancies between the IQL algorithm
provided by rlkit and the IQL algorithm, we use the improvement ratio relative
to the IQL baseline as the measure of algorithm performance. The change in
the average reward during training is depicted in Figure 10, and relevant details
are presented in Table 6. The results also indicate that our model exhibits the

Table 4: The maximum reward of online training over 10 random repetitions with
parentheses reporting the number of epochs (in hundreds) to achieve the results.

SAC CQL XQL LSAC (Ours) LCQL (Ours)

LunarLander-Continuous-v2 194.85 (900) 154.02 (1350) 211.90 (340) 221.75 (110) 156.72 (220)
HalfCheetah-v2 847.20 (1520) 739.38 (1500) 835.24 (1530) 856.14 (1300) 761.77 (1420)
Hopper-v4 628.20 (1510) 616.30 (430) 618.32 (1200) 635.09 (1100) 594.98 (140)
Walker2d-v2 427.70 (1340) 360.42 (1570) 327.14 (340) 465.08 (1270) 387.23 (1210)
HumanoidStandup-v4 15142.51 (1590) 15209.97 (1550) 13032.94 (1280) 23771.01 (760) 15487.68 (1230)
InvertedPendulum-v4 1001.00 (210) 1001.00 (280) 1001.00 (270) 1001.00 (190) 1001.00 (280)
InvertedDouble-Pendulum-v2 9359.82 (380) 9361.33 (510) 9360.56 (540) 9362.28 (380) 9363.40 (500)
BipedalWalker-v3 79.05 (1490) 80.11 (1560) 81.10 (1540) 82.53 (1330) 83.77 (1410)

Table 5: Average reward of online training over 10 random repetitions. The red values
indicate the enhancement of LLoss over its MSELoss counterparts.

SAC CQL XQL LSAC (Ours) LCQL (Ours)

LunarLander-Continuous-v2 19.99 104.15 -489.19 133.95 (570.09 %) 112.57 (8.08 %)


HalfCheetah-v2 696.96 653.62 684.96 714.54 (2.52 %) 675.33 (3.32 %)
Hopper-v4 509.47 495.34 487.08 544.72 (6.92 % ) 515.30 (4.03 % )
Walker2d-v2 221.46 194.59 3.42 251.63 (13.62 % ) 219.27 (12.68 % )
HumanoidStandup-v4 14,157.95 14,166.01 8,030.26 16,781.59 (18.53 % ) 14,258.06 (0.65 % )
InvertedPendulum-v4 1001.00 1001.00 1001.00 1001.00 (0.00 %) 1001.00 (0.00 %)
InvertedDouble-Pendulum-v2 8466.48 4295.46 3290.36 8941.93 (5.62 %) 4647.64 (8.20 % )
BipedalWalker-v3 68.69 43.91 64.56 77.59 (12.96 %) 71.96 (63.88 %)

avg. enhancement 78.78 % 12.61 %

highest enhancement ratio. The enhancement in Table 6 is calculated as follows:
for the i-th environment, denote R_IQL^i and R_Model^i as the average reward
obtained by the IQL baseline and by another model (LIQL, XQL, CQL, TD3+BC,
one-step RL), respectively. The offline enhancement in Table 6 is defined as

    Enhancement_offline(i) = (R_Model^i − R_IQL^i) / R_IQL^i.

8.3. KS tests and Ablation Study

KS Tests. The Kolmogorov-Smirnov (KS) tests, introduced by [48], are em-


ployed to examine whether data conforms to a particular distribution. We

Table 6: The average reward and enhancement ratio after offline training; all enhancement
ratios are computed relative to the IQL baseline.

                                 avg. reward             enhancement over IQL (%)
                                 IQL       LIQL (ours)   LIQL    XQL     CQL      TD3+BC   one-step RL
hopper-v2 (medium)               228.19    240.71        5.49    7.24    -11.77   -10.56   2.11
walker2d-v2 (medium)             138.42    161.26        16.50   4.09    -7.41    6.89     -10.11
halfcheetah-v2 (medium)          319.41    335.44        5.02    0.63    -7.18    1.89     4.47
hopper-v2 (replay)               243.92    264.30        8.35    1.36    2.94     0.90     -13.80
walker2d-v2 (replay)             153.09    176.44        15.25   2.71    4.46     10.69    2.96
halfcheetah-v2 (replay)          215.28    221.31        2.80    2.75    0.31     -35.70   -33.02
hopper-v2 (expert)               224.49    237.38        5.74    17.05   15.19    7.10     7.73
walker2d-v2 (expert)             261.75    270.15        3.21    0.46    -0.73    0.45     4.61
halfcheetah-v2 (expert)          305.14    347.30        13.82   3.58    5.65     4.61     12.90
avg. enhancement                                         8.46    4.43    0.16     -1.53    3.10

conducted KS tests on the Bellman error for each environment, reporting the R²
statistic and other goodness-of-fit measures alongside the KS statistic. The test
results are presented in Table 7 and Table 8. The KS test results indicate that our
assumption of the Logistic distribution is more accurate than the other two distributions.

Sensitivity Analysis. We conducted a sensitivity analysis on the variation
of σ across different environments in both online and offline settings, applying
proportional σ variations in each environment and observing the changes in the
final average reward and the maximum average reward. Representative results are
shown in Figure 11, and the remaining results are deferred to Appendix B.4. The
results indicate that, within a certain range of σ variations, our approach
outperforms MSELoss and exhibits a certain level of robustness.

9. Conclusion and Future Direction

In this research, we discussed different formulations of Bellman error from a


distributional perspective. By assuming the Logistic distribution for the Bell-

Table 7: The fitness and KS tests of Bellman-error for online RL.

R2 ↑ SSE (×10−4 ) ↓ RMSE (×10−4 ) ↓ KS statistic ↓

Logistic Gumbel Normal Logistic Gumbel Normal Logistic Gumbel Normal Logistic Gumbel Normal

LunarLanderContinuous-v2 0.985 0.971 0.975 1.119 2.224 1.902 7.555 10.653 9.850 0.052 0.070 0.071
HalfCheetah-v2 0.991 0.990 0.989 1.344 1.425 1.549 8.282 8.405 8.888 0.026 0.047 0.033
Hopper-v4 0.989 0.985 0.981 2.697 3.793 4.807 11.734 13.912 15.661 0.067 0.073 0.085
Walker2d-v2 0.988 0.967 0.975 0.900 2.549 1.903 6.778 11.404 9.854 0.054 0.084 0.072
HumanoidStandup-v4 0.667 0.641 0.628 27.164 29.279 30.318 37.228 38.652 39.331 0.269 0.322 0.291
InvertedPendulum-v4 0.983 0.963 0.971 20.961 46.307 35.772 32.702 48.606 42.721 0.115 0.175 0.117
InvertedDoublePendulum-v4 0.999 0.981 0.998 0.249 5.623 0.324 3.959 16.938 4.063 0.021 0.079 0.023
BipedalWalker-v3 0.997 0.979 0.990 1.206 7.888 3.563 7.843 20.061 13.482 0.039 0.101 0.057

Table 8: The fitness and KS tests of Bellman-error for offline RL.

R2 ↑ SSE (×10−4 ) ↓ RMSE (×10−4 ) ↓ KS statistic ↓

Logistic Gumbel Normal Logistic Gumbel Normal Logistic Gumbel Normal Logistic Gumbel Normal

hopper-medium-v2 0.981 0.975 0.976 10.175 13.975 13.191 29.617 34.709 33.722 0.040 0.094 0.053
walker2d-medium-v2 0.927 0.923 0.913 15.714 15.915 18.563 36.801 37.129 40.003 0.062 0.072 0.076
halfcheetah-medium-v2 0.836 0.831 0.833 9.087 9.394 9.223 12.314 12.941 12.444 0.050 0.052 0.069
halfcheetah-medium-replay-v2 0.852 0.813 0.836 12.864 16.242 14.221 33.301 37.416 35.012 0.031 0.105 0.048
walker2d-medium-replay-v2 0.950 0.908 0.937 10.514 19.256 13.262 30.101 40.743 33.812 0.075 0.149 0.093
hopper-medium-replay-v2 0.954 0.927 0.948 10.773 17.278 12.265 30.475 38.594 32.516 0.038 0.104 0.049
hopper-medium-expert-v2 0.985 0.970 0.982 8.054 17.097 10.162 26.351 38.391 29.598 0.056 0.105 0.059
walker2d-medium-expert-v2 0.981 0.959 0.973 11.186 23.807 15.584 31.053 45.302 36.653 0.067 0.138 0.075
halfcheetah-medium-expert-v2 0.919 0.869 0.913 12.722 20.568 13.719 33.117 42.108 34.391 0.036 0.098 0.045

man error and integrating the Logistic maximum likelihood function into the
associated loss function, we observed enhanced training efficacy in both online
and offline RL, marking a departure from the typical use of Normal or Gumbel
distributions. Our theory's validity is substantiated by rigorous analysis and
proofs, as well as empirical evaluations. Moreover, we naturally connect the
Bellman error distribution with the reward scaling problem and propose a
sampling scheme based on this distribution for controlling the error limit.

While we have introduced a novel avenue for improving RL optimization
focusing on the Bellman error, there remain compelling future directions for
exploration. For example, extending our analysis beyond the Bellman iterative
equation to include soft Bellman iterations could offer further insights. The
formulation of the state transition function might also benefit from a linear
combination of Gumbel distributions. Moreover, exploring innovative methods for
learning from an unknown biased distribution could be another promising direction,
aligning with the inherently biased nature of the distribution of Bellman

Figure 11: The relationship between the variation of σ and the maximum average
reward/average reward in 4 environments (2 online and 2 offline).

error.

References

665 [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wier-


stra, M. Riedmiller, Playing atari with deep reinforcement learning,
arXiv:1312.5602 (2013).

[2] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell,


K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al., Model-
670 based reinforcement learning for atari, arXiv:1903.00374 (2019).

[3] W. Qi, H. Fan, H. R. Karimi, H. Su, An adaptive reinforcement learning-


based multimodal data fusion framework for human–robot confrontation
gaming, Neural Networks 164 (2023) 489–496.

[4] Y.-D. Kwon, J. Choo, B. Kim, I. Yoon, Y. Gwon, S. Min, Pomo: Policy
675 optimization with multiple optima for reinforcement learning, Advances in
Neural Information Processing Systems 33 (2020) 21188–21198.

[5] A. Hottung, Y.-D. Kwon, K. Tierney, Efficient active search for combina-
torial optimization problems, arXiv:2106.05126 (2021).

[6] J. Bi, Y. Ma, J. Wang, Z. Cao, J. Chen, Y. Sun, Y. M. Chee, Learning gen-
680 eralizable models for vehicle routing problems via knowledge distillation,
arXiv:2210.07686 (2022).

[7] R. Bellman, The theory of dynamic programming, Bulletin of the American


Mathematical Society 60 (6) (1954) 503–515.

[8] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy


685 maximum entropy deep reinforcement learning with a stochastic actor, in:
International conference on machine learning, PMLR, 2018, pp. 1861–1870.

[9] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Ku-


mar, H. Zhu, A. Gupta, P. Abbeel, et al., Soft actor-critic algorithms and
applications, arXiv:1812.05905 (2018).

690 [10] P. Christodoulou, Soft actor-critic for discrete action settings,


arXiv:1910.07207 (2019).

[11] P. N. Ward, A. Smofsky, A. J. Bose, Improving exploration in soft-actor-


critic with normalizing flows policies, arXiv:1906.02771 (2019).

[12] J. Pan, J. Huang, G. Cheng, Y. Zeng, Reinforcement learning for auto-


695 matic quadrilateral mesh generation: A soft actor–critic approach, Neural
Networks 157 (2023) 288–304.

[13] A. Kumar, A. Zhou, G. Tucker, S. Levine, Conservative q-learning for


offline reinforcement learning, Advances in Neural Information Processing
Systems 33 (2020) 1179–1191.

700 [14] Ö. Z. Bayramoğlu, E. Erzin, T. M. Sezgin, Y. Yemez, Engagement re-


warded actor-critic with conservative q-learning for speech-driven laughter
backchannel generation, in: Proceedings of the 2021 International Confer-
ence on Multimodal Interaction, 2021, pp. 613–618.

[15] J. Lyu, X. Ma, X. Li, Z. Lu, Mildly conservative q-learning for offline
705 reinforcement learning, arXiv:2206.04745 (2022).

[16] I. Kostrikov, A. Nair, S. Levine, Offline reinforcement learning with implicit
q-learning, arXiv:2110.06169 (2021).

[17] D. Garg, J. Hejna, M. Geist, S. Ermon, Extreme q-learning: Maxent rl


without entropy, arXiv:2301.02328 (2023).

710 [18] L. Baird, Residual algorithms: Reinforcement learning with function ap-
proximation, in: Machine Learning Proceedings 1995, Elsevier, 1995, pp.
30–37.

[19] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger,


Deep reinforcement learning that matters, in: Proceedings of the AAAI
715 conference on artificial intelligence, Vol. 32, 2018.

[20] Z. Zhang, S. Zohren, S. Roberts, Deep reinforcement learning for trading,
The Journal of Financial Data Science (2020).

[21] P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, S. Levine,


Idql: Implicit q-learning as an actor-critic method with diffusion policies,
720 arXiv:2304.10573 (2023).

[22] J. Hejna, D. Sadigh, Inverse preference learning: Preference-based rl with-


out a reward function, arXiv:2305.15363 (2023).

[23] J. Li, X. Hu, H. Xu, J. Liu, X. Zhan, Y.-Q. Zhang, Proto: Iterative
policy regularized offline-to-online reinforcement learning, arXiv preprint
725 arXiv:2305.15669 (2023).

[24] J. Bas-Serrano, S. Curi, A. Krause, G. Neu, Logistic q-learning, in: Inter-


national Conference on Artificial Intelligence and Statistics, PMLR, 2021,
pp. 3610–3618.

[25] S. Fujimoto, D. Meger, D. Precup, O. Nachum, S. S. Gu, Why should


730 i trust you, bellman? the bellman error is a poor replacement for value
error, arXiv:2201.12417 (2022).

[26] F. Lu, P. G. Mehta, S. P. Meyn, G. Neu, Convex q-learning, in: 2021
American Control Conference (ACC), IEEE, 2021, pp. 4749–4756.

[27] F. Lu, P. G. Mehta, S. P. Meyn, G. Neu, Convex analytic theory for convex
735 q-learning, in: 2022 IEEE 61st Conference on Decision and Control (CDC),
IEEE, 2022, pp. 4065–4071.

[28] J. P. Zitovsky, D. De Marchi, R. Agarwal, M. R. Kosorok, Revisiting bell-


man errors for offline model selection, in: International Conference on Ma-
chine Learning, PMLR, 2023, pp. 43369–43406.

740 [29] M. Qian, S. Mitsch, Reward shaping from hybrid systems models in re-
inforcement learning, in: NASA Formal Methods Symposium, Springer,
2023, pp. 122–139.

[30] A. Gupta, A. Pacchiano, Y. Zhai, S. Kakade, S. Levine, Unpacking re-


ward shaping: Understanding the benefits of reward engineering on sample
745 complexity, Advances in Neural Information Processing Systems 35 (2022)
15281–15295.

[31] M. Palan, N. C. Landolfi, G. Shevchuk, D. Sadigh, Learning reward func-


tions by integrating human demonstrations and preferences, arXiv preprint
arXiv:1906.08928 (2019).

750 [32] E. Bıyık, N. Huynh, M. J. Kochenderfer, D. Sadigh, Active preference-


based gaussian process regression for reward learning, arXiv preprint
arXiv:2005.02575 (2020).

[33] V. Konda, J. Tsitsiklis, Actor-critic algorithms, Advances in neural infor-


mation processing systems 12 (1999).

755 [34] G. Neu, A. Jonsson, V. Gómez, A unified view of entropy-regularized


markov decision processes, arXiv:1705.07798 (2017).

[35] C. J. Watkins, P. Dayan, Q-learning, Machine learning 8 (1992) 279–292.

[36] D. P. Bertsekas, Constrained optimization and Lagrange multiplier meth-
ods, Academic press, 2014.

760 [37] K. Doya, K. Samejima, K.-i. Katagiri, M. Kawato, Multiple model-based


reinforcement learning, Neural computation 14 (6) (2002) 1347–1369.

[38] Y. Chandak, G. Theocharous, J. Kostas, S. Jordan, P. Thomas, Learning


action representations for reinforcement learning, in: International confer-
ence on machine learning, PMLR, 2019, pp. 941–950.

765 [39] M. L. Littman, An optimization-based categorization of reinforcement


learning environments, From animals to animats 2 (1993) 262–270.

[40] B. M. Méndez-Hernández, E. D. Rodrı́guez-Bazan, Y. Martinez-Jimenez,


P. Libin, A. Nowé, A multi-objective reinforcement learning algorithm for
jssp, in: Artificial Neural Networks and Machine Learning–ICANN 2019:
770 Theoretical Neural Computation: 28th International Conference on Artifi-
cial Neural Networks, Munich, Germany, September 17–19, 2019, Proceed-
ings, Part I 28, Springer, 2019, pp. 567–584.

[41] V. François-Lavet, R. Fonteneau, D. Ernst, How to discount deep re-


inforcement learning: Towards new dynamic strategies, arXiv preprint
775 arXiv:1512.02011 (2015).

[42] R. Amit, R. Meir, K. Ciosek, Discount factor as a regularizer in reinforce-


ment learning, in: International conference on machine learning, PMLR,
2020, pp. 269–278.

[43] R. A. Fisher, L. H. C. Tippett, Limiting forms of the frequency distribution


780 of the largest or smallest member of a sample, in: Mathematical proceed-
ings of the Cambridge philosophical society, Vol. 24, Cambridge University
Press, 1928, pp. 180–190.

[44] F. J. Marques, C. A. Coelho, M. De Carvalho, On the distribution of lin-


ear combinations of independent gumbel random variables, Statistics and
785 Computing 25 (2015) 683–701.

[45] L. Zarfaty, E. Barkai, D. A. Kessler, Accurately approximating extreme
value statistics, Journal of Physics A: Mathematical and Theoretical 54 (31)
(2021) 315205.

[46] L. Gao, J. Schulman, J. Hilton, Scaling laws for reward model overopti-
790 mization, in: International Conference on Machine Learning, PMLR, 2023,
pp. 10835–10866.

[47] J. E. Gentle, Computational statistics, Springer, 2010.

[48] A. N. Kolmogorov, Sulla determinazione empirica di una legge di distribuzione,
Giornale dell'Istituto Italiano degli Attuari 4 (1933) 89–91.

795 [49] J. Fu, A. Kumar, O. Nachum, G. Tucker, S. Levine, D4rl: Datasets for
deep data-driven reinforcement learning, arXiv preprint arXiv:2004.07219
(2020).

Appendix A. Proof for Lemmas and Theorems.

Appendix A.1. Proof for Lemma 2:

Lemma 2. If a random variable X ∼ Gumbel(A, B) follows the Gumbel distribution
with location A and scale B, then X + C ∼ Gumbel(C + A, B) and DX ∼
Gumbel(DA, DB) for arbitrary constants C ∈ R and D > 0.

Proof. The cumulative distribution function (CDF) P for Gumbel(A, B) has been
given in Section 3.4:

    P(X < x) = exp(−e^{−(x−A)/B}).

Writing Y = X + C and Z = DX, we have

    P(Y < α) = P(X < α − C) = exp(−e^{−(α−(C+A))/B}),
    P(Z < α) = P(X < α/D) = exp(−e^{−(α/D−A)/B}) = exp(−e^{−(α−DA)/(DB)}),

which means

    Y ∼ Gumbel(C + A, B),
    Z ∼ Gumbel(DA, DB).

Appendix A.2. Proof for Lemma 3

Lemma 3. For a set of mutually independent random variables X_i ∼ Gumbel(C_i, β)
(1 ≤ i ≤ n), where C_i is a constant related to X_i and β is a positive constant,
we have max_i(X_i) ∼ Gumbel(β ln Σ_{i=1}^n e^{C_i/β}, β).

Proof. As mentioned in Section 3.4, the cumulative distribution function (CDF) for
Gumbel(C_i, β) is exp(−e^{−(x−C_i)/β}). Based on the independence, we have

    P(max_i(X_i) < A) = P(X_1 < A, X_2 < A, X_3 < A, ..., X_n < A),

where P(X_i < A) = exp(−e^{−(A−C_i)/β}). Then

    P(X_1 < A, ..., X_n < A) = ∏_{i=1}^n exp(−e^{−(A−C_i)/β}) = exp(−Σ_{i=1}^n e^{−(A−C_i)/β})
                             = exp(−e^{−A/β} Σ_{i=1}^n e^{C_i/β})
                             = exp(−e^{−(1/β)[A − β ln(Σ_{i=1}^n e^{C_i/β})]}),

so

    P(max_i(X_i) < A) = exp(−e^{−(1/β)[A − β ln(Σ_{i=1}^n e^{C_i/β})]}),

which means

    max_i(X_i) ∼ Gumbel(β ln Σ_{i=1}^n e^{C_i/β}, β).
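As a quick sanity check of Lemma 3 (not part of the proof), the closed-form location and scale can be compared against simulated maxima; the constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, C = 0.5, np.array([0.2, -1.0, 0.7])            # arbitrary scale and locations
samples = rng.gumbel(loc=C, scale=beta, size=(200_000, len(C)))
empirical_max = samples.max(axis=1)

# Lemma 3 predicts max_i X_i ~ Gumbel(beta * ln(sum_i exp(C_i / beta)), beta)
loc_pred = beta * np.log(np.exp(C / beta).sum())
euler_gamma = 0.5772156649
print(empirical_max.mean(), loc_pred + euler_gamma * beta)   # means should agree
print(empirical_max.var(), (np.pi ** 2 / 6) * beta ** 2)     # variances should agree
```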

Appendix A.3. Proof for Lemma 4


Lemma 4. For ϵ_t(s, a) defined in (12), under Assumptions 1-3, we show that

    ϵ_t(s, a) ∼ Gumbel(C_t(s, a) − γ max_{a′}(Q*(s′, a′)), β_t),

where

    C_1(s, a) = C_1,
    C_2(s, a) = γ(C_1(s, a) + β_1 ln Σ_{i=1}^n e^{r(s′,a_i)/β_1}),
    C_t(s, a) = γ(β_{t−1} ln Σ_{i=1}^n e^{(r(s′,a_i)+C_{t−1}(s′,a_i))/β_{t−1}})   (t ≥ 3).

For β_t, it always holds that

    β_t = γ^{t−1} β_1   (t ≥ 1).

Besides, ϵ_t(s, a) are independent for arbitrary pairs (s, a).

Proof. If we use the Bellman operator during the update at the t-th iteration for
estimating from (5), we will have

    Q̂_t(s, a) = r(s, a) + γ max_{a′}(Q̂_{t−1}(s′, a′)).

In Section 4.1, we have shown that Q* for (5) should satisfy

    Q*(s, a) = r(s, a) + γ max_{a′}(Q*(s′, a′)).

By subtracting these two equations, it can be deduced that the error ϵ_t(s, a) at
the t-th step is

    ϵ_t(s, a) = γ[max_{a′}(Q̂_{t−1}(s′, a′)) − max_{a′}(Q*(s′, a′))].

815 Let’s see what’s going on :


When t = 1, apparently:

ϵ1 (s, a) = γ max

(Q̂0 (s′ , a′ )) − γ max

(Q∗ (s′ , a′ )).
a a

We have shown that:

γ max

(Q̂0 (s′ , a′ )) ∼ Gumbel(C1 , β1 ).
a

While γ maxa′ (Q∗ (s′ , a′ )) is a constant, not a random variable, so it does not
affect the Gumbel distribution type, but it affects the location of this Gumbel
distribution. According to Lemma 2, we will have:

ϵ1 (s, a) ∼ Gumbel(C1 − γ max



(Q∗ (s′ , a′ )), β1 ).
a

Let us see what happens if we replace s with s′ . By assumption, the action space
A has finite elements, which means A = [a1 , a2 , ..., an ], so we can enumerate all
′′ ′′ ′′
actions to a list with the state-action pair: [(s′ , a1 , r1 , s1 ), (s′ , a2 , r2 , s2 ), ..., (s′ , an , rn , sn )],
′′
where si is gotten from T (s′ , ai ). According to the above discussion, we have:

′′
ϵ1 (s′ , ai ) ∼ Gumbel(C1 − γ max

(Q∗ (si , a′ )), β1 ).
a

Noticed that ϵ1 (s′ , ai ) and ϵ1 (s′ , aj ) are independent when i ̸= j, this is because
there is obviously no relationship between the two different actions ai and aj .
This is not a difficult fact to understand. In fact, we will show in Fact 1 that
for any different (s, a) pair, ϵ1 (s, a) will be independent.
820 Fact 1: For any different (s, a) pair, ϵ1 (s, a) will be independent.
This may be surprising, because according to the mapping T , this has es-
tablished a relationship between s′k and (s, ak ) with s′k = T (s, ak ).
Proof for the Fact 1:

For any two different pairs (s1 , ak ) and (s2 , aj ), define T (s1 , ak ) = s′1k and
T (s2 , aj ) = s′2j , noticed that:

ϵ1 (s1 , ak ) = γ max

(Q̂0 (s′1k , a′ ))−γ max

(Q∗ (s′1k , a′ )) ∼ Gumbel(C1 −γ max

(Q∗ (s′1k , a′ )), β1 ).
a a a

1
ϵ (s2 , aj ) = γ max

(Q̂ 0
(s′2j , a′ ))−γ max

(Q ∗
(s′2j , a′ )) ∼ Gumbel(C1 −γ max

(Q∗ (s′2j , a′ )), β1 ).
a a a

According to Assumption 3, we show that γ maxa′ (Q̂ 0


(s′1k , a′ )) and γ maxa′ (Q̂0 (s′2j , a′ ))
825 are independent. This is due to the randomness of the initialization. On the
other hand, γ maxa′ (Q∗ (s′1k , a′ )) and γ maxa′ (Q∗ (s′2j , a′ )) are two fixed num-
ber. Although they are constrained by the Bellman equation, they are not
variables. Therefore, ϵ1 (s1 , ak ) is independent with ϵ1 (s2 , aj ), in this way we
have proved Fact 1. We will see later that only at the same time t can keep
830 this property.
Let us continue our discussion. When t = 2, we will have:

ϵ2 (s, a) = γ max

(Q∗ (s′ , a′ ) + ϵ1 (s′ , a′ )) − γ max

(Q∗ (s′ , a′ )).
a a

Let Q∗ (s′ , ai ) + ϵ1 (s′ , ai ) be Li , according to Fact 1 we have discussed above,


Li is a sequence of mutually independent countable random variables, using
Lemma 2, we can have:
′′
Li ∼ Gumbel(Q∗ (s′ , ai ) + C1 − γ max

(Q∗ (si , a′ )), β1 ).
a

Noticed that:
′′
Q∗ (s′ , ai ) = r(s′ , ai ) + γ max

(Q∗ (si , a′ )).
a

So:
Li ∼ Gumbel(r(s′ , ai ) + C1 , β1 ).

Each Li is independent of each other. According to Lemma 3, then


n
X r(s′ ,ai )
∗ ′ 1 ′
maxai (Q (s , ai )+ϵ (s , ai )) = maxi (Li ) ∼ Gumbel(C1 +β1 ln e β1
, β1 ).
i=1

Let γmaxi (Li ) ∼ Gumbel(C2 (s, a), β2 ). Because the discounted factor γ is
positive number, according to Lemma 2:
n
X r(s′ ,ai )
C2 (s, a) = γ(C1 + β1 ln e β1
).
i=1

β2 = γ(β1 ).

So:
ϵ2 (s, a) ∼ Gumbel(C2 (s, a) − γ max

(Q∗ (s′ , a′ )), β2 ).
a

We also have a Fact 2 similar to Fact 1.


Fact 2: For any different (s, a) pair, ϵ2 (s, a) will be independent.
Proof for the Fact 2: For any two different pairs (s1 , ak ) and (s2 , aj ),
define T (s1 , ak ) = s′1k and T (s2 , aj ) = s′2j , noticed that:

ϵ2 (s1 , ak ) = γ max

(Q∗ (s′1k , a′ ) + ϵ1 (s′1k , a′ )) − γ max

(Q∗ (s′1k , a′ )).
a a

From here we can see that, ϵ2 (s1 , ak ) and any ϵ1 (s′1k , aj ) are not indepen-
dent, on the other hand:

ϵ2 (s2 , aj ) = γ max

(Q∗ (s′2j , a′ ) + ϵ1 (s′2j , a′ )) − γ max

(Q∗ (s′2j , a′ )).
a a

According to Fact 1, for any action am , an , ϵ1 (s′1k , an ) is independent with


any ϵ1 (s′2j , am ). Q∗ (s′2j , am ) is a fixed number. It does not introduce any ran-
835 domness, so it does not affect independence and randomness. So ϵ1 (s′1k , an )
is independent with γ maxa′ (Q∗ (s′2j , a′ ) + ϵ1 (s′2j , a′ )) for any n. Similarly,
γ maxa′ (Q∗ (s′2j , a′ ) + ϵ1 (s′2j , a′ )) is independent with γ maxa′ (Q∗ (s′1k , a′ ) +
ϵ1 (s′1k , a′ )). For the rest of the part γ maxa′ (Q∗ (s′1k , a′ )) and γ maxa′ (Q∗ (s′2j , a′ )).
they can all be treated as the fix constants, so they don’t affect the indepen-
840 dence, so we have proved the Fact 2.
At this point, we can already discern some patterns. However, to ensure
thoroughness, we will conduct one more iteration here for t = 3:

ϵ3 (s, a) = γ max

(Q∗ (s′ , a′ ) + ϵ2 (s′ , a′ )) − γ max

(Q∗ (s′ , a′ )).
a a

In this case, Let Q∗ (s′ , ai ) + ϵ2 (s′ , ai ) be Mi , obviously there is no connection


between ai and aj when i ̸= j. So Mi is a sequence of mutually independent
countable random variables, using Lemma 3 again, we can have:

′′
Mi ∼ Gumbel(Q∗ (s′ , ai ) + C2 (s′ , ai ) − γ max

(Q∗ (si , a′ )), β2 ).
a

Mi ∼ Gumbel(r(s′ , ai ) + C2 (s′ , ai ), β2 ).

According to Lemma 3, then


n
X r(s′ ,ai )+C2 (s′ ,ai )
∗ ′ 2 ′
maxai (Q (s , ai )+ϵ (s , ai )) = maxi (Mi ) ∼ Gumbel(β2 ln e β2
, β2 ).
i=1

Let:
n
X r(s′ ,ai )+C2 (s′ ,ai )
C3 (s, a) = γ(β2 ln e β2
).
i=1

β3 = γβ2 .

We will have:

ϵ3 (s, a) ∼ Gumbel(C3 (s, a) − γ max



(Q∗ (s′ , a′ )), β3 ).
a

Similar to Fact 1 and Fact 2, of course, there is Fact 3 to hold.


Fact 3: For any different (s, a) pair, ϵ3 (s, a) will be independent.
The proof for Fact 3 is the same as Fact 2.
Continuing in this manner, we will find that when t ≥ 3, the approach
becomes identical. We will have a general iteration format:

    ϵ_t(s, a) ∼ Gumbel(C_t(s, a) − γ max_{a′}(Q*(s′, a′)), β_t),

where

    C_2(s, a) = γ(C_1 + β_1 ln Σ_{i=1}^n e^{r(s′,a_i)/β_1}),
    C_t(s, a) = γ(β_{t−1} ln Σ_{i=1}^n e^{(r(s′,a_i)+C_{t−1}(s′,a_i))/β_{t−1}})   (t ≥ 3).

We also have a summary fact here.


845 Summary Fact: For any different (s, a) pair, ϵt (s, a) will be independent
for the same t.
The reason we do not merge C2 and Ct is to emphasize that C1 is a constant.
If the following special cases in Remark 1 can be satisfied, it will be found that
all Ci for any i are constants without distinction.
850 Proof for Remark 1:

If for ∀s1 , s2 , let us define S1 and S2 sets as follows:

S1 = [r(s1 , a1 ), r(s1 , a2 ), ..., r(s1 , an )].

S2 = [r(s2 , a1 ), r(s2 , a2 ), ..., r(s2 , an )].

Obviously neither S1 and S2 are empty set, if S1 and S2 satisfy:

S1 △S2 = ∅.

Then when t = 2:
n
X r(s′ ,ai )
C2 = γ(C1 + β1 ln e β1
).
i=1

β2 = γ(β1 ).

So:
ϵ2 (s, a) ∼ Gumbel(C2 − γ max

(Q∗ (s′ , a′ )), β2 ).
a

This means this condition removes the correlation between Ci and (s, a) under
our assumption. So:

ϵt (s, a) ∼ Gumbel(Ct − γ max



(Q∗ (s′ , a′ )), βt ).
a

Where
n
X r(s′ ,ai )
Ct = γ(Ct−1 + βt−1 ln e βt−1
)(t ≥ 2).
i=1

Appendix A.4. Proof for Lemma 5:

Lemma 5. For random variables X ∼ Gumbel(CX , β) and Y ∼ Gumbel(CY , β),


if X and Y are independent, then (X − Y ) ∼ Logistic(CX − CY , β).

Proof. Let p_1(x), p_2(y) denote the PDFs of X and Y, and P_1(x), P_2(y) their CDFs.
Then

    P(X − Y < z) = P(X < Y + z) = ∫_{−∞}^{+∞} ∫_{−∞}^{y+z} p_1(x) p_2(y) dx dy
                 = ∫_{−∞}^{+∞} P_1(y + z) p_2(y) dy.

Substituting the Gumbel CDF and PDF,

    ∫_{−∞}^{+∞} P_1(y + z) p_2(y) dy
      = ∫_{−∞}^{+∞} exp(−e^{−(y+z−C_X)/β}) · (1/β) exp(−((y−C_Y)/β + e^{−(y−C_Y)/β})) dy
      = (1/β) ∫_{−∞}^{+∞} e^{−(y−C_Y)/β} exp(−e^{−(y−C_Y)/β}(1 + e^{(C_X−C_Y−z)/β})) dy.

Take U = e^{−(y−C_Y)/β}, so that dU = −(1/β) U dy. Then

    (1/β) ∫_{−∞}^{+∞} e^{−(y−C_Y)/β} exp(−e^{−(y−C_Y)/β}(1 + e^{(C_X−C_Y−z)/β})) dy
      = ∫_0^{+∞} exp(−U(1 + e^{(C_X−C_Y−z)/β})) dU
      = 1 / (1 + e^{(C_X−C_Y−z)/β}).

So we have shown that

    P(X − Y < z) = 1 / (1 + e^{(C_X−C_Y−z)/β}) = 1 / (1 + e^{−(z−(C_X−C_Y))/β}).

According to Section 3.4, this means

    X − Y ∼ Logistic(C_X − C_Y, β).
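Lemma 5 can likewise be checked numerically, purely as an illustration with arbitrary constants: the difference of two independent, equal-scale Gumbel samples should be indistinguishable from the predicted Logistic distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
c_x, c_y, beta = 1.5, -0.5, 0.8                      # arbitrary locations and a shared scale
x = rng.gumbel(c_x, beta, size=300_000)
y = rng.gumbel(c_y, beta, size=300_000)

# Lemma 5: X - Y ~ Logistic(c_x - c_y, beta); compare the samples with this CDF
ks = stats.kstest(x - y, stats.logistic(loc=c_x - c_y, scale=beta).cdf)
print(ks.statistic)   # should be close to 0 when the two distributions match
```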

Appendix A.5. Proof for Lemma 6:


Lemma 6. If the random variable X ∼ Gumbel(A, 1), then E[e^{−X}] and E[Xe^{−X}]
can both be bounded:

(1)
    E[e^{−X}] < (20/e² + 10e^{−e^{1/2}} + 1/2 − 1/(2e)) e^{−A}.

(2) When A > 0:
    E[Xe^{−X}] < (3/20 + A(20/e² + 10e^{−e^{1/2}} + 1/2 − 1/(2e))) e^{−A}.

When A ≤ 0:
    E[Xe^{−X}] < (3/20) e^{−A}.
Proof. Let us assume that X ∼ Gumbel(0, 1) first. According to Section 3.4,
we have known that:
−X
X ∼ Gumbel(0, 1), p(X) = e−(X+e )
.

For (1):
Z +∞ Z +∞ Z +∞
−X −X −X −(X+e−X ) −X
E[e ]= e p(X)dX = e ·e dX = e−(2X+e )
dX.
−∞ −∞ −∞

For (2): Z +∞
−X
−X
E[Xe ]= Xe−(2X+e )
dX.
−∞
860 We split this integral into the parts for X > 0 and X < 0 for separate discussions
now.
When X > 0, it is easy to see that:

e2X > eX → e−2X < e−X → e−2X − e−X < 0.

So we will have:
−X +∞ +∞ +∞
e(−2X−e )
Z Z Z
−2X
−e−X ) −X −2X −2X

(−2X−e −2X = e(e <1→ e−(2X+e )


dX < e−(2X+e )
dX < e−(2X+e )
dX.
e ) 0 0 −∞

In fact: Z +∞
−2X 1 1
e−(2X+e )
dX = − .
0 2 2e
On the other hand, obviously:
Z +∞ Z +∞
−X −2X
Xe−(2X+e ) dX < Xe−(2X+e )
dX.
0 0

When we take X < −5, for (1), we will have:


−X −5 −5
e(−2X−e
Z Z
) −0.1X
−e−X ) −X −0.1X
= e(−1.9X+e <1→ e−(2X+e )
dX < e−(0.1X+e )
dX.
e(−0.1X−e−0.1X ) −∞ −∞

In fact: Z −5 1
−0.1X
e−(0.1X+e )
dX = 10e−e 2 .
−∞
For (2), if we take X < 0:
−X
e(−2X−e ) −2X
−e−X ) −X −2X

(−2X−e −2X = e(e > 1 → Xe(−2X−e ) < Xe(−2X−e ).


e )
Z 0 Z 0
−(2X+e−X ) −2X
Xe dX < Xe−(2X+e )
dX.
−∞ −∞
−X
When −5 ≤ X ≤ 0, let U (X) = 2X + e :
dU
= 0 → x = −ln2 → min(U (x)) = 2 − 2ln2 → max(−U (x)) = 2ln2 − 2.
dX

Z 0
−X 20
e−(2X+e )
dX ≤ 5(eln4−2 ) = .
−5 e2
So, we can easily observe that:
Z +∞ Z −5 Z 0 Z +∞
−X −X −X −X
e−(2X+e ) dX = e−(2X+e ) dX+ e−(2X+e )
dX+ e−(2X+e )
dX.
−∞ −∞ −5 0

So: Z +∞
−X 20 1
1 1
e−(2X+e )
dX <
2
+ 10e−e 2 + − .
−∞ e 2 2e
Z +∞ Z 0 Z +∞
−X −2X −2X
Xe−(2X+e )
dX < Xe−(2X+e )
dX + Xe−(2X+e )
dX.
−∞ −∞ 0
The expectation of the Gumbel distribution is known. In fact, if X ∼ Gumbel(A, B),
then E[X] = A + vB where v ≈ 0.5772 < 0.6 represent the Euler–Mascheroni
constant. This has already been discussed in Section 3.4. In summary:
Z +∞
−X 20 1
1 1
e−(2X+e ) dX < 2 + 10e−e 2 + − .
−∞ e 2 2e
Z +∞
−X 1 1 3 3
Xe−(2X+e ) dX < ( )v < ( ) = .
−∞ 4 4 5 20
These are the boundaries when X follows a Gumbel(0, 1) distribution. Now,
let’s consider the case when X follows a Gumbel(A, 1) distribution. If X ∼
Gumbel(A, 1), according to Lemma 2, X − A ∼ Gumbel(0, 1), then E(e−(X−A) )
can be bounded:
20 1
−e 2 1 1 −X 20 1
−e 2 1 1
E(eA−X ) < + 10e + − → E[e ] < ( + 10e + − )e−A .
e2 2 2e e2 2 2e
3 3
E((X − A)eA−X ) < . → eA E[Xe−X ] − AeA E[e−X ] < .
20 20
3
eA E[Xe−X ] < + AeA E[e−X ].
20
So when A > 0:
3 20 1
1 1 3 20 1
1 1
eA E[Xe−X ] < +A( 2 +10e−e 2 + − ) → E[Xe−X ] < ( +A( 2 +10e−e 2 + − ))e−A .
20 e 2 2e 20 e 2 2e
But when A < 0, noticed that E[e−X ] > 0, so we have:
3 3
eA E[Xe−X ] < → E[Xe−X ] < ( )e−A .
20 20

Appendix A.6. Proof for Theorem 1:
Theorem 1. (Logistic distribution for Bellman error): The Bellman error
ε_θ(s, a) approximately follows the Logistic distribution under Assumptions 1-4.
The degree of approximation can be measured by the upper bound of the KL
divergence between

    X ∼ Gumbel(β_θ ln Σ_{i=1}^n e^{(r(s′,a_i)+C_θ(s′,a_i))/β_θ}, β_θ)

and

    Y ∼ Gumbel(γ β_θ ln Σ_{i=1}^n e^{(r(s′,a_i)+C_θ(s′,a_i))/β_θ}, γ β_θ).

Let A* = ln Σ_{i=1}^n e^{(r(s′,a_i)+C_θ(s′,a_i))/β_θ}. We have the following conclusions:

1. If A* > 0, then KL(Y‖X) < log(1/γ) + (1 − γ)[A*(20/e² + 10e^{−e^{1/2}} − 1/2 − 1/(2e)) + 3/20 − v].

2. If A* ≤ 0, then KL(Y‖X) < log(1/γ) + (1 − γ)[3/20 − A* − v].

3. The order of the KL divergence error is controlled at O(log(1/(1 − κ_0)) + κ_0 A*).

If the upper bound of the KL divergence is sufficiently small, then ε_θ(s, a) follows
the Logistic distribution, i.e.,

    ε_θ(s, a) ∼ Logistic(C_θ(s, a) − β_θ ln Σ_{i=1}^n e^{(r(s′,a_i)+C_θ(s′,a_i))/β_θ}, β_θ).

Proof. According to (14), we have had the definition for the Bellman error under
the setting of parameter θ:

εθ (s, a) = Q̂θ (s, a) − r(s, a) − γ max



(Q̂θ (s′ , a′ )).
a

Where:
Q̂θ (s, a) = Q∗ (s, a) + ϵθ (s, a).

From the proof of Lemma 4, we have already known that:

ϵθ (s, a) ∼ Gumbel(Cθ (s, a) − γ max



(Q∗ (s′ , a′ )), βθ ).
a

So:
εθ (s, a) = Q∗ (s, a) + ϵθ (s, a) − r(s, a) − γ max

(Q̂θ (s′ , a′ )).
a

Because of:
Q∗ (s, a) = r(s, a) + γmaxa′ Q∗ (s′ , a′ ).

So:

εθ (s, a) = γ max

[Q∗ (s′ , a′ )] + ϵθ (s, a) − γ max

[Q∗ (s′ , a′ ) + ϵθ (s′ , a′ )].
a a

870 Notice that this equation has two parts: (1)γ maxa′ [Q∗ (s′ , a′ )] + ϵθ (s, a) and
(2)γ maxa′ [Q∗ (s′ , a′ ) + ϵθ (s′ , a′ )]. Let us discuss them separately.
Let us first analyze part (1), according to Lemma 2, it is easy to have:

γ max

[Q∗ (s′ , a′ )] + ϵθ (s, a) ∼ Gumbel(Cθ (s, a), βθ ).
a

For another part (2):

ϵθ (s′ , ai ) ∼ Gumbel(Cθ (s′ , ai ) − γ max



(Q∗ (si ′′ , a′ )), βθ ).
a

Thus using Lemma 2, we have:

[Q∗ (s′ , ai )+ϵθ (s′ , ai )] ∼ Gumbel(Cθ (s′ , ai )−γ max



(Q∗ (si ′′ , a′ ))+Q∗ (s′ , ai ), βθ ).
a

Because of:
−γ max

(Q∗ (si ′′ , a′ )) + Q∗ (s′ , ai ) = r(s′ , ai ).
a

So:

Li = [Q∗ (s′ , ai ) + ϵθ (s′ , ai )] ∼ Gumbel(Cθ (s′ , ai ) + r(s′ , ai ), βθ ).

In the proof of Lemma 4, the independence of Li has already been taken into
account, therefore, using Lemma 3, we can know that:
n
X r(s′ ,ai )+Cθ (s′ ,ai )
maxai [Q∗ (s′ , ai ) + ϵθ (s′ , ai )] ∼ Gumbel(βθ ln e βθ
, βθ ).
i=1

According to the proof of Lemma 4, maxai [Q∗ (s′ , ai )+ϵθ (s′ , ai )] and γ maxa′ [Q∗ (s′ , a′ )]+
ϵθ (s, a) are independent under the same parameter θ. Now we want to use the
Lemma 5, according to Lemma 2, noticed that:
n
X r(s′ ,ai )+Cθ (s′ ,ai )
γmaxai [Q∗ (s′ , ai ) + ϵθ (s′ , ai )] ∼ Gumbel(γβθ ln e βθ
, γβθ ).
i=1

γ max

[Q∗ (s′ , a′ )] + ϵθ (s, a) ∼ Gumbel(Cθ (s, a), βθ ).
a

Thus we cannot use Lemma 5 directly because the scale parameters are not the
same even though they are independent, so we need to give an approximation
with certain error conditions now.
Assume that:

X = maxai [Q∗ (s′ , ai ) + ϵθ (s′ , ai )] ∼ Gumbel(A, B).

Y = γmaxai [Q∗ (s′ , ai ) + ϵθ (s′ , ai )] ∼ Gumbel(γA, γB).

where:
n
X r(s′ ,ai )+Cθ (s′ ,ai )
A = βθ ln e βθ
.
i=1

B = βθ .
n r(s′ ,ai )+Cθ (s′ ,ai )
A X
A∗ = = ln e βθ
.
B i=1

Let us see the KL divergence between these two distributions. Treating the
PDFs of Gumbel(A, B) and Gumbel(γA, γB) as p(x) and q(x), according to
Section 3.4 we have

    p(x) = (1/B) exp(−((x − A)/B + e^{−(x−A)/B})),
    q(x) = (1/(γB)) exp(−((x − γA)/(γB) + e^{−(x−γA)/(γB)})).

According to the definition of KL divergence, we have

    KL(q(x)‖p(x)) = E_{x∼q(x)}[log(q(x)/p(x))]
      = E_{x∼q(x)}[log(1/γ) − ((x − γA)/(γB) + e^{−(x−γA)/(γB)}) + ((x − A)/B + e^{−(x−A)/B})]
      = log(1/γ) + (1/B − 1/(γB)) E_{x∼q(x)}[x] + e^{A/B} E_{x∼q(x)}[e^{−x/B} − e^{−x/(γB)}],

where 0 < γ < 1 is the discount factor and, according to Assumption 4, 1 − γ =
κ < κ_0 ≤ δ_0 with sufficiently small δ_0.

Using Lemma 2, we have shown that if x ∼ Gumbel(γA, γB), then x′ =
x
γB
A
∼ Gumbel( B , 1). dx′ = 1
γB dx. So:
Z +∞ x−γA Z +∞
x 1 −( x−γA −
γB ) − x ′ A −(x′ − A ) ′ ′
Ex∼q(x) [e− γB ] = e γB +e e γB dx = e−(x − B +e B )
e−x dx′ = Ex′ [e−x ].
−∞ γB −∞

According to Lemma 6, we have

x ′ 20 1
1 1 A
Ex∼q(x) [e− γB ] = Ex′ [e−x ] < ( 2
+ 10e−e 2 + − )e− B .
e 2 2e
x ′ ′
Ex∼q(x) [xe− γB ] = Ex′ [γBx′ e−x ] = γBEx′ [x′ e−x ].

If A > 0, according to Lemma 6:

x 3 A 20 1
1 1 A 3 20 1
1 1 A
Ex∼q(x) [xe− γB ] < ( + ( 2 +10e−e 2 + − ))e− B γB = ( γB+γA( 2 +10e−e 2 + − ))e− B .
20 B e 2 2e 20 e 2 2e

If A ≤ 0, then:
x 3γB − A
Ex∼q(x) [xe− γB ] < ( )e B .
20
According to our assumption, this bound can be kept under a sufficiently small
−x
δ0 . Let H( 1t ) = Ex∼q(x) [e t ]. Using Lagrange’s mean value theorem, there can
be a l ∈ [γ, 1], satisfy:

H( B1 ) − H( γB
1
) 1
= H ′( ).
( B1 − 1
γB )
lB

Noticed that ( B1 − 1
γB ) < 0. Under our assumption, we have known that:
(1) A > 0:

1 x 3 20 1
1 1 A
H ′( ) = Ex∼q(x) [−xe− lB ] > −( γB + γA( 2 + 10e−e 2 + − ))e− B ,
lB 20 e 2 2e

So:

1 1 1 1−γ 3 γA 20 1
1 1 A
( − )H ′ ( ) < ( γ+ ( 2 + 10e−e 2 + − ))e− B .
B γB lB γ 20 B e 2 2e

(2) A ≤ 0:
1 x 3γB − A
H ′( ) = Ex∼q(x) [−xe− lB ] > −( )e B ,
lB 20
1 1 1 1−γ 3 A
( − )H ′ ( ) < ( γ)e− B .
B γB lB γ 20

Thus, we can rearrange the above equation to obtain:
(1) A > 0:

    KL(q(x)‖p(x)) < log(1/γ) + (1/B)((γ−1)/γ) E_{x∼q(x)}[x]
                    + ((1−γ)/γ)((3/20)γ + (γA/B)(20/e² + 10e^{−e^{1/2}} + 1/2 − 1/(2e))).

(2) A ≤ 0:

    KL(q(x)‖p(x)) < log(1/γ) + (1/B)((γ−1)/γ) E_{x∼q(x)}[x] + ((1−γ)/γ)(3/20)γ.

We have shown that E_{x∼q(x)}[x] = γA + γBv, where v is the Euler–Mascheroni
constant. Hence:
(1) A > 0:

    KL(q(x)‖p(x)) < log(1/γ) + (γ−1)(A/B + v)
                    + ((1−γ)/γ)((3/20)γ + (γA/B)(20/e² + 10e^{−e^{1/2}} + 1/2 − 1/(2e))).

(2) A ≤ 0:

    KL(q(x)‖p(x)) < log(1/γ) + (γ−1)(A/B + v) + ((1−γ)/γ)(3/20)γ.

Finally, summarizing the above yields the KL bound:
(1) A > 0:

    KL(q(x)‖p(x)) < log(1/γ) + (1−γ)[(A/B)(20/e² + 10e^{−e^{1/2}} − 1/2 − 1/(2e)) + 3/20 − v].

(2) A ≤ 0:

    KL(q(x)‖p(x)) < log(1/γ) + (1−γ)[3/20 − A/B − v].

Let us prove that these two upper bounds are well-defined.
(1) A > 0: Let f(A) = log(1/γ) + (1−γ)[(A/B)(20/e² + 10e^{−e^{1/2}} − 1/2 − 1/(2e)) + 3/20 − v].
Obviously f(A) > f(0), where

    f(0) = log(1/γ) + (1−γ)[3/20 − v] > log(1/γ) + (γ−1)(9/20) = g(γ),
    ∂g/∂γ = −1/γ + 9/20 < 0.

So f(A) > f(0) = g(γ) > g(1) = 0.
(2) A ≤ 0: Let f(A) = log(1/γ) + (1−γ)[3/20 − A/B − v]. Obviously it still holds that
f(A) ≥ f(0), consistent with the discussion above. So f(A) ≥ f(0) = g(γ) > g(1) = 0.

Therefore, these two bounds are well-defined and meaningful; they indicate that
the two distributions can be considered approximately identical within the KL
divergence error limit. It is obvious that when γ = 1, KL(q(x)‖p(x)) = 0.

Next, let us discuss the order of this error. As defined, κ lies in a small
neighborhood of zero with radius κ_0, so the growth order of the KL divergence
KL(q(x)‖p(x)) is

    O(log(1/(1 − κ_0)) + κ_0 A/B).

Within this error control range, we consider that γ does not affect the distribution
type and coefficient magnitude, allowing us to apply Lemma 5 now. According to
Lemma 5,

    ε_θ(s, a) ∼ Gumbel(C_θ(s, a), β_θ) − Gumbel(β_θ ln Σ_{i=1}^n e^{(r(s′,a_i)+C_θ(s′,a_i))/β_θ}, β_θ),

which means

    ε_θ(s, a) ∼ Logistic(C_θ(s, a) − β_θ ln Σ_{i=1}^n e^{(r(s′,a_i)+C_θ(s′,a_i))/β_θ}, β_θ).

The discussion of Remark 2 is consistent with Lemma 4: Remark 2 removes the
connection between C_θ and the pair (s, a), so in this case

    ε_θ(s, a) ∼ Logistic(C_θ − β_θ ln Σ_{i=1}^n e^{(r(s′,a_i)+C_θ)/β_θ}, β_θ).

So we have shown that

    ε_θ(s, a) ∼ Logistic(−β_θ ln Σ_{i=1}^n e^{r(s′,a_i)/β_θ}, β_θ).
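To make the bound concrete, the following sketch (with illustrative values of A, B and γ that are not tied to any experiment in the paper) estimates KL(Y‖X) by Monte Carlo and compares it with the case A* > 0 bound of Theorem 1.

```python
import numpy as np

def gumbel_logpdf(x, loc, scale):
    z = (x - loc) / scale
    return -np.log(scale) - z - np.exp(-z)

rng = np.random.default_rng(2)
A, B, gamma = 2.0, 1.0, 0.99                        # illustrative values with A* = A / B > 0
x = rng.gumbel(gamma * A, gamma * B, size=500_000)  # samples from q = Gumbel(gamma*A, gamma*B)

# Monte Carlo estimate of KL(q || p) = E_q[log q(x) - log p(x)]
kl = np.mean(gumbel_logpdf(x, gamma * A, gamma * B) - gumbel_logpdf(x, A, B))

# Upper bound from Theorem 1 (case A* > 0), with v the Euler-Mascheroni constant
v = 0.5772156649
c = 20 / np.e ** 2 + 10 * np.exp(-np.exp(0.5)) - 0.5 - 1 / (2 * np.e)
bound = np.log(1 / gamma) + (1 - gamma) * ((A / B) * c + 3 / 20 - v)
print(kl, bound)   # the estimate should fall below the bound
```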

Appendix A.7. Proof for Theorem 2

Theorem 2. (Positive Scaling upper bounds under Remark 2): Denote by
r+ and r− the positive and negative rewards, with r > 0 and r < 0, respectively.
With i_1 + i_2 + i_3 = n, assume that

    Σ_{i=1}^n e^{r(s′,a_i)/β_θ} = Σ_{i=1}^{i_1} e^{r+(s′,a_i)/β_θ} + Σ_{i=1}^{i_2} e^{r−(s′,a_i)/β_θ} + i_3.

If it satisfies

1. i_1 ≠ 0,
2. Σ_{i=1}^{i_1} e^{r+(s′,a_i)/β_θ} r+(s′,a_i) + Σ_{i=1}^{i_2} e^{r−(s′,a_i)/β_θ} r−(s′,a_i) < 0,

then there exists an optimal scaling ratio φ* > 1 such that any scaling ratio φ
that effectively reduces the expectation of the Bellman error must satisfy

    1 ≤ φ ≤ φ*.

Proof. According to our discussion, ε_θ(s, a) should satisfy

    ε_θ(s, a) ∼ Logistic(−β_θ ln Σ_{i=1}^n e^{r(s′,a_i)/β_θ}, β_θ).

According to condition (1), i_1 ≠ 0, which means −β_θ ln Σ_{i=1}^n e^{r(s′,a_i)/β_θ} < 0.
So if the scaling factor φ is effective, it should satisfy

    −β_θ ln Σ_{i=1}^n e^{r(s′,a_i)/β_θ} ≤ −β_θ ln Σ_{i=1}^n e^{φ r(s′,a_i)/β_θ},

because our target is (15). Then we have

    ln Σ_{i=1}^n e^{r(s′,a_i)/β_θ} ≥ ln Σ_{i=1}^n e^{φ r(s′,a_i)/β_θ}.

Using our decomposition form, we get

    ln[Σ_{i=1}^{i_1} e^{r+(s′,a_i)/β_θ} + Σ_{i=1}^{i_2} e^{r−(s′,a_i)/β_θ} + i_3]
      ≥ ln[Σ_{i=1}^{i_1} e^{φ r+(s′,a_i)/β_θ} + Σ_{i=1}^{i_2} e^{φ r−(s′,a_i)/β_θ} + i_3],

which means

    Σ_{i=1}^{i_1} e^{r+(s′,a_i)/β_θ} + Σ_{i=1}^{i_2} e^{r−(s′,a_i)/β_θ}
      ≥ Σ_{i=1}^{i_1} e^{φ r+(s′,a_i)/β_θ} + Σ_{i=1}^{i_2} e^{φ r−(s′,a_i)/β_θ}.

Let G(φ) = Σ_{i=1}^{i_1} e^{φ r+(s′,a_i)/β_θ} + Σ_{i=1}^{i_2} e^{φ r−(s′,a_i)/β_θ}. Then

    ∂G/∂φ = (1/β_θ)[Σ_{i=1}^{i_1} e^{φ r+(s′,a_i)/β_θ} r+(s′,a_i) + Σ_{i=1}^{i_2} e^{φ r−(s′,a_i)/β_θ} r−(s′,a_i)].

Obviously ∂G/∂φ is monotonically increasing w.r.t. φ. According to our assumption
(condition (2)), ∂G/∂φ(1) < 0, while lim_{φ→∞} ∂G/∂φ(φ) > 0. By the intermediate
value theorem, there exists a φ* ∈ (1, +∞) satisfying ∂G/∂φ(φ*) = 0. When
1 ≤ φ ≤ φ*, G is non-increasing, so G(1) ≥ G(φ), which means

    Σ_{i=1}^{i_1} e^{r+(s′,a_i)/β_θ} + Σ_{i=1}^{i_2} e^{r−(s′,a_i)/β_θ} = G(1)
      ≥ G(φ) = Σ_{i=1}^{i_1} e^{φ r+(s′,a_i)/β_θ} + Σ_{i=1}^{i_2} e^{φ r−(s′,a_i)/β_θ}.

That is what we want to prove.
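Under the conditions of Theorem 2, φ* is simply the root of ∂G/∂φ on (1, ∞) and can be located numerically; the reward values below are made up solely for illustration.

```python
import numpy as np
from scipy.optimize import brentq

beta = 1.0
r_pos = np.array([0.3])             # illustrative positive rewards (so i1 != 0)
r_neg = np.array([-1.0, -2.0])      # illustrative negative rewards

def dG_dphi(phi):
    # derivative of G(phi), up to the positive factor 1/beta
    return (np.exp(phi * r_pos / beta) * r_pos).sum() + (np.exp(phi * r_neg / beta) * r_neg).sum()

assert dG_dphi(1.0) < 0             # condition (2) of Theorem 2 for this toy reward set
phi_star = brentq(dG_dphi, 1.0, 100.0)
print(phi_star)                     # scaling ratios in [1, phi_star] reduce the error expectation
```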

Appendix A.8. Solving for Equation (8)

Because

    π* = arg max_π [ Σ_{s′,a′} P(s′|s, a) π(a′|s′) ( Q^k(s′, a′) − ζ log (π(a′|s′)/µ(a′|s′)) ) ],

subject to the equality constraint

    Σ_{a′} π*(a′|s′) = 1,

we construct the Lagrange function as

    f(π, L) = Σ_{s′,a′} P(s′|s, a) π(a′|s′) [ Q^k(s′, a′) − ζ log (π(a′|s′)/µ(a′|s′)) ] + L [ Σ_{a′} π(a′|s′) − 1 ].

Taking

    ∂f/∂L = 0,   ∂f/∂π = 0,

we have

    µ(a|s) e^{(Q^k(s,a)+L−1)/ζ} = π(a|s),
    log( e^{(L−1)/ζ} Σ_a µ(a|s) e^{Q^k(s,a)/ζ} ) = log(1) = 0  →  L = 1 − ζ log Σ_a µ(a|s) e^{Q^k(s,a)/ζ}.

Then, substituting L back, we obtain

    π*(a|s) = µ(a|s) e^{Q^k(s,a)/ζ} / Σ_a µ(a|s) e^{Q^k(s,a)/ζ}.
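The closed form above is a prior-weighted softmax over Q-values; a small self-contained sketch (with illustrative inputs and a finite action set) is:

```python
import numpy as np

def optimal_policy(q_values, mu, zeta):
    """pi*(a|s) proportional to mu(a|s) * exp(Q(s, a) / zeta)."""
    logits = q_values / zeta + np.log(mu)
    logits -= logits.max()            # subtract the maximum for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Toy example: three actions with a uniform prior mu
print(optimal_policy(np.array([1.0, 2.0, 0.5]), np.full(3, 1 / 3), zeta=0.5))
```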

Appendix A.9. Proof for Theorem 4

Theorem 4. (Relationship between LLoss and MSELoss) The MSELoss can
be used as an approximate estimation of LLoss(0, σ) when ε is sufficiently small,
i.e.,

    LLoss(0, σ) = ln 4 + (1/2) MSELoss + o(ε³),

where o(ε³) is a third-order infinitesimal of ε when ε is sufficiently small.

Proof. Let ε/σ = t. Then

    MSELoss = (1/2) t²,
    LLoss(0, σ) = t + 2 log(1 + e^{−t}).

Performing a Taylor expansion of LLoss around t = 0, we have

    LLoss = t + 2[ln 2 − (1/2) t + (1/8) t² + o(t³)] = ln 4 + (1/4) t² + o(t³)
          = ln 4 + (1/2) MSELoss + o(t³).
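A quick numerical check of this expansion (with arbitrary values of t = ε/σ; not taken from the paper's experiments):

```python
import numpy as np

def lloss_scalar(t):
    # LLoss(0, sigma) written in terms of t = epsilon / sigma
    return t + 2 * np.log1p(np.exp(-t))

for t in [0.5, 0.1, 0.01]:
    approx = np.log(4) + 0.5 * (0.5 * t ** 2)   # ln 4 + (1/2) * MSELoss, with MSELoss = t^2 / 2
    print(t, lloss_scalar(t) - approx)          # the gap vanishes at the o(t^3) rate of Theorem 4
```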

Appendix A.10. Proof for Lemma 8

Lemma 8. The sampling error S_e in (16) can be decomposed into Bias and
Variance terms. If we define

    F̄(t) = E_{(x_1,x_2,...,x_N)} [ F̂_N^{(x_1,x_2,...,x_N)}(t) ],

then

    S_e = E_t[Variance(t) + Bias(t)] = Variance + Bias,

where

    Variance(t) = E_{(x_1,...,x_N)}[(F̂_N^{(x_1,...,x_N)}(t))²] − (E_{(x_1,...,x_N)}[F̂_N^{(x_1,...,x_N)}(t)])²,
    Bias(t) = (F̄(t) − F(t))².

Proof. According to (16),

    S_e = E_t E_{(x_1,...,x_N)}[(F̂_N^{(x_1,...,x_N)}(t))² − 2 F̂_N^{(x_1,...,x_N)}(t) F(t) + F²(t)].

Substituting F̄(t), we have

    S_e = E_t E_{(x_1,...,x_N)}[(F̂_N^{(x_1,...,x_N)}(t))² − 2 F̄(t) F(t) + F²(t) + F̄²(t) − F̄²(t)],

which means

    S_e = E_t E_{(x_1,...,x_N)}[(F̂_N^{(x_1,...,x_N)}(t))² − F̄²(t) + (F̄(t) − F(t))²].

So S_e can be rewritten with the Bias–Variance decomposition, i.e.,

    S_e = E_t[Variance(t) + Bias(t)] = Variance + Bias,

where Variance(t) and Bias(t) are defined as above.
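The decomposition can be illustrated for the empirical CDF of Logistic samples; the sample size, the number of resamples, and the grid over t below are arbitrary choices made only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N, reps = 50, 2000
dist = stats.logistic(loc=0.0, scale=1.0)
t = np.linspace(-6, 6, 201)

# empirical CDFs F_hat_N(t) over many independent samples of size N
samples = dist.rvs(size=(reps, N), random_state=rng)
F_hat = (samples[:, :, None] <= t).mean(axis=1)        # shape (reps, len(t))

F_bar = F_hat.mean(axis=0)                             # Monte Carlo estimate of E[F_hat_N(t)]
variance = (F_hat ** 2).mean(axis=0) - F_bar ** 2
bias = (F_bar - dist.cdf(t)) ** 2
print(variance.mean() + bias.mean())                   # Variance + Bias averaged over the grid of t
```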

Appendix A.11. Proof for Theorem 3

Theorem 3. (The Expectation of order statistics for the Logistic distribution)

    E[x_(i)] = B [ Σ_{k=1}^{i−1} 1/k − Σ_{k=1}^{N−i} 1/k ] + A.

Proof. As we have known,

    E[x_(i)] = N!/(B (i−1)! (N−i)!) ∫_{−∞}^{+∞} (e^{−(t−A)/B})^{N+1−i} / (1 + e^{−(t−A)/B})^{N+1} · t dt
             = N!/((i−1)! (N−i)!) ∫_{−∞}^{+∞} (e^{−g})^{N+1−i} / (1 + e^{−g})^{N+1} · (gB + A) dg,

where we define (t − A)/B = g, so that B dg = dt. We split the integral into two
terms, L_1(N, i) and L_2(N, i), to simplify. So

    E[x_(i)] = N!/((i−1)! (N−i)!) [B L_1(N, i) + A L_2(N, i)],

where

    ∫_{−∞}^{+∞} (e^{−g})^{N+1−i}/(1+e^{−g})^{N+1} (gB + A) dg
      = B ∫_{−∞}^{+∞} (e^{−g})^{N+1−i}/(1+e^{−g})^{N+1} g dg + A ∫_{−∞}^{+∞} (e^{−g})^{N+1−i}/(1+e^{−g})^{N+1} dg,

with the first integral denoted L_1(N, i) and the second L_2(N, i).

Let us calculate these two parts:


Z +W Z +∞
(e−g )N +1−i (e−g )N +1−i
L2 (N, i) = lim dg = dg.
W →+∞ −W (1 + e−g )N +1 −∞ (1 + e
−g )N +1

Because:
Z +W
(e−g )N +1−i 1 +W −g N −i (e−g )N −i +W (N − i) +W (e−g )N −i
Z Z
1
dg = (e ) d( ) = | + dg.
−W (1 + e
−g )N +1 N −W (1 + e−g )N (1 + e−g )N −W N −W (1 + e
−g )N

Noticed that when (i < N ):

(e−g )N −i +W (eW )i (eW )(N −i)


lim −g N
|−W = lim W N
− W = 0.
W →+∞ (1 + e ) W →+∞ (e + 1) (e + 1)N

Noticed that:
Z +∞ +W
(e−g ) eW
Z
1 1 1 1 1
L2 (i, i) = −g i+1
dg = lim d( −g i
) = lim [( W
)i −( W )i ] = .
−∞ (1 + e ) W →+∞ i −W (1 + e ) i W →+∞ e + 1 e +1 i

So when i < N we have:


Z +∞
(e−g )N +1−i (N − i) +∞ (e−g )N −i (N − i)
Z
L2 (N, i) = −g N +1
dg = −g N
dg = L2 (N −1, i).
−∞ (1 + e ) N −∞ (1 + e ) N

When i = N we can have:


1
L2 (N, i) = .
N
So, in summary:

(N − i) (N − i − 1) (N − i − 2) 1 (N − i)!
L2 (N, i) = .... = (i − 1)!.
N N −1 N −2 i N!

Let us consider another part L1 (N, i):


+W +∞
(e−g )N +1−i (e−g )N +1−i
Z Z
L1 (N, i) = lim gdg = gdg.
W →+∞ −W (1 + e−g )N +1 −∞ (1 + e−g )N +1

Because:
+W +W
(e−g )N +1−i
Z Z
1 1
gdg = (e−g )N −i gd( ),
−W (1 + e−g )N +1 N −W (1 + e−g )N

Which means:
Z +W Z +W Z +W
(e−g )N +1−i 1 (e−g )N −i g +W (e−g )N −i g (e−g )N −i
gdg = [ | + (N −i) dg− dg].
−W (1 + e
−g )N +1 N (1 + e−g )N −W −W (1 + e−g )N −W (1 + e
−g )N

Take limW →+∞ , notice that when (i < N ):

(e−g )N −i g +W W (eW )i W (eW )(N −i)


lim −g N
|−W = lim W N
+ = 0.
W →+∞ (1 + e ) W →+∞ (e + 1) (eW + 1)N
Noticed that L1 (N, N ) + L1 (N, 1) = 0 and L1 (1, 1) = 0, this is because L1 (1, 1)
is the expectation of Logistic(0, 1). So:
N −1 1 1 1 1 1 1
L1 (N, 1) = L1 (N −1, 1)− =− [ + + +... ].
N N (N − 1) N N −1 N −2 N −3 1
So:
N −1
1 X 1
L1 (N, N ) = [ ].
N i=1 i
In addition, we have the following general iterative expression:
i−1 N −i
N −i 1 (N − i)! X1 X1
L1 (N, i) = L1 (N −1, i)− L2 (N −1, i) = (i−1)![ − ].
N N N! k k
k=1 k=1

In conclusion, we have

    E[x_(i)] = N!/((i−1)! (N−i)!) [ B · (N−i)!/N! · (i−1)! · (Σ_{k=1}^{i−1} 1/k − Σ_{k=1}^{N−i} 1/k)
               + A · (N−i)!/N! · (i−1)! ],

which means

    E[x_(i)] = B [ Σ_{k=1}^{i−1} 1/k − Σ_{k=1}^{N−i} 1/k ] + A.
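A Monte Carlo sanity check of this closed form (with arbitrary choices of A, B, the sample size N, and the rank i):

```python
import numpy as np

rng = np.random.default_rng(4)
A, B, N, i = 0.5, 2.0, 7, 6                       # illustrative location, scale, sample size, rank

# Monte Carlo estimate of E[x_(i)] for Logistic(A, B)
samples = np.sort(rng.logistic(A, B, size=(200_000, N)), axis=1)
mc_mean = samples[:, i - 1].mean()

# Theorem 3: E[x_(i)] = B * (sum_{k=1}^{i-1} 1/k - sum_{k=1}^{N-i} 1/k) + A
harmonic = lambda m: sum(1.0 / k for k in range(1, m + 1))
closed_form = B * (harmonic(i - 1) - harmonic(N - i)) + A
print(mc_mean, closed_form)                       # the two values should agree closely
```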

Appendix B. Other experimental results

Appendix B.1. The evolving Bellman error distribution over training time

In this section, we present another three distributional images of Bellman
errors during training in different environments, under the same settings as
Figure 1, revealing a strong alignment with the Logistic distribution. These
results are presented in Figure B.12 and all of them confirm the reliability of
the Logistic distribution.

Figure B.12: The evolving distributions of Bellman error computed by (14) at different
epochs of online RL training on three environments.

Appendix B.2. The variations of Bellman error distribution in online and offline
environments

In this section, we present all distribution details to confirm the characteristics
of the Logistic distribution, whose tail is slightly shorter and head slightly
longer than those of the Gumbel distribution. In the head region of the
distribution, the Logistic distribution fits much better than the Normal and
Gumbel distributions, while in the tail region, the Logistic distribution is
superior to the Normal distribution and outperforms the Gumbel distribution in
most environments. These phenomena can be observed in Figures B.14 and B.13:
Figure B.14 provides a detailed view of the Bellman error distributions in the
offline environments, while Figure B.13 displays the detailed Bellman error
distributions in the online environments.

Figure B.13: The distribution of the Bellman error for the other online environments
during half of the training epochs.

Figure B.14: The distribution of the Bellman error for the other offline environments
during half of the training epochs.

Appendix B.3. The complete version of the goodness-of-fit tests and KS tests for
Bellman error

In this section, we explain how to conduct the KS test. The specific
procedure involves the following three steps:

step 1 Collect the data as x ∈ [x_1, x_2, ..., x_n] and compute its empirical cumulative
distribution function F*(x). Fix the distribution to be tested as F(x).

step 2 Plot the cumulative distribution function F(x) of a specific distribution
(Gumbel/Logistic/Normal) under its optimal parameters. For example, after
importing the dataset, we first obtain the optimal fitting parameters for the
three distributions through fitting, and then use the distributions corresponding
to their respective optimal fitting parameters as F(x).

step 3 Calculate the KS statistic by

    KS = max_i |F*(x_i) − F(x_i)|.

A smaller KS statistic indicates a closer similarity between the cumulative
distribution function of the data and the specified distribution function.
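A hedged sketch of this three-step procedure using scipy is given below; the file name is a placeholder, and the choice of the right-skewed Gumbel (gumbel_r) is our assumption rather than a detail fixed in the text.

```python
import numpy as np
from scipy import stats

errors = np.loadtxt("bellman_errors.txt")          # placeholder: one column of Bellman errors

results = {}
for name, dist in [("logistic", stats.logistic),
                   ("gumbel", stats.gumbel_r),
                   ("normal", stats.norm)]:
    params = dist.fit(errors)                      # step 2: best-fitting location/scale
    ks = stats.kstest(errors, dist(*params).cdf)   # step 3: KS statistic against the fitted CDF
    results[name] = ks.statistic

print(min(results, key=results.get), results)      # the smallest statistic indicates the best fit
```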

Appendix B.4. Sensitivity analysis

In this section, based on the empirical results obtained by adjusting σ, we
provide a range of σ variations and their relationship with the maximum average
reward and the final training average reward. This section serves as a supplement
to the remaining experiments in Section 8.3.

From these figures, it is evident that within a certain range of σ variations,
LLoss outperforms MSELoss in both the maximum average reward and the
average reward. This indicates that even within a small range of σ adjustments,
LLoss consistently yields superior results compared to MSELoss and exhibits a
degree of robustness. Figures B.15 and B.16 depict the variations in maximum
average reward and final average reward for different σ values in the offline
setting, with the dashed line representing the MSELoss standard for IQL.
Figures B.17 and B.18 similarly show the variations in maximum average reward
and final average reward for different σ values in the online setting, with the
dashed line representing the MSELoss standard for SAC.

Figure B.15: The relationship between the variation of σ and the maximum average
reward in offline training.

Figure B.16: The relationship between the variation of σ and the average reward in
offline training.

Figure B.17: The relationship between the variation of σ and the maximum average
reward in online training.

Figure B.18: The relationship between the variation of σ and the average reward in
online training.

