Abstract
1. Introduction
Figure 1: Evolving distributions of Bellman error (as defined by (14)) at different
epochs in the online LunarLanderContinuous-v2 environment.
vative and practically significant, as it informs the design of more effective RL
algorithms. Our research makes contributions in several key areas:
• By exploring the sampling error of the Logistic distribution using the Bias-Variance decomposition, we provide practical guidelines for optimal batch sizing in neural network training, enhancing computational efficiency.
the numerical experimental analysis and ablation study for our method. Finally,
we summarize this paper in Section 9 and discuss the directions for future work.
2. Related Work
In this section, we review recent research relevant to our work, summarized in two parts. Regarding the Bellman equation and the Bellman error, Extreme Q-Learning (XQL) [17] defines a novel sample-free objective towards optimal soft-value functions in the maximum-entropy RL setting using the Gumbel distribution; it uses the maximum likelihood function of the Gumbel distribution to avoid directly sampling the maximum entropy. This framework marks a significant departure from established practices and offers exciting prospects for advancements in RL optimization techniques.
For example, Implicit Diffusion Q-learning (IDQL) [21] uses samples from a diffusion-parameterized behavior policy to attain better results. Inverse Preference Learning [22] is proposed for learning from offline data without the need to learn the reward. PROTO [23] is proposed to overcome limitations of offline-to-online transfer and achieves superior performance. In addition, some researchers [24] expressed skepticism towards MSELoss because it is a non-convex function and employed an improved convex loss function as an alternative; however, they did not fundamentally elucidate the inadequacies of the Normal distribution behind MSELoss. Subsequent papers [25, 26, 27] made improvements in terms of convexity: the authors of [25] express skepticism about the scheme of the Bellman equation, arguing from a convexity standpoint why the Bellman equation may not be an ideal objective function, while [26] and [27] focus on further optimizing and enhancing convex objective functions, aiming to illustrate that direct optimization using MSELoss combined with the Bellman error is incorrect. However, none of them explain the inherent issues of MSELoss from a distributional perspective. Additionally, recent work [28] has highlighted the pessimistic outcomes of MSELoss on Bellman error performance for offline RL and provided a reasoned explanation from an offline perspective under the distribution of the relevant dataset.
The issue of reward scaling, which can be regarded as part of reward shaping, is also a focal point in RL, and many scholars have studied it. Some researchers [29] gave scaling rules and scaling functions within the experience. Others link the scaling problem to sampling complexity, aiming to enhance sampling efficiency by setting scaling functions [30]. In addition, some choose to bypass manual scaling and instead focus on learning the reward function [31, 32]. While these approaches shed light on the issues of reward scaling and agent performance to some extent, they do not explicitly connect reward scaling with the Bellman error objective from a distributional perspective, nor explain the reason for a saturation upper bound during scaling.
3. Preliminaries
immediate feedback. The agent is expected to learn progressively from this feedback to facilitate the optimal strategies. In contrast, the interaction between the agent and the environment is unavailable in offline training, in which case the agent learns from a large offline dataset to recognize intrinsic patterns that are expected to generalize to similar environments. Since enhancing generalizability is a harder problem in general, the performance of offline RL is significantly inferior to that of its online counterparts.
Alternatively, Soft AC (SAC) [8, 9] encompasses soft conditions in future rewards to learn the policy $\pi$ with a regularization strength $\zeta$ that maximizes:
$$\mathbb{E}_{a_t \sim \pi(a_t|s_t)}\left[\sum_{t=0}^{T} \gamma^t \big(r(s_t,a_t) - \zeta\log(\pi(a_t|s_t))\big)\right]. \tag{2}$$
The reference distribution $\mu(a|s)$ follows different sampling conventions in different types of RL to fit the behavioral policy [34]. Specifically, in online RL it is usually sampled from a uniform distribution, while in offline RL it is usually sampled from the empirical distribution of the offline training data.
3.3. (Soft) Bellman Equation
The cumulative discounted reward can be used to formulate the optimal Bellman iterative equation for Q-learning [35]. It is defined as:
For conciseness, we derive the equation from (3). The same method can be directly applied to the other two variants in (1) and (2). All these objectives rely on the Bellman iteration, which is inspired by (4), i.e.,
The analysis in this research is based on (5). However, for completeness, we also introduce other update methods. Consider the general form of the optimal Bellman iterative equation from (3), which reads:
$$Q^{k+1}(s,a) \leftarrow \arg\min_{Q}\left(r(s,a) + \mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a),\,a'\sim\pi}\left[Q(s',a') - \zeta\log\frac{\pi(a'|s')}{\mu(a'|s')}\right] - Q^k(s,a)\right)^{2}. \tag{6}$$
The corresponding solution to the Bellman iteration with respect to (3) is then:
$$Q^{k+1}(s,a) = r(s,a) + \mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a),\,a'\sim\pi}\left[Q^k(s',a') - \zeta\log\frac{\pi(a'|s')}{\mu(a'|s')}\right]. \tag{7}$$
To take the optimal strategy with the maximum $Q^k(s',a')$ in (5), the corresponding $\pi^*$ has to satisfy:
$$\pi^*(a'|s') = \arg\max_{\pi}\left(\mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a),\,a'\sim\pi}\left[Q^k(s',a') - \zeta\log\frac{\pi(a'|s')}{\mu(a'|s')}\right]\right), \tag{8}$$
where $\sum_{a'}\pi^*(a'|s') = 1$. Applying the Lagrange multiplier method [36], we obtain:
$$\mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a),\,a'\sim\pi^*}\left[Q^k(s',a') - \zeta\log\frac{\pi^*(a'|s')}{\mu(a'|s')}\right] \to \mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a)}\left[\zeta\log\sum_{a'}\mu(a'|s')\,e^{Q^k(s',a')/\zeta}\right]. \tag{10}$$
The $\max_{a'} Q^k(s',a')$ in (5) with respect to the optimal policy $\pi^*$ is:
$$\max_{a'} Q^k(s',a') = \mathbb{E}_{s'\sim\mathcal{P}(\cdot|s,a)}\left[\zeta\log\sum_{a'}\mu(a'|s')\,e^{Q^k(s',a')/\zeta}\right]. \tag{11}$$
While it is challenging to estimate the log-sum in (11), XQL [17] employs a Gumbel-regression-based approach to circumvent the need for sampling estimation.
In this section, we conduct our analyses under the most basic settings: we disregard the impact of the reward distribution from the dataset, neglect the influence of state transition probabilities, and impose a finite action space. Our purpose is to demonstrate that if we acknowledge the suitability of using the Normal distribution or the Gumbel distribution for the Bellman error under all conditions, then modeling the Bellman error with these distributions should theoretically and experimentally be interpretable in such a straightforward basic setting. However, we will show that the Bellman error no longer conforms to either of these distributions, but rather follows a biased Logistic distribution, as experimentally supported in Figure 1.
The specific structure of this section is outlined as follows. We initiate our exploration in Section 4.1 by defining the Bellman error with parameterization θ and analyzing the distribution of the Bellman error under Gumbel initialization. While the Gumbel initialization is not applied as commonly as Normal initialization in practice, we present the formulation of the Normal approximation for the Gumbel distribution in Section 4.2. This approximation allows for the substitution of Gumbel initialization in Section 4.1 with Normal initialization.
It is well known that there exists a gap between the iterated values and the true values. We now define the error between $Q^*(s,a)$ and $\hat{Q}^t(s,a)$ as $\epsilon^t(s,a)$, which means:
$$\hat{Q}^t(s,a) = Q^*(s,a) + \epsilon^t(s,a). \tag{12}$$
According to the Bellman iteration in (5), each $\hat{Q}^t(s,a)$ can be obtained through iterative updates from the initialization $\hat{Q}^0(s,a)$. We will show in Lemma 4 that the random variable $\epsilon^t(s,a)$ follows a biased Gumbel distribution under Assumptions 1-3.
While (5) is capable of updating tabular Q-values, complex environments often employ neural networks to parameterize the Q-function. We thus parameterize $\hat{Q}$ and $\epsilon$ by $\theta$. We refine the Q-function as $\hat{Q}_\theta(s,a)$ for the $(s,a)$ pair and represent the gap $\epsilon^t(s,a)$ in (12) as $\epsilon^\theta(s,a)$, which revises (12) to:
We will see the reasons for Assumption 1 in the upcoming Lemma 3. The reason for this assumption is that an infinite action space does not necessarily guarantee the effectiveness of the max operator for the Gumbel distribution. In fact, this assumption can be considered standard because it is quite common in practical problems [37, 38].
Assumption 2. There is an injective mapping $T: (s,a) \to s'$, such that the next state $s'$ is uniquely determined by the current state $s$ and action $a$.
Assumption 2 is for the convenience of our theoretical analysis, as the state transition probability $\mathcal{P}(s'|s,a)$ would otherwise complicate the analysis of the error distribution. This assumption aligns with practical scenarios, especially when disregarding state transition probabilities [39, 40]. Therefore, we can also consider this assumption standard.
Assumption 3. The initial $\hat{Q}^0(s,a)$ follows the same Gumbel distribution, independently of the $(s,a)$ pair.
We will show in Lemmas 1 and 3 that the direct way to obtain a true Gumbel distribution and preserve its type during iterations is to assume the initialization is Gumbel-distributed. Lemma 1 shows that under finite conditions, obtaining a true Gumbel distribution from other distributions is impossible.
In fact, Gumbel initialization is not common; we tend to prefer Normal initialization, so at first sight Assumption 3 is not standard but rather a strict assumption. Although the Gumbel initialization does not align with our practical understanding, in Section 4.2 we will provide a method to replace the Gumbel initialization with Normal initialization and give the standard Assumption 3∗. Hence, Assumption 3 can also be considered a standard assumption.
In summary, Assumptions 1-4 appear to be standard, and such assumptions are therefore reasonable. Based on the assumptions above, we have the associated lemmas.
The key idea of Lemma 1 is that obtaining the true Gumbel distribution using the maximum operator under a finite sample size $n$ is generally impossible. However, we can approximate the Gumbel distribution under Normal conditions. We will delve into this discussion in Section 4.2. The following Lemmas 2-3 describe the basic properties of the Gumbel distribution.
The key idea of Lemma 2 is that the Gumbel distribution maintains its distributional type under linear transformations. Lemma 3 demonstrates that a sequence of independent Gumbel distributions sharing the same scale parameter maintains the distributional type under the maximum operation.
It is worth noting that, in conjunction with Assumptions 1 and 3, Lemma 3 shows that $\gamma\max_{a'}(\hat{Q}^0(s',a')) \sim \mathrm{Gumbel}(C_1,\beta_1)$ with constants $C_1 \in \mathbb{R}, \beta_1 > 0$ that are determined by the initialization and are independent of the $(s,a)$ pair. Based on this analysis, we next propose Lemma 4 to establish the relationship between $Q^*$ and $\hat{Q}^t$ for $\epsilon^t(s,a)$ defined in (12).
Lemma 4. For $\epsilon^t(s,a)$ defined in (12), under Assumptions 1-3, we show that:
$$\epsilon^t(s,a) \sim \mathrm{Gumbel}\Big(C_t(s,a) - \gamma\max_{a'}\big(Q^*(s',a')\big),\; \beta_t\Big),$$
where
$$C_1(s,a) = C_1, \qquad C_2(s,a) = \gamma\Big(C_1(s,a) + \beta_1\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)}{\beta_1}}\Big),$$
and
$$C_t(s,a) = \gamma\,\beta_{t-1}\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_{t-1}(s',a_i)}{\beta_{t-1}}} \;\;(t\ge 3), \qquad \beta_t = \gamma^{t-1}\beta_1 \;\;(t\ge 1).$$
Remark 1. There is a special case in which the Gumbel distribution follows a simpler expression. For all $s_1, s_2$, define two sets $S_1 = [r(s_1,a_1), r(s_1,a_2), \ldots, r(s_1,a_n)]$ and $S_2 = [r(s_2,a_1), r(s_2,a_2), \ldots, r(s_2,a_n)]$. If $S_1 \,\triangle\, S_2 = \emptyset$, then the location parameters lose their dependence on $(s,a)$, with
$$C_t = \gamma\Big(C_{t-1} + \beta_{t-1}\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)}{\beta_{t-1}}}\Big) \;\;(t\ge 2), \qquad \beta_t = \gamma^{t-1}\beta_1 \;\;(t\ge 1).$$
The key idea of Lemma 4 is that under Assumptions 1-3, $\epsilon^t(s,a)$ follows a Gumbel distribution whose location parameter is associated with the $(s,a)$ pair and whose scale parameter is time-dependent, which contradicts the assumption in [17] that $\mathbb{E}[\epsilon^t(s,a)] = 0$. This suggests that the independent unbiased assumption on $\epsilon^t(s,a)$ in [17] is not adequately considered. Before delving into the new theorem for the Bellman error, we need Lemmas 5-6.
The key idea of Lemma 5 is that the difference of two Gumbel random variables with the same scale parameter follows a Logistic distribution. We will see later that it plays a crucial role in the proof of Theorem 1. It is important to note that $X + Y$ will no longer follow the Logistic distribution; however, it can be approximated by Generalized Integer Gamma (GIG) or Generalized Near-Integer Gamma (GNIG) distributions [44].
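A similar minimal sketch (placeholder locations `cx`, `cy` and shared scale `beta` are assumptions) checks Lemma 5 by simulation:

```python
from scipy import stats

cx, cy, beta, m = 1.5, -0.5, 0.8, 200_000

x = stats.gumbel_r.rvs(loc=cx, scale=beta, size=m, random_state=1)
y = stats.gumbel_r.rvs(loc=cy, scale=beta, size=m, random_state=2)

# Lemma 5 predicts X - Y ~ Logistic(cx - cy, beta); the KS statistic is tiny.
print(stats.kstest(x - y, stats.logistic(loc=cx - cy, scale=beta).cdf).statistic)
# X + Y, by contrast, is *not* Logistic (cf. the GIG/GNIG approximation).
print(stats.kstest(x + y, stats.logistic(loc=cx + cy, scale=beta).cdf).statistic)
```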
Lemma 6. If $X \sim \mathrm{Gumbel}(A,1)$, then both $\mathbb{E}[e^{-X}]$ and $\mathbb{E}[Xe^{-X}]$ are bounded:
$$\mathbb{E}[e^{-X}] < \Big(\frac{20}{e^2} + 10e^{-e^{\frac{1}{2}}} + \frac{1}{2} - \frac{1}{2e}\Big)e^{-A}.$$
When $A > 0$:
$$\mathbb{E}[Xe^{-X}] < \Big(\frac{3}{20} + A\big(\frac{20}{e^2} + 10e^{-e^{\frac{1}{2}}} + \frac{1}{2} - \frac{1}{2e}\big)\Big)e^{-A}.$$
When $A \le 0$:
$$\mathbb{E}[Xe^{-X}] < \frac{3}{20}\,e^{-A}.$$
Lemma 6 provides the bounds for $\mathbb{E}[e^{-X}]$ and $\mathbb{E}[Xe^{-X}]$; these bounds are also preparation for Theorem 1. Note that the bounds presented in Lemma 6 are upper bounds and do not represent the supremum.
Next, we present Theorem 1, which establishes the Logistic distribution of the Bellman error $\varepsilon^\theta(s,a)$ (formulated in (14)). We parameterize $C_t$ and $\beta_t$ in Lemma 4 as $C_\theta$ and $\beta_\theta$, respectively. Theorem 1 is formulated as follows: the degree of approximation can be measured by the upper bound of the KL divergence between
$$X \sim \mathrm{Gumbel}\Big(\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}},\; \beta_\theta\Big)$$
and
$$Y \sim \mathrm{Gumbel}\Big(\gamma\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}},\; \gamma\beta_\theta\Big).$$
Let $A^* = \ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}}$; we have these conclusions:
1. If $A^* > 0$, then $\mathrm{KL}(Y\|X) < \log(\frac{1}{\gamma}) + (1-\gamma)\big[A^*\big(\frac{20}{e^2}+10e^{-e^{\frac{1}{2}}}-\frac{1}{2}-\frac{1}{2e}\big)+\frac{3}{20}-v\big]$.
2. If $A^* \le 0$, then $\mathrm{KL}(Y\|X) < \log(\frac{1}{\gamma}) + (1-\gamma)\big[\frac{3}{20}-A^*-v\big]$.
3. The order of the KL divergence error is controlled at $O\big(\log(\frac{1}{1-\kappa_0})+\kappa_0 A^*\big)$.
If the upper bound of the KL divergence is sufficiently small, then $\varepsilon^\theta(s,a)$ follows the Logistic distribution, i.e.,
$$\varepsilon^\theta(s,a) \sim \mathrm{Logistic}\Big(C_\theta(s,a) - \beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}},\; \beta_\theta\Big).$$
Figure 2: (a) The upper bound of the KL divergence with respect to different $A^*$ ratios following Theorem 1. (b) Demonstration with Pendulum-v1 training: the average reward of Implicit Q-Learning, with the Bellman error modeled by Normal and Logistic distributions, respectively.
the Deep Q-Learning (DQN) to illustrate the phenomenon observed in our experiments that LLoss can assist the agent in reaching a higher average reward level during the early stages of training, thereby expediting the agent's search for the optimal solution. The results are in Figure 3. We delve further into the discussion in Section 8.
Figure 3: The DQN training results in two discrete-action environments, Lunar Lander and CartPole-v1. The Logistic loss function significantly reduces the training cost in the early stages and greatly accelerates the training process.
1 https://github.com/hungtuchen/pytorch-dqn
verge toward zero, i.e.,
$$\mathbb{E}\big[\varepsilon^\theta(s,a)\big] \to 0. \tag{15}$$
In particular, define:
$$\theta_\nu = \nu - 1, \qquad C_\nu = \Big(\frac{\Gamma(\frac{3}{\nu})}{\Gamma(\frac{1}{\nu})}\Big)^{\frac{\nu}{2}}, \qquad D_0^\nu = \frac{\nu\,(C_\nu)^{\frac{1-\nu}{\nu}}}{2\,\Gamma(\frac{1}{\nu})},$$
$$\beta_N^\nu = \Big[\frac{\theta_\nu}{\nu C_\nu}\, W_0\Big[\frac{\nu C_\nu}{\theta_\nu}\,(D_0^\nu N)^{\frac{\nu}{\theta_\nu}}\Big]\Big]^{\frac{1}{\nu}},$$
$$D_1^\nu = -\Big(1-\frac{1}{\nu}\Big)\frac{1}{C_\nu}, \qquad D_2^\nu = \Big(1-\frac{1}{\nu}\Big)\Big(2-\frac{1}{\nu}\Big)\frac{1}{(C_\nu)^2},$$
where $W_0[\cdot]$ is the real (principal) branch of the Lambert W-function, and $\Gamma(\cdot)$ denotes the Gamma function. For $\Gamma(n+\frac{1}{2})$, it is defined:
$$\Gamma\Big(n+\frac{1}{2}\Big) = \frac{(2n)!\,\sqrt{\pi}}{n!\,4^n}.$$
$$f(X)\big|_{\nu=2} = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{X^2}{2}}.$$
Assumption 3∗. The initial $\hat{Q}^0(s,a)$ follows a standard Normal distribution, independently of the $(s,a)$ pair.
In this way, Assumption 3 becomes standard. Then, with the previous Assumptions 1, 2, and 4, we can extend Lemma 4 and Theorem 1 to Normal initialization. To establish an intuitive understanding of the revised Theorem 1 and Lemma 4 under Normal initialization, we now present a toy example with a finite state space.
Example 1. Consider a scenario with five states $\{s_i\}_{i=1}^{5}$ and an action space of 5000 actions $\{a_j\}_{j=1}^{5000}$. At each state $s_i$, taking any action $a_j$ results in a reward of 1 and a deterministic transition to the next state $s_{i+1}$, with $s_5$ terminal and discount factor $\gamma = 0.99$.
Example 1 presents a finite state space that can be stored in $Q \in \mathbb{R}^{5\times5000}$. Then the Bellman equation can be uniquely optimized by
$$Q^*(s_i,:) = \sum_{k=0}^{5-i} 0.99^k,$$
where $Q(s_i,:)$ denotes the $i$-th row of $Q$ with respect to the $i$-th state.
Figure 4 verifies Lemma 4 and Theorem 1 with this toy example by visualizing the first 4 iterations of the Bellman errors. We use Normal(0, 1) random initialization for each element in the Q table and employ (5) for iterating. We define the errors as above, where $\epsilon^t(s_1,:)$ (row 1) follows a Gumbel distribution, and $\varepsilon^t(s_1,:)$ (row 2) follows a Logistic distribution. While this toy example makes many simplifications for illustrative purposes only, we will provide further validation in Section 8 on complex real-world environments, in which cases obtaining the optimal $Q^*$ values is generally impossible.
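The toy iteration can be sketched as follows — a reconstruction assuming the reward/transition conventions stated in Example 1, with the error histograms taken across independent random initializations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_states, n_actions, gamma, runs = 5, 5000, 0.99, 2000

# Assumed dynamics: every action at s_i pays reward 1 and moves to s_{i+1};
# s_5 is terminal, so Q*(s_i, :) = sum_{k=0}^{5-i} gamma^k as in Example 1.
q_star = np.array([sum(gamma**k for k in range(n_states - i)) for i in range(n_states)])

eps, bell = [], []
for _ in range(runs):
    Q0 = rng.normal(size=(n_states, n_actions))     # Normal(0, 1) initialization
    next_max = np.append(Q0.max(axis=1)[1:], 0.0)   # max_a' Q0(s_{i+1}, a'); 0 past terminal
    Q1 = 1.0 + gamma * next_max                     # one synchronous Bellman update per state
    eps.append(Q1[0] - q_star[0])                   # epsilon^1(s_1): shifted max of Normals
    bell.append(Q1[0] - (1.0 + gamma * Q1[1]))      # Bellman error: difference of two maxima

print(stats.kstest(eps, stats.gumbel_r(*stats.gumbel_r.fit(eps)).cdf).statistic)    # Gumbel-like
print(stats.kstest(bell, stats.logistic(*stats.logistic.fit(bell)).cdf).statistic)  # Logistic-like
```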
Figure 4: The distribution of $\epsilon(s,a)$ (row 1, in purple) and $\varepsilon(s,a)$ (row 2, in blue) in the first four iterations with a randomly initialized Q table. The former roughly follows Gumbel distributions, and the latter follows Logistic distributions.
Gao et al. [46] explained this problem via the policy gradient. This section presents a rational explanation of the reward scaling problem from the standpoint of the distribution of Bellman errors. We will establish a natural connection between reward scaling and the expectation of the Logistic distribution.
We start by connecting the proportional reward scaling problem and the Bellman error. Following Remark 2, we facilitate the analysis with the special case of Theorem 1, where
$$\varepsilon^\theta(s,a) \sim \mathrm{Logistic}\Big(-\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)}{\beta_\theta}},\; \beta_\theta\Big).$$
Theorem 2 (Positive scaling upper bounds under Remark 2). Denote by $r^+$ and $r^-$ the positive and negative rewards with $r > 0$ and $r < 0$, respectively. With $i_1 + i_2 + i_3 = n$, assume that:
$$\sum_{i=1}^{n} e^{\frac{r(s',a_i)}{\beta_\theta}} = \sum_{i=1}^{i_1} e^{\frac{r^+(s',a_i)}{\beta_\theta}} + \sum_{i=1}^{i_2} e^{\frac{r^-(s',a_i)}{\beta_\theta}} + i_3.$$
If it satisfies:
1. $i_1 \neq 0$,
2. $\sum_{i=1}^{i_1} e^{\frac{r^+(s',a_i)}{\beta_\theta}}\, r^+(s',a_i) + \sum_{i=1}^{i_2} e^{\frac{r^-(s',a_i)}{\beta_\theta}}\, r^-(s',a_i) < 0$,
then there exists an optimal scaling ratio $\varphi^* > 1$, such that any scaling ratio $\varphi$ that can effectively reduce the expectation of the Bellman error must satisfy $1 \le \varphi \le \varphi^*$.
that there are always positive rewards in these 5000 samples, which satisfies the first condition in Theorem 2 that $i_1 \neq 0$. Meanwhile, for $\beta_\theta = (0.5, 1, 2, 3)$,
$$\sum_{i=1}^{i_1} e^{\frac{r^+(s',a_i)}{\beta_\theta}}\, r^+(s',a_i) + \sum_{i=1}^{i_2} e^{\frac{r^-(s',a_i)}{\beta_\theta}}\, r^-(s',a_i) \approx (-88, -105, -109, -117).$$
In other words, the second condition in Theorem 2 holds for all the $\beta$s under investigation. Furthermore, the optimal $\varphi^*$ suggested in Theorem 2 can be observed from the figure: in each of the 4 scenarios in Figure 5, $\mathbb{E}[\varepsilon^\theta(s,a)]$ reaches its upper bound partway through increasing the scaling factor $\varphi$. Based on this empirical observation, we conclude that when the error variance is considerably small, a scaling ratio of $10 \sim 50$ is recommended. Note that this observation is consistent with the experimental results in [9].
Example 2 explains the existence of the upper bound on the scaling ratio and provides a distributional perspective on enhancing model performance during training.
Figure 5: The change of E[εθ (s, a)] by assigning different βs. An optimal scaling ratio
φ∗ exists in all the scenarios.
This section delves into the considerations of batch size in neural networks, building upon the theorem established in Section 4 that the Bellman error conforms to a Logistic distribution. As the direct application of tabular Q-Learning proves inadequate for complex environments, extending the established theorems, for example to training a neural network, becomes crucial for gaining practical significance. In this context, we explore the empirical choice of the batch size $N$ used in sampling Bellman errors for parameter updates. Our goal is to regulate the error bound while maintaining computational efficiency. We employ the Bias-Variance decomposition to analyze the sampling distribution and substantiate the identification of a suitable $N^*$.
Firstly, we outline the problem we are addressing in this section.
To address this issue, we first define the empirical distribution function for sampling. For $\{x_1, x_2, \ldots, x_N\}$ sampled from $\mathrm{Logistic}(A,B)$, the associated empirical distribution function for this sequence is
$$\hat{F}_N^{(x_1,x_2,\ldots,x_N)}(t) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}_{x_i \le t}.$$
Figure 6: The differences between $F(t)$ (true CDF of $\mathrm{Logistic}(0,1)$) and $\hat{F}_N(t)$ (empirical CDF) with varying sample sizes $N$.
Following Definition 2, we denote by $F(t), f(t)$ the CDF and PDF of $\mathrm{Logistic}(A,B)$ ($A$ replaces $\lambda$, and $B$ replaces $\eta$). The sampling error $S_e$ reads:
$$S_e = \mathbb{E}_t\Big[\mathbb{E}_{(x_1,x_2,\ldots,x_N)}\big[(\hat{F}_N^{(x_1,x_2,\ldots,x_N)}(t) - F(t))^2\big]\Big]. \tag{16}$$
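A Monte Carlo sketch of (16) for the batch sizes of Table 1 — the uniform grid over $t$ is an assumption standing in for the measure of $\mathbb{E}_t$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
A, B, reps = 0.0, 1.0, 2000
t = np.linspace(-8, 8, 401)                  # grid standing in for the measure of E_t
F = stats.logistic(loc=A, scale=B).cdf(t)

for N in (2, 4, 8, 16, 32, 64, 128, 256):
    x = stats.logistic.rvs(loc=A, scale=B, size=(reps, N), random_state=rng)
    F_hat = (x[:, :, None] <= t).mean(axis=1)   # empirical CDFs, one per repetition
    print(N, ((F_hat - F) ** 2).mean())         # Monte Carlo S_e; decays roughly as 1/N
```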
Lemma 8. The sampling error $S_e$ in (16) can be decomposed into Bias and Variance terms. If we define:
$$\overline{F}(t) = \mathbb{E}_{(x_1,x_2,\ldots,x_N)}\big[\hat{F}_N^{(x_1,x_2,\ldots,x_N)}(t)\big],$$
then
$$S_e = \mathbb{E}_t[\mathrm{Variance}(t) + \mathrm{Bias}(t)] = \mathrm{Variance} + \mathrm{Bias},$$
where
$$\mathrm{Variance}(t) = \mathbb{E}_{(x_1,\ldots,x_N)}\big[(\hat{F}_N^{(x_1,\ldots,x_N)}(t))^2\big] - \mathbb{E}^2_{(x_1,\ldots,x_N)}\big[\hat{F}_N^{(x_1,\ldots,x_N)}(t)\big],$$
Here $x_{(i)}$ denotes the $i$-th order statistic. To find $\mathbb{E}[x_{(i)}]$ for each $x_{(i)}$, we perform piecewise segmentation [47] on the PDF of each $x_{(i)}$, which reads
$$f_{x_{(i)}}(t) = \frac{N!}{(i-1)!(N-i)!}\,(F(t))^{i-1}\,(1-F(t))^{N-i}\,f(t).$$
Theorem 3 reveals the method for computing the expectation of order statistics under the Logistic distribution.
Theorem 3 (The expectation of order statistics for the Logistic distribution).
$$\mathbb{E}[x_{(i)}] = B\Big[\sum_{k=1}^{i-1}\frac{1}{k} - \sum_{k=1}^{N-i}\frac{1}{k}\Big] + A.$$
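A direct numerical check of Theorem 3 against simulated Logistic order statistics (the constants are placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
A, B, N, reps = 0.5, 2.0, 16, 100_000

H = lambda m: sum(1.0 / k for k in range(1, m + 1))   # harmonic number, H(0) = 0
theory = np.array([B * (H(i - 1) - H(N - i)) + A for i in range(1, N + 1)])

x = np.sort(stats.logistic.rvs(loc=A, scale=B, size=(reps, N), random_state=rng), axis=1)
print(np.abs(x.mean(axis=0) - theory).max())          # ≈ 0: simulation matches the formula
```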
where $\mathbb{E}[x_{(i)}]$ follows the definition in Theorem 3 with a fixed size $N$. We can then obtain the upper and lower limits of the integral in (17), leading to a direct
Table 1: The relationship between the sample size $N$ and sampling error $S_e$ using (17).
$N$: 2, 4, 8, 16, 32, 64, 128, 256
² We employ symbolic integration with the built-in 'int' function in MATLAB. Instead of using numerical integration techniques, we leverage the indefinite integral to achieve a direct numerical result.
As the typical choice in deep RL networks for Q-updating, MSELoss is based on the assumption that the estimation error follows a Normal distribution $\mathrm{Normal}(0,\sigma)$. MSELoss is derived from the maximum likelihood estimation function: if we sample $n$ samples from the Bellman error and treat them as $\varepsilon_i$ $(i = 1,2,\ldots,n)$, then we have this log-likelihood function for the Normal distribution:
$$\log\Big[\prod_{i=1}^{n} p(\varepsilon_i)\Big] = -n\log(\sqrt{2\pi}\,\sigma) - \sum_{i=1}^{n}\frac{\varepsilon_i^2}{2\sigma^2} \propto -\sum_{i=1}^{n}(\varepsilon_i)^2. \tag{18}$$
In Section 4.1, we deduced that the Bellman error should follow a biased Logistic distribution. Estimating the expectation of this distribution is not straightforward for a neural network. We assume the Bellman error follows $\mathrm{Logistic}(\mu,\sigma)$ and derive the associated likelihood function as a replacement for MSELoss.
We start from the PDF of $\varepsilon_i \sim \mathrm{Logistic}(\mu,\sigma)$, which reads:
$$p(\varepsilon_i) = \frac{1}{\sigma}\,\frac{e^{\frac{-\varepsilon_i+\mu}{\sigma}}}{\big(1+e^{\frac{-\varepsilon_i+\mu}{\sigma}}\big)^2}. \tag{19}$$
We have demonstrated in Figure 1 (also see Appendix B.1 for additional visualizations) that the distribution of the Bellman error evolves along training steps and exhibits a stronger fit to the Logistic distribution. Figure 7 further compares the closeness of the empirical Bellman error to the Logistic, Normal, and Gumbel distributions. In all four environments, the Logistic distribution fits the empirical Bellman error better. In addition to these visualized comparisons, numerical evaluations are provided in Tables 7-8 with Kolmogorov-Smirnov (KS) statistic magnitudes [48].
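The comparison behind Figure 7 and Tables 7-8 can be outlined as below, a sketch in which `errors` stands in for Bellman residuals collected during training:

```python
import numpy as np
from scipy import stats

def compare_fits(errors: np.ndarray) -> dict:
    """Fit Logistic, Gumbel, and Normal laws to the errors and report each
    fit's KS statistic (smaller = closer fit)."""
    dists = {"logistic": stats.logistic, "gumbel": stats.gumbel_r, "normal": stats.norm}
    return {name: stats.kstest(errors, d(*d.fit(errors)).cdf).statistic
            for name, d in dists.items()}

# Synthetic Logistic errors standing in for real training residuals.
errors = stats.logistic.rvs(loc=-0.3, scale=0.8, size=20_000, random_state=0)
print(compare_fits(errors))    # the Logistic entry should be the smallest
```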
Figure 7: The distribution of the Bellman error. For all two online and two offline environments, the Bellman errors fit the Logistic distribution better than the Gumbel and Normal distributions. More details are provided in Tables 7-8.
We present the updating method for the Q network under LLoss for Deep Q-Network (DQN) [1] and SAC in Algorithm 1; we omit the algorithm's main body and solely present the improved sections. The omitted portions of the algorithm are identical to the original DQN and SAC.
It is noteworthy that MSELoss and LLoss are strongly correlated when we take $\mu = 0$, with LLoss serving as a corrective function for MSELoss. The following Theorem 4 reveals the relationship between MSELoss and LLoss when $\varepsilon$ is sufficiently small. In Section 8, we observe that setting $\mu = 0$ surpasses the performance of MSELoss.
Algorithm 1 The updating method for the Q network in DQN and SAC.
Initialization: $Q_\theta$ with random weights, $V_\phi$ with random weights (for SAC);
Initialization: Time step $T$, total episodes $M$, learning rate $lr$;
Initialization: Location parameter $\mu$, scale parameter $\sigma$, scaling factor $h$;
for episode ← 1 to M do
    for t ← 1 to T do
        ... (these steps are the same as DQN/SAC)
        Use (14) to calculate each $\varepsilon_i$ (for DQN);
        Calculate $\varepsilon_i = r(s_i,a_i) + \gamma V_\phi(s_i') - Q_\theta(s_i,a_i)$ (for SAC);
        Update $\theta \leftarrow \theta - lr\,\nabla_\theta\,\mathrm{LLoss}(\mu,\sigma,\theta)$;
        ... (these steps are the same as DQN/SAC)
    end
    $\sigma \leftarrow \sigma \times h^{(episode+1)}$
end
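For concreteness, a hedged PyTorch rendering of the inner update of Algorithm 1 for the DQN case — the network, optimizer, and batch plumbing are assumed, and `lloss` repeats the sketch given after (19):

```python
import torch
import torch.nn.functional as F

def lloss(eps, mu=0.0, sigma=1.0):
    # Same Logistic negative log-likelihood sketched after (19).
    t = (eps - mu) / sigma
    return (t + 2.0 * F.softplus(-t)).mean()

def q_update_step(q_net, target_net, optimizer, batch, gamma, mu, sigma):
    """One LLoss-based Q update; batch fields follow standard DQN conventions."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = lloss(q_sa - target, mu=mu, sigma=sigma)   # Bellman error per (14) -> LLoss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```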
8. Experiment
Table 2: Hyperparameter settings for online training.
IQL (σ) Eval steps Train steps Expl. steps max. Step
Figure 8: The average reward of SAC, LSAC, and XQL in online training.
set the batch size to 256 and $\mu = 0$, with scaling factor $h = 0.999$ for both online and offline RL. The decision to set $\mu$ to 0 is grounded in our experimental findings, where an LLoss model with $\mu = 0$ demonstrated significantly superior performance compared to models using MSELoss. Tables 2 and 3 report the details of the initializations for online and offline training, respectively. For online RL, we validate the improvement of employing LLoss on SAC [9] and CQL [13]. Consequently, we specify the associated $\sigma$ initialization. For unspecified settings, we adhere to the default setup in [9]. Similarly, in Table 3, we report the $\sigma$ initialization for IQL [16]. One point that needs special emphasis is that the "Expl. steps" in Table 3 represent the number of task-agnostic environment steps we need to specify for the agent. All the programs are sourced from rlkit³.
3 https://github.com/rail-berkeley/rlkit
8.2. Results Analysis
4 https://github.com/haarnoja/sac
5 https://github.com/aviralkumar2907/CQL
Figure 10: The average reward of IQL and LIQL in offline training.
Offline RL. As SAC is not suitable for offline training, we conducted improved experiments based on the IQL components. We set the maximum iteration count to 500 and incorporated a variance threshold of 5 to determine convergence over 50 epochs. The reason we use so few epochs is that we greatly reduce the difficulty of each task; in other words, we set a maximum step size for the agent instead of letting it run to the optimal solution. The method of controlling variables is the same as in the online setting. Our algorithm is referred to as LIQL. Due to some dimensional discrepancies between the IQL algorithm provided by rlkit and the original IQL algorithm, we use the improvement ratio relative to the IQL baseline as the measure of algorithm performance. The change in the average reward during training is depicted in Figure 10, and relevant details are presented in Table 6. The results also indicate that our model exhibits the
Table 4: The maximum reward of online training over 10 random repetitions with
parentheses reporting the number of epochs (in hundreds) to achieve the results.
LunarLander-Continuous-v2 194.85 (900) 154.02 (1350) 211.90 (340) 221.75 (110) 156.72 (220)
HalfCheetah-v2 847.20 (1520) 739.38 (1500) 835.24 (1530) 856.14 (1300) 761.77 (1420)
Hopper-v4 628.20 (1510) 616.30 (430) 618.32 (1200) 635.09 (1100) 594.98 (140)
Walker2d-v2 427.70 (1340) 360.42 (1570) 327.14 (340) 465.08 (1270) 387.23 (1210)
HumanoidStandup-v4 15142.51 (1590) 15209.97 (1550) 13032.94 (1280) 23771.01 (760) 15487.68 (1230)
InvertedPendulum-v4 1001.00 (210) 1001.00 (280) 1001.00 (270) 1001.00 (190) 1001 (280)
InvertedDouble-Pendulum-v2 9359.82 (380) 9361.33 (510) 9360.56 (540) 9362.28 (380) 9363.40 (500)
BipedalWalker-v3 79.05 (1490) 80.11 (1560) 81.10 (1540) 82.53 (1330) 83.77 (1410)
Table 5: Average reward of online training over 10 random repetitions. The red values
indicate the enhancement of LLoss over its MSELoss counterparts.
Table 6: The average reward and enhancement ratio after offline training; all enhancement ratios are calculated relative to IQL.
conducted KS tests on the Bellman error for each environment, including the statistical $R^2$ and other statistical test parameters. The test results are presented in Tables 7 and 8. The KS test results indicate that our assumption of the Logistic distribution is more accurate than the other two distributions.
Table 7: The fitness and KS tests of Bellman-error for online RL.
Logistic Gumbel Normal Logistic Gumbel Normal Logistic Gumbel Normal Logistic Gumbel Normal
LunarLanderContinuous-v2 0.985 0.971 0.975 1.119 2.224 1.902 7.555 10.653 9.850 0.052 0.070 0.071
HalfCheetah-v2 0.991 0.990 0.989 1.344 1.425 1.549 8.282 8.405 8.888 0.026 0.047 0.033
Hopper-v4 0.989 0.985 0.981 2.697 3.793 4.807 11.734 13.912 15.661 0.067 0.073 0.085
Walker2d-v2 0.988 0.967 0.975 0.900 2.549 1.903 6.778 11.404 9.854 0.054 0.084 0.072
HumanoidStandup-v4 0.667 0.641 0.628 27.164 29.279 30.318 37.228 38.652 39.331 0.269 0.322 0.291
InvertedPendulum-v4 0.983 0.963 0.971 20.961 46.307 35.772 32.702 48.606 42.721 0.115 0.175 0.117
InvertedDoublePendulum-v4 0.999 0.981 0.998 0.249 5.623 0.324 3.959 16.938 4.063 0.021 0.079 0.023
BipedalWalker-v3 0.997 0.979 0.990 1.206 7.888 3.563 7.843 20.061 13.482 0.039 0.101 0.057
Table 8: The fitness and KS tests of Bellman-error for offline RL.
Logistic Gumbel Normal Logistic Gumbel Normal Logistic Gumbel Normal Logistic Gumbel Normal
hopper-medium-v2 0.981 0.975 0.976 10.175 13.975 13.191 29.617 34.709 33.722 0.040 0.094 0.053
walker2d-medium-v2 0.927 0.923 0.913 15.714 15.915 18.563 36.801 37.129 40.003 0.062 0.072 0.076
halfcheetah-medium-v2 0.836 0.831 0.833 9.087 9.394 9.223 12.314 12.941 12.444 0.050 0.052 0.069
halfcheetah-medium-replay-v2 0.852 0.813 0.836 12.864 16.242 14.221 33.301 37.416 35.012 0.031 0.105 0.048
walker2d-medium-replay-v2 0.950 0.908 0.937 10.514 19.256 13.262 30.101 40.743 33.812 0.075 0.149 0.093
hopper-medium-replay-v2 0.954 0.927 0.948 10.773 17.278 12.265 30.475 38.594 32.516 0.038 0.104 0.049
hopper-medium-expert-v2 0.985 0.970 0.982 8.054 17.097 10.162 26.351 38.391 29.598 0.056 0.105 0.059
walker2d-medium-expert-v2 0.981 0.959 0.973 11.186 23.807 15.584 31.053 45.302 36.653 0.067 0.138 0.075
halfcheetah-medium-expert-v2 0.919 0.869 0.913 12.722 20.568 13.719 33.117 42.108 34.391 0.036 0.098 0.045
man error and integrating the Logistic maximum likelihood function into the associated loss function, we observed enhanced training efficacy in both online and offline RL, marking a departure from the typical use of Normal or Gumbel distributions. Our theory's validity is substantiated by rigorous analysis and proofs, as well as empirical evaluations. Moreover, we naturally integrate the Bellman error distribution with the reward scaling problem and propose a sampling scheme based on this distribution for error limit control.
While we have introduced a novel avenue for improving RL optimization focusing on the Bellman error, there remain compelling future directions for exploration. For example, extending our analysis beyond the Bellman iterative equation to include soft Bellman iterations could offer further insights. The formulation of the state transition function might also benefit from a linear combination of Gumbel distributions. Moreover, exploring innovative methods for learning from an unknown biased distribution could be another promising direction, aligning with the inherently biased nature of the distribution of Bellman error.
Figure 11: The relationship between the variation of σ and the maximum average reward/average reward in 4 environments (2 online and 2 offline).
References
[4] Y.-D. Kwon, J. Choo, B. Kim, I. Yoon, Y. Gwon, S. Min, Pomo: Policy optimization with multiple optima for reinforcement learning, Advances in Neural Information Processing Systems 33 (2020) 21188-21198.
[5] A. Hottung, Y.-D. Kwon, K. Tierney, Efficient active search for combina-
torial optimization problems, arXiv:2106.05126 (2021).
[6] J. Bi, Y. Ma, J. Wang, Z. Cao, J. Chen, Y. Sun, Y. M. Chee, Learning generalizable models for vehicle routing problems via knowledge distillation, arXiv:2210.07686 (2022).
[15] J. Lyu, X. Ma, X. Li, Z. Lu, Mildly conservative q-learning for offline reinforcement learning, arXiv:2206.04745 (2022).
[16] I. Kostrikov, A. Nair, S. Levine, Offline reinforcement learning with implicit
q-learning, arXiv:2110.06169 (2021).
[18] L. Baird, Residual algorithms: Reinforcement learning with function approximation, in: Machine Learning Proceedings 1995, Elsevier, 1995, pp. 30-37.
[23] J. Li, X. Hu, H. Xu, J. Liu, X. Zhan, Y.-Q. Zhang, Proto: Iterative policy regularized offline-to-online reinforcement learning, arXiv:2305.15669 (2023).
[26] F. Lu, P. G. Mehta, S. P. Meyn, G. Neu, Convex q-learning, in: 2021
American Control Conference (ACC), IEEE, 2021, pp. 4749–4756.
[27] F. Lu, P. G. Mehta, S. P. Meyn, G. Neu, Convex analytic theory for convex q-learning, in: 2022 IEEE 61st Conference on Decision and Control (CDC), IEEE, 2022, pp. 4065-4071.
[29] M. Qian, S. Mitsch, Reward shaping from hybrid systems models in reinforcement learning, in: NASA Formal Methods Symposium, Springer, 2023, pp. 122-139.
[36] D. P. Bertsekas, Constrained optimization and Lagrange multiplier meth-
ods, Academic press, 2014.
[45] L. Zarfaty, E. Barkai, D. A. Kessler, Accurately approximating extreme
value statistics, Journal of Physics A: Mathematical and Theoretical 54 (31)
(2021) 315205.
[46] L. Gao, J. Schulman, J. Hilton, Scaling laws for reward model overoptimization, in: International Conference on Machine Learning, PMLR, 2023, pp. 10835-10866.
[49] J. Fu, A. Kumar, O. Nachum, G. Tucker, S. Levine, D4RL: Datasets for deep data-driven reinforcement learning, arXiv:2004.07219 (2020).
Appendix A. Proof for Lemmas and Theorems.
So we have:
$$P(Y < \alpha) = P(X < \alpha - C) = e^{-e^{-\frac{\alpha-(C+A)}{B}}},$$
$$P(Z < \alpha) = P\Big(X < \frac{\alpha}{D}\Big) = e^{-e^{-\frac{\frac{\alpha}{D}-A}{B}}} = e^{-e^{-\frac{\alpha-DA}{DB}}},$$
which means:
$$Y \sim \mathrm{Gumbel}(C+A,\; B), \qquad Z \sim \mathrm{Gumbel}(DA,\; DB).$$
where
$$P(X_i < A) = e^{-e^{-\frac{A-C_i}{\beta}}}.$$
Then
$$P(X_1 < A, X_2 < A, \ldots, X_n < A) = e^{-e^{-\frac{A-C_1}{\beta}}}\cdot e^{-e^{-\frac{A-C_2}{\beta}}}\cdots e^{-e^{-\frac{A-C_n}{\beta}}} = e^{-\sum_{i=1}^{n} e^{-\frac{A-C_i}{\beta}}},$$
$$P(X_1 < A, X_2 < A, \ldots, X_n < A) = e^{-e^{-\frac{A}{\beta}}\sum_{i=1}^{n} e^{\frac{C_i}{\beta}}} = e^{-e^{-\frac{A}{\beta}+\ln(\sum_{i=1}^{n} e^{\frac{C_i}{\beta}})}} = e^{-e^{-\frac{1}{\beta}\big[A-\beta\ln(\sum_{i=1}^{n} e^{\frac{C_i}{\beta}})\big]}},$$
$$P\big(\max_i(X_i) < A\big) = e^{-e^{-\frac{1}{\beta}\big[A-\beta\ln(\sum_{i=1}^{n} e^{\frac{C_i}{\beta}})\big]}},$$
so this means
$$\max_i(X_i) \sim \mathrm{Gumbel}\Big(\beta\ln\sum_{i=1}^{n} e^{\frac{C_i}{\beta}},\; \beta\Big).$$
where
$$C_1(s,a) = C_1, \qquad C_2(s,a) = \gamma\Big(C_1(s,a) + \beta_1\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)}{\beta_1}}\Big),$$
and
$$C_t(s,a) = \gamma\,\beta_{t-1}\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_{t-1}(s',a_i)}{\beta_{t-1}}} \;\;(t\ge 3), \qquad \beta_t = \gamma^{t-1}\beta_1 \;\;(t\ge 1).$$
Proof. If we use the Bellman operator during updating at the t-th iteration for
estimating from (5), we will have:
By subtracting these two equations, it can be deduced that the error $\epsilon^t(s,a)$ at the $t$-th step is:
$$\epsilon^t(s,a) = \gamma\big[\max_{a'}(\hat{Q}^{t-1}(s',a')) - \max_{a'}(Q^*(s',a'))\big].$$
$$\epsilon^1(s,a) = \gamma\max_{a'}(\hat{Q}^0(s',a')) - \gamma\max_{a'}(Q^*(s',a')).$$
$$\gamma\max_{a'}(\hat{Q}^0(s',a')) \sim \mathrm{Gumbel}(C_1,\;\beta_1).$$
Since $\gamma\max_{a'}(Q^*(s',a'))$ is a constant rather than a random variable, it does not affect the Gumbel distribution type, but it does shift the location of the Gumbel distribution. According to Lemma 2, we will have:
Let us see what happens if we replace $s$ with $s'$. By assumption, the action space $\mathcal{A}$ has finitely many elements, $\mathcal{A} = [a_1, a_2, \ldots, a_n]$, so we can enumerate all actions into a list of state-action tuples $[(s',a_1,r_1,s''_1), (s',a_2,r_2,s''_2), \ldots, (s',a_n,r_n,s''_n)]$, where $s''_i$ is obtained from $T(s',a_i)$. According to the above discussion, we have:
$$\epsilon^1(s',a_i) \sim \mathrm{Gumbel}\big(C_1 - \gamma\max_{a'}(Q^*(s''_i,a')),\; \beta_1\big).$$
Notice that $\epsilon^1(s',a_i)$ and $\epsilon^1(s',a_j)$ are independent when $i \neq j$, because there is no relationship between the two different actions $a_i$ and $a_j$. In fact, we will show in Fact 1 that for any two different $(s,a)$ pairs, the errors $\epsilon^1(s,a)$ are independent.
Fact 1: For any two different $(s,a)$ pairs, the errors $\epsilon^1(s,a)$ are independent.
This may be surprising, because the mapping $T$ establishes a relationship between $s'_k$ and $(s,a_k)$ with $s'_k = T(s,a_k)$.
Proof of Fact 1:
For any two different pairs $(s_1,a_k)$ and $(s_2,a_j)$, define $T(s_1,a_k) = s'_{1k}$ and $T(s_2,a_j) = s'_{2j}$, and notice that:
$$\epsilon^1(s_1,a_k) = \gamma\max_{a'}(\hat{Q}^0(s'_{1k},a')) - \gamma\max_{a'}(Q^*(s'_{1k},a')) \sim \mathrm{Gumbel}\big(C_1 - \gamma\max_{a'}(Q^*(s'_{1k},a')),\; \beta_1\big),$$
$$\epsilon^1(s_2,a_j) = \gamma\max_{a'}(\hat{Q}^0(s'_{2j},a')) - \gamma\max_{a'}(Q^*(s'_{2j},a')) \sim \mathrm{Gumbel}\big(C_1 - \gamma\max_{a'}(Q^*(s'_{2j},a')),\; \beta_1\big).$$
$$\epsilon^2(s,a) = \gamma\max_{a'}\big(Q^*(s',a') + \epsilon^1(s',a')\big) - \gamma\max_{a'}\big(Q^*(s',a')\big).$$
Notice that:
$$Q^*(s',a_i) = r(s',a_i) + \gamma\max_{a'}\big(Q^*(s''_i,a')\big).$$
So:
$$L_i \sim \mathrm{Gumbel}\big(r(s',a_i) + C_1,\; \beta_1\big).$$
Let $\gamma\max_i(L_i) \sim \mathrm{Gumbel}(C_2(s,a),\,\beta_2)$. Because the discount factor $\gamma$ is a positive number, according to Lemma 2:
$$C_2(s,a) = \gamma\Big(C_1 + \beta_1\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)}{\beta_1}}\Big), \qquad \beta_2 = \gamma\beta_1.$$
So:
$$\epsilon^2(s,a) \sim \mathrm{Gumbel}\big(C_2(s,a) - \gamma\max_{a'}(Q^*(s',a')),\; \beta_2\big).$$
$$\epsilon^2(s_1,a_k) = \gamma\max_{a'}\big(Q^*(s'_{1k},a') + \epsilon^1(s'_{1k},a')\big) - \gamma\max_{a'}\big(Q^*(s'_{1k},a')\big).$$
From here we can see that $\epsilon^2(s_1,a_k)$ and any $\epsilon^1(s'_{1k},a_j)$ are not independent; on the other hand:
$$\epsilon^2(s_2,a_j) = \gamma\max_{a'}\big(Q^*(s'_{2j},a') + \epsilon^1(s'_{2j},a')\big) - \gamma\max_{a'}\big(Q^*(s'_{2j},a')\big).$$
$$\epsilon^3(s,a) = \gamma\max_{a'}\big(Q^*(s',a') + \epsilon^2(s',a')\big) - \gamma\max_{a'}\big(Q^*(s',a')\big).$$
$$M_i \sim \mathrm{Gumbel}\big(Q^*(s',a_i) + C_2(s',a_i) - \gamma\max_{a'}(Q^*(s''_i,a')),\; \beta_2\big).$$
$$M_i \sim \mathrm{Gumbel}\big(r(s',a_i) + C_2(s',a_i),\; \beta_2\big).$$
Let:
$$C_3(s,a) = \gamma\,\beta_2\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_2(s',a_i)}{\beta_2}}, \qquad \beta_3 = \gamma\beta_2.$$
We will have:
$$\epsilon^3(s,a) \sim \mathrm{Gumbel}\big(C_3(s,a) - \gamma\max_{a'}(Q^*(s',a')),\; \beta_3\big),$$
where
$$C_2(s,a) = \gamma\Big(C_1 + \beta_1\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)}{\beta_1}}\Big), \qquad C_t(s,a) = \gamma\,\beta_{t-1}\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_{t-1}(s',a_i)}{\beta_{t-1}}}.$$
If for all $s_1, s_2$ we define the sets $S_1$ and $S_2$ as in Remark 1, with
$$S_1 \,\triangle\, S_2 = \emptyset,$$
then when $t = 2$:
$$C_2 = \gamma\Big(C_1 + \beta_1\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)}{\beta_1}}\Big), \qquad \beta_2 = \gamma\beta_1.$$
So:
$$\epsilon^2(s,a) \sim \mathrm{Gumbel}\big(C_2 - \gamma\max_{a'}(Q^*(s',a')),\; \beta_2\big).$$
This means this condition removes the correlation between $C_i$ and $(s,a)$ under our assumption. So:
$$\epsilon^t(s,a) \sim \mathrm{Gumbel}\big(C_t - \gamma\max_{a'}(Q^*(s',a')),\; \beta_t\big),$$
where
$$C_t = \gamma\Big(C_{t-1} + \beta_{t-1}\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)}{\beta_{t-1}}}\Big) \;\;(t\ge 2).$$
Proof. Let $p_1(X), p_2(Y)$ be the PDFs of $X, Y$, and $P_1(X), P_2(Y)$ the CDFs of $X, Y$.
$$P(X-Y<z) = P(X<Y+z) = \int_{-\infty}^{+\infty}\int_{-\infty}^{Y+z} p_1(X)\,p_2(Y)\,dX\,dY = \int_{-\infty}^{+\infty} P_1(Y+z)\,p_2(Y)\,dY.$$
So:
$$\int_{-\infty}^{+\infty} P_1(Y+z)\,p_2(Y)\,dY = \int_{-\infty}^{+\infty} \frac{1}{\beta}\, e^{-e^{-\frac{Y+z-C_X}{\beta}}}\, e^{-\big(\frac{Y-C_Y}{\beta}+e^{-\frac{Y-C_Y}{\beta}}\big)}\, dY,$$
$$\int_{-\infty}^{+\infty} \frac{1}{\beta}\, e^{-e^{-\frac{Y+z-C_X}{\beta}}}\, e^{-\big(\frac{Y-C_Y}{\beta}+e^{-\frac{Y-C_Y}{\beta}}\big)}\, dY = \frac{1}{\beta}\int_{-\infty}^{+\infty} e^{-\frac{Y-C_Y}{\beta}}\, e^{-e^{-\frac{Y-C_Y}{\beta}}\big(1+e^{\frac{C_X-C_Y-z}{\beta}}\big)}\, dY.$$
Take $U = e^{-\frac{Y-C_Y}{\beta}}$; then $dU = -\frac{1}{\beta}U\,dY$, and:
$$\frac{1}{\beta}\int_{-\infty}^{+\infty} e^{-\frac{Y-C_Y}{\beta}}\, e^{-e^{-\frac{Y-C_Y}{\beta}}\big(1+e^{\frac{C_X-C_Y-z}{\beta}}\big)}\, dY = \int_{0}^{+\infty} e^{-U\big(1+e^{\frac{C_X-C_Y-z}{\beta}}\big)}\, dU = \frac{1}{1+e^{\frac{C_X-C_Y-z}{\beta}}}.$$
So, we show that:
$$P(X-Y<z) = \frac{1}{1+e^{\frac{C_X-C_Y-z}{\beta}}} = \frac{1}{1+e^{-\frac{z-(C_X-C_Y)}{\beta}}}.$$
According to Section 3.4, we know that:
$$X - Y \sim \mathrm{Logistic}(C_X - C_Y,\; \beta).$$
For (1):
$$\mathbb{E}[e^{-X}] = \int_{-\infty}^{+\infty} e^{-X}\,p(X)\,dX = \int_{-\infty}^{+\infty} e^{-X}\cdot e^{-(X+e^{-X})}\,dX = \int_{-\infty}^{+\infty} e^{-(2X+e^{-X})}\,dX.$$
For (2):
$$\mathbb{E}[Xe^{-X}] = \int_{-\infty}^{+\infty} X\,e^{-(2X+e^{-X})}\,dX.$$
We split this integral into the parts for $X > 0$ and $X < 0$ for separate discussion.
When $X > 0$, it is easy to see that:
So we will have:
$$\int_{0}^{+\infty} e^{-(2X+e^{-X})}\,dX < \int_{0}^{+\infty} e^{-(2X+e^{-2X})}\,dX < \int_{0}^{+\infty} e^{-2X}\,dX.$$
In fact:
$$\int_{0}^{+\infty} e^{-(2X+e^{-2X})}\,dX = \frac{1}{2} - \frac{1}{2e}.$$
On the other hand, obviously:
$$\int_{0}^{+\infty} X\,e^{-(2X+e^{-X})}\,dX < \int_{0}^{+\infty} X\,e^{-(2X+e^{-2X})}\,dX.$$
In fact:
$$\int_{-\infty}^{-5} e^{-(0.1X+e^{-0.1X})}\,dX = 10\,e^{-e^{\frac{1}{2}}}.$$
For $X < 0$, note that on $[-5,0]$ the exponent $-(2X+e^{-X})$ is maximized at $X=-\ln 2$ with value $\ln 4 - 2$, so:
$$\int_{-5}^{0} e^{-(2X+e^{-X})}\,dX \le 5\,e^{\ln 4 - 2} = \frac{20}{e^2}.$$
So, we can easily observe that:
$$\int_{-\infty}^{+\infty} e^{-(2X+e^{-X})}\,dX = \int_{-\infty}^{-5} e^{-(2X+e^{-X})}\,dX + \int_{-5}^{0} e^{-(2X+e^{-X})}\,dX + \int_{0}^{+\infty} e^{-(2X+e^{-X})}\,dX.$$
So:
$$\int_{-\infty}^{+\infty} e^{-(2X+e^{-X})}\,dX < \frac{20}{e^2} + 10\,e^{-e^{\frac{1}{2}}} + \frac{1}{2} - \frac{1}{2e}.$$
$$\int_{-\infty}^{+\infty} X\,e^{-(2X+e^{-X})}\,dX < \int_{-\infty}^{0} X\,e^{-(2X+e^{-2X})}\,dX + \int_{0}^{+\infty} X\,e^{-(2X+e^{-2X})}\,dX.$$
The expectation of the Gumbel distribution is known: if $X \sim \mathrm{Gumbel}(A,B)$, then $\mathbb{E}[X] = A + vB$, where $v \approx 0.5772 < 0.6$ is the Euler-Mascheroni constant. This has already been discussed in Section 3.4. In summary:
$$\int_{-\infty}^{+\infty} e^{-(2X+e^{-X})}\,dX < \frac{20}{e^2} + 10\,e^{-e^{\frac{1}{2}}} + \frac{1}{2} - \frac{1}{2e},$$
$$\int_{-\infty}^{+\infty} X\,e^{-(2X+e^{-X})}\,dX < \Big(\frac{1}{4}\Big)v < \Big(\frac{1}{4}\Big)\Big(\frac{3}{5}\Big) = \frac{3}{20}.$$
These are the bounds when $X$ follows a $\mathrm{Gumbel}(0,1)$ distribution. Now, let us consider the case when $X$ follows a $\mathrm{Gumbel}(A,1)$ distribution. If $X \sim \mathrm{Gumbel}(A,1)$, then according to Lemma 2, $X - A \sim \mathrm{Gumbel}(0,1)$, and $\mathbb{E}[e^{-(X-A)}]$ can be bounded:
$$\mathbb{E}[e^{A-X}] < \frac{20}{e^2} + 10\,e^{-e^{\frac{1}{2}}} + \frac{1}{2} - \frac{1}{2e} \;\;\Rightarrow\;\; \mathbb{E}[e^{-X}] < \Big(\frac{20}{e^2} + 10\,e^{-e^{\frac{1}{2}}} + \frac{1}{2} - \frac{1}{2e}\Big)e^{-A}.$$
$$\mathbb{E}[(X-A)e^{A-X}] < \frac{3}{20} \;\;\Rightarrow\;\; e^{A}\,\mathbb{E}[Xe^{-X}] - Ae^{A}\,\mathbb{E}[e^{-X}] < \frac{3}{20},$$
$$e^{A}\,\mathbb{E}[Xe^{-X}] < \frac{3}{20} + Ae^{A}\,\mathbb{E}[e^{-X}].$$
So when $A > 0$:
$$e^{A}\,\mathbb{E}[Xe^{-X}] < \frac{3}{20} + A\Big(\frac{20}{e^2} + 10\,e^{-e^{\frac{1}{2}}} + \frac{1}{2} - \frac{1}{2e}\Big) \;\;\Rightarrow\;\; \mathbb{E}[Xe^{-X}] < \Big(\frac{3}{20} + A\big(\frac{20}{e^2} + 10\,e^{-e^{\frac{1}{2}}} + \frac{1}{2} - \frac{1}{2e}\big)\Big)e^{-A}.$$
But when $A < 0$, noticing that $\mathbb{E}[e^{-X}] > 0$, we have:
$$e^{A}\,\mathbb{E}[Xe^{-X}] < \frac{3}{20} \;\;\Rightarrow\;\; \mathbb{E}[Xe^{-X}] < \frac{3}{20}\,e^{-A}.$$
Appendix A.6. Proof for Theorem 1
Theorem 1 (Logistic distribution for the Bellman error). The Bellman error $\varepsilon^\theta(s,a)$ approximately follows the Logistic distribution under Assumptions 1-4. The degree of approximation can be measured by the upper bound of the KL divergence between:
$$X \sim \mathrm{Gumbel}\Big(\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}},\; \beta_\theta\Big)$$
and
$$Y \sim \mathrm{Gumbel}\Big(\gamma\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}},\; \gamma\beta_\theta\Big).$$
Let $A^* = \ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}}$; we have these conclusions:
1. If $A^* > 0$, then $\mathrm{KL}(Y\|X) < \log(\frac{1}{\gamma}) + (1-\gamma)\big[A^*\big(\frac{20}{e^2}+10e^{-e^{\frac{1}{2}}}-\frac{1}{2}-\frac{1}{2e}\big)+\frac{3}{20}-v\big]$.
2. If $A^* \le 0$, then $\mathrm{KL}(Y\|X) < \log(\frac{1}{\gamma}) + (1-\gamma)\big[\frac{3}{20}-A^*-v\big]$.
3. The order of the KL divergence error is controlled at $O\big(\log(\frac{1}{1-\kappa_0})+\kappa_0 A^*\big)$.
Proof. According to (14), we have the definition of the Bellman error under the parameterization $\theta$, where:
$$\hat{Q}_\theta(s,a) = Q^*(s,a) + \epsilon^\theta(s,a).$$
So:
$$\varepsilon^\theta(s,a) = Q^*(s,a) + \epsilon^\theta(s,a) - r(s,a) - \gamma\max_{a'}\big(\hat{Q}_\theta(s',a')\big).$$
Because:
$$Q^*(s,a) = r(s,a) + \gamma\max_{a'} Q^*(s',a').$$
So:
$$\varepsilon^\theta(s,a) = \gamma\max_{a'}\big[Q^*(s',a')\big] + \epsilon^\theta(s,a) - \gamma\max_{a'}\big[Q^*(s',a') + \epsilon^\theta(s',a')\big].$$
Notice that this equation has two parts: (1) $\gamma\max_{a'}[Q^*(s',a')] + \epsilon^\theta(s,a)$ and (2) $\gamma\max_{a'}[Q^*(s',a') + \epsilon^\theta(s',a')]$. Let us discuss them separately.
We first analyze part (1); according to Lemma 2, it is easy to see:
$$\gamma\max_{a'}\big[Q^*(s',a')\big] + \epsilon^\theta(s,a) \sim \mathrm{Gumbel}\big(C_\theta(s,a),\; \beta_\theta\big).$$
Because:
$$-\gamma\max_{a'}\big(Q^*(s''_i,a')\big) + Q^*(s',a_i) = r(s',a_i).$$
So:
In the proof of Lemma 4, the independence of the $L_i$ has already been taken into account; therefore, using Lemma 3, we know that:
$$\max_{a_i}\big[Q^*(s',a_i) + \epsilon^\theta(s',a_i)\big] \sim \mathrm{Gumbel}\Big(\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}},\; \beta_\theta\Big).$$
According to the proof of Lemma 4, $\max_{a_i}[Q^*(s',a_i)+\epsilon^\theta(s',a_i)]$ and $\gamma\max_{a'}[Q^*(s',a')]+\epsilon^\theta(s,a)$ are independent under the same parameter $\theta$. We now want to use Lemma 5; according to Lemma 2, notice that:
$$\gamma\max_{a_i}\big[Q^*(s',a_i) + \epsilon^\theta(s',a_i)\big] \sim \mathrm{Gumbel}\Big(\gamma\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}},\; \gamma\beta_\theta\Big).$$
$$\gamma\max_{a'}\big[Q^*(s',a')\big] + \epsilon^\theta(s,a) \sim \mathrm{Gumbel}\big(C_\theta(s,a),\; \beta_\theta\big).$$
Thus we cannot use Lemma 5 directly because the scale parameters are not the same even though the variables are independent, so we need to give an approximation with certain error conditions.
Assume that:
where:
$$A = \beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}}, \qquad B = \beta_\theta, \qquad A^* = \frac{A}{B} = \ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}}.$$
Let us examine the KL divergence between these two distributions; we treat the PDFs of $\mathrm{Gumbel}(A,B)$ and $\mathrm{Gumbel}(\gamma A,\gamma B)$ as $p(x)$ and $q(x)$. According to Section 3.4, we have shown that:
Using Lemma 2, we have shown that if $x \sim \mathrm{Gumbel}(\gamma A,\gamma B)$, then $x' = \frac{x}{\gamma B} \sim \mathrm{Gumbel}(\frac{A}{B},1)$, with $dx' = \frac{1}{\gamma B}\,dx$. So:
$$\mathbb{E}_{x\sim q(x)}\big[e^{-\frac{x}{\gamma B}}\big] = \int_{-\infty}^{+\infty} \frac{1}{\gamma B}\, e^{-\big(\frac{x-\gamma A}{\gamma B}+e^{-\frac{x-\gamma A}{\gamma B}}\big)}\, e^{-\frac{x}{\gamma B}}\, dx = \int_{-\infty}^{+\infty} e^{-\big(x'-\frac{A}{B}+e^{-(x'-\frac{A}{B})}\big)}\, e^{-x'}\, dx' = \mathbb{E}_{x'}\big[e^{-x'}\big].$$
$$\mathbb{E}_{x\sim q(x)}\big[e^{-\frac{x}{\gamma B}}\big] = \mathbb{E}_{x'}\big[e^{-x'}\big] < \Big(\frac{20}{e^2}+10e^{-e^{\frac{1}{2}}}+\frac{1}{2}-\frac{1}{2e}\Big)e^{-\frac{A}{B}}.$$
$$\mathbb{E}_{x\sim q(x)}\big[x\,e^{-\frac{x}{\gamma B}}\big] = \mathbb{E}_{x'}\big[\gamma B\,x'\,e^{-x'}\big] = \gamma B\,\mathbb{E}_{x'}\big[x'\,e^{-x'}\big].$$
$$\mathbb{E}_{x\sim q(x)}\big[x\,e^{-\frac{x}{\gamma B}}\big] < \Big(\frac{3}{20}+\frac{A}{B}\big(\frac{20}{e^2}+10e^{-e^{\frac{1}{2}}}+\frac{1}{2}-\frac{1}{2e}\big)\Big)e^{-\frac{A}{B}}\,\gamma B = \Big(\frac{3}{20}\gamma B+\gamma A\big(\frac{20}{e^2}+10e^{-e^{\frac{1}{2}}}+\frac{1}{2}-\frac{1}{2e}\big)\Big)e^{-\frac{A}{B}}.$$
If $A \le 0$, then:
$$\mathbb{E}_{x\sim q(x)}\big[x\,e^{-\frac{x}{\gamma B}}\big] < \frac{3\gamma B}{20}\,e^{-\frac{A}{B}}.$$
According to our assumption, this bound can be kept under a sufficiently small $\delta_0$. Let $H(\frac{1}{t}) = \mathbb{E}_{x\sim q(x)}[e^{-\frac{x}{t}}]$. Using Lagrange's mean value theorem, there exists an $l \in [\gamma, 1]$ satisfying:
$$\frac{H(\frac{1}{B}) - H(\frac{1}{\gamma B})}{\frac{1}{B} - \frac{1}{\gamma B}} = H'\Big(\frac{1}{lB}\Big).$$
Notice that $\frac{1}{B} - \frac{1}{\gamma B} < 0$. Under our assumption, we know that:
(1) $A > 0$:
$$H'\Big(\frac{1}{lB}\Big) = \mathbb{E}_{x\sim q(x)}\big[-x\,e^{-\frac{x}{lB}}\big] > -\Big(\frac{3}{20}\gamma B+\gamma A\big(\frac{20}{e^2}+10e^{-e^{\frac{1}{2}}}+\frac{1}{2}-\frac{1}{2e}\big)\Big)e^{-\frac{A}{B}},$$
so:
$$\Big(\frac{1}{B}-\frac{1}{\gamma B}\Big)H'\Big(\frac{1}{lB}\Big) < \frac{1-\gamma}{\gamma}\Big(\frac{3}{20}\gamma+\frac{\gamma A}{B}\big(\frac{20}{e^2}+10e^{-e^{\frac{1}{2}}}+\frac{1}{2}-\frac{1}{2e}\big)\Big)e^{-\frac{A}{B}}.$$
(2) $A \le 0$:
$$H'\Big(\frac{1}{lB}\Big) = \mathbb{E}_{x\sim q(x)}\big[-x\,e^{-\frac{x}{lB}}\big] > -\Big(\frac{3\gamma B}{20}\Big)e^{-\frac{A}{B}},$$
$$\Big(\frac{1}{B}-\frac{1}{\gamma B}\Big)H'\Big(\frac{1}{lB}\Big) < \frac{1-\gamma}{\gamma}\Big(\frac{3}{20}\gamma\Big)e^{-\frac{A}{B}}.$$
Thus, we can rearrange the above to obtain:
(1) $A > 0$:
$$\mathrm{KL}(q(x)\|p(x)) < \log\Big(\frac{1}{\gamma}\Big)+\frac{1}{B}\Big(\frac{\gamma-1}{\gamma}\Big)\mathbb{E}_{x\sim q(x)}[x]+\frac{1-\gamma}{\gamma}\Big(\frac{3}{20}\gamma+\frac{\gamma A}{B}\big(\frac{20}{e^2}+10e^{-e^{\frac{1}{2}}}+\frac{1}{2}-\frac{1}{2e}\big)\Big).$$
(2) $A \le 0$:
$$\mathrm{KL}(q(x)\|p(x)) < \log\Big(\frac{1}{\gamma}\Big)+\frac{1}{B}\Big(\frac{\gamma-1}{\gamma}\Big)\mathbb{E}_{x\sim q(x)}[x]+\frac{1-\gamma}{\gamma}\Big(\frac{3}{20}\gamma\Big).$$
Substituting $\mathbb{E}_{x\sim q(x)}[x] = \gamma A + v\gamma B$:
(1) $A > 0$:
$$\mathrm{KL}(q(x)\|p(x)) < \log\Big(\frac{1}{\gamma}\Big)+(\gamma-1)\Big(\frac{A}{B}+v\Big)+\frac{1-\gamma}{\gamma}\Big(\frac{3}{20}\gamma+\frac{\gamma A}{B}\big(\frac{20}{e^2}+10e^{-e^{\frac{1}{2}}}+\frac{1}{2}-\frac{1}{2e}\big)\Big).$$
(2) $A \le 0$:
$$\mathrm{KL}(q(x)\|p(x)) < \log\Big(\frac{1}{\gamma}\Big)+(\gamma-1)\Big(\frac{A}{B}+v\Big)+\frac{1-\gamma}{\gamma}\Big(\frac{3}{20}\gamma\Big).$$
Simplifying:
(1) $A > 0$:
$$\mathrm{KL}(q(x)\|p(x)) < \log\Big(\frac{1}{\gamma}\Big)+(1-\gamma)\Big[\frac{A}{B}\Big(\frac{20}{e^2}+10e^{-e^{\frac{1}{2}}}-\frac{1}{2}-\frac{1}{2e}\Big)+\frac{3}{20}-v\Big].$$
(2) $A \le 0$:
$$\mathrm{KL}(q(x)\|p(x)) < \log\Big(\frac{1}{\gamma}\Big)+(1-\gamma)\Big[\frac{3}{20}-\frac{A}{B}-v\Big].$$
Let us prove that these two upper bounds are well-defined.
(1) $A > 0$:
Let $f(A) = \log(\frac{1}{\gamma}) + (1-\gamma)\big[\frac{A}{B}\big(\frac{20}{e^2}+10e^{-e^{\frac{1}{2}}}-\frac{1}{2}-\frac{1}{2e}\big)+\frac{3}{20}-v\big]$. Obviously $f(A) > f(0)$, where:
$$f(0) = \log\Big(\frac{1}{\gamma}\Big)+(1-\gamma)\Big[\frac{3}{20}-v\Big] > \log\Big(\frac{1}{\gamma}\Big)+(\gamma-1)\Big[\frac{9}{20}\Big] = g(\gamma),$$
$$\frac{\partial g}{\partial \gamma} = -\frac{1}{\gamma}+\frac{9}{20} < 0.$$
So:
$$f(A) > f(0) > g(\gamma) > g(1) = 0.$$
(2) $A \le 0$:
Let $f(A) = \log(\frac{1}{\gamma}) + (1-\gamma)\big[\frac{3}{20}-\frac{A}{B}-v\big]$. It still holds that $f(A) \ge f(0)$, consistent with the discussion above. So:
Therefore, these two bounds are well-defined and meaningful; they indicate that the two distributions can be considered approximately identical within the KL divergence error limit. It is obvious that when $\gamma = 1$, $\mathrm{KL}(q(x)\|p(x)) = 0$.
Next, let us discuss the order of this error. As defined, $\kappa$ lies in a small neighborhood of zero with radius $\kappa_0$; then the growth order of the KL divergence $\mathrm{KL}(q(x)\|p(x))$ is:
$$O\Big(\log\Big(\frac{1}{1-\kappa_0}\Big)+\kappa_0\,\frac{A}{B}\Big).$$
Within this error control range, we consider that $\gamma$ does not affect the distribution type and coefficient magnitude, allowing us to apply Lemma 5 now.
According to Lemma 5:
$$\varepsilon^\theta(s,a) \sim \mathrm{Gumbel}\big(C_\theta(s,a),\,\beta_\theta\big) - \mathrm{Gumbel}\Big(\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}},\; \beta_\theta\Big),$$
which means:
$$\varepsilon^\theta(s,a) \sim \mathrm{Logistic}\Big(C_\theta(s,a)-\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)+C_\theta(s',a_i)}{\beta_\theta}},\; \beta_\theta\Big).$$
Appendix A.7. Proof for Theorem 2
If it satisfies
1. $i_1 \neq 0$,
2. $\sum_{i=1}^{i_1} e^{\frac{r^+(s',a_i)}{\beta_\theta}}\, r^+(s',a_i) + \sum_{i=1}^{i_2} e^{\frac{r^-(s',a_i)}{\beta_\theta}}\, r^-(s',a_i) < 0$,
then there exists an optimal scaling ratio $\varphi^* > 1$, such that any scaling ratio $\varphi$ that can effectively reduce the expectation of the Bellman error must satisfy $1 \le \varphi \le \varphi^*$.
According to condition (1), $i_1 \neq 0$, which means $-\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)}{\beta_\theta}} < 0$; so if the scaling factor $\varphi$ is effective, it should satisfy:
$$-\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{r(s',a_i)}{\beta_\theta}} \le -\beta_\theta\ln\sum_{i=1}^{n} e^{\frac{\varphi r(s',a_i)}{\beta_\theta}},$$
59
which means:
$$\sum_{i=1}^{i_1} e^{\frac{r^+(s',a_i)}{\beta_\theta}} + \sum_{i=1}^{i_2} e^{\frac{r^-(s',a_i)}{\beta_\theta}} \ge \sum_{i=1}^{i_1} e^{\frac{\varphi r^+(s',a_i)}{\beta_\theta}} + \sum_{i=1}^{i_2} e^{\frac{\varphi r^-(s',a_i)}{\beta_\theta}}.$$
Obviously $\frac{\partial G}{\partial \varphi}$ is monotonically increasing w.r.t. $\varphi$. According to our assumption, we have $\frac{\partial G}{\partial \varphi}(1) = \sum_{i=1}^{i_1} e^{\frac{r^+(s',a_i)}{\beta_\theta}}\, r^+(s',a_i) + \sum_{i=1}^{i_2} e^{\frac{r^-(s',a_i)}{\beta_\theta}}\, r^-(s',a_i) < 0$ and $\lim_{x\to\infty}\frac{\partial G}{\partial \varphi}(x) > 0$. By the intermediate value theorem, there exists a $\varphi^* \in (1,+\infty)$ satisfying $\frac{\partial G}{\partial \varphi}(\varphi^*) = 0$, and when $1 \le \varphi \le \varphi^*$, $G(1) \ge G(\varphi)$, which means:
$$G(1) = \sum_{i=1}^{i_1} e^{\frac{r^+(s',a_i)}{\beta_\theta}} + \sum_{i=1}^{i_2} e^{\frac{r^-(s',a_i)}{\beta_\theta}} \ge \sum_{i=1}^{i_1} e^{\frac{\varphi r^+(s',a_i)}{\beta_\theta}} + \sum_{i=1}^{i_2} e^{\frac{\varphi r^-(s',a_i)}{\beta_\theta}} = G(\varphi).$$
Because:
$$\pi^* = \arg\max_{\pi}\Big[\sum_{s',a'} \mathcal{P}(s'|s,a)\,\pi(a'|s')\Big(Q^k(s',a') - \zeta\log\frac{\pi(a'|s')}{\mu(a'|s')}\Big)\Big].$$
Take
$$\frac{\partial f}{\partial L} = 0, \qquad \frac{\partial f}{\partial \pi} = 0.$$
We have:
$$\mu(a|s)\,e^{\frac{Q^k(s,a)+L-1}{\zeta}} = \pi(a|s).$$
$$\log\Big(e^{\frac{L-1}{\zeta}}\sum_a \mu(a|s)\,e^{\frac{Q^k(s,a)}{\zeta}}\Big) = \log(1) = 0 \;\;\Rightarrow\;\; L = 1 - \zeta\log\sum_a \mu(a|s)\,e^{\frac{Q^k(s,a)}{\zeta}}.$$
Proof. Let $\frac{\varepsilon}{\sigma} = t$; then:
$$\mathrm{MSELoss} = \frac{1}{2}t^2, \qquad \mathrm{LLoss}(0,\sigma) = t + 2\log(1+e^{-t}).$$
$$\mathrm{LLoss} = t + 2\Big[\ln 2 - \frac{1}{2}t + \frac{1}{8}t^2 + o(t^3)\Big] = \ln 4 + \frac{1}{4}t^2 + o(t^3) = \ln 4 + \frac{1}{2}\mathrm{MSELoss} + o(t^3).$$
Lemma 8. The sampling error $S_e$ in (16) can be decomposed into Bias and Variance terms. If we define:
$$\overline{F}(t) = \mathbb{E}_{(x_1,x_2,\ldots,x_N)}\big[\hat{F}_N^{(x_1,x_2,\ldots,x_N)}(t)\big],$$
then
$$S_e = \mathbb{E}_t[\mathrm{Variance}(t) + \mathrm{Bias}(t)] = \mathrm{Variance} + \mathrm{Bias},$$
where
$$\mathrm{Variance}(t) = \mathbb{E}_{(x_1,\ldots,x_N)}\big[(\hat{F}_N^{(x_1,\ldots,x_N)}(t))^2\big] - \mathbb{E}^2_{(x_1,\ldots,x_N)}\big[\hat{F}_N^{(x_1,\ldots,x_N)}(t)\big],$$
Proof. According to (16):
$$S_e = \mathbb{E}_t\,\mathbb{E}_{(x_1,\ldots,x_N)}\big[(\hat{F}_N(t))^2 - 2\hat{F}_N(t)F(t) + F^2(t) + \overline{F}^2(t) - \overline{F}^2(t)\big],$$
which means:
$$S_e = \mathbb{E}_t\Big[\mathbb{E}_{(x_1,\ldots,x_N)}\big[(\hat{F}_N(t))^2\big] - \overline{F}^2(t) + \big(\overline{F}(t) - F(t)\big)^2\Big],$$
where
$$\mathrm{Variance}(t) = \mathbb{E}_{(x_1,\ldots,x_N)}\big[(\hat{F}_N(t))^2\big] - \Big(\mathbb{E}_{(x_1,\ldots,x_N)}\big[\hat{F}_N(t)\big]\Big)^2,$$
$$\mathbb{E}[x_{(i)}] = \frac{N!}{(i-1)!(N-i)!}\big[B\,L_1(N,i) + A\,L_2(N,i)\big].$$
where:
$$\int_{-\infty}^{+\infty} \frac{(e^{-g})^{N+1-i}}{(1+e^{-g})^{N+1}}(gB+A)\,dg = B\underbrace{\int_{-\infty}^{+\infty} \frac{(e^{-g})^{N+1-i}}{(1+e^{-g})^{N+1}}\,g\,dg}_{L_1(N,i)} + A\underbrace{\int_{-\infty}^{+\infty} \frac{(e^{-g})^{N+1-i}}{(1+e^{-g})^{N+1}}\,dg}_{L_2(N,i)}.$$
Because:
$$\int_{-W}^{+W} \frac{(e^{-g})^{N+1-i}}{(1+e^{-g})^{N+1}}\,dg = \frac{1}{N}\int_{-W}^{+W} (e^{-g})^{N-i}\,d\Big(\frac{1}{(1+e^{-g})^{N}}\Big) = \frac{1}{N}\,\frac{(e^{-g})^{N-i}}{(1+e^{-g})^{N}}\Big|_{-W}^{+W} + \frac{N-i}{N}\int_{-W}^{+W} \frac{(e^{-g})^{N-i}}{(1+e^{-g})^{N}}\,dg.$$
Notice that:
$$L_2(i,i) = \int_{-\infty}^{+\infty} \frac{e^{-g}}{(1+e^{-g})^{i+1}}\,dg = \lim_{W\to+\infty}\frac{1}{i}\int_{-W}^{+W} d\Big(\frac{1}{(1+e^{-g})^{i}}\Big) = \lim_{W\to+\infty}\frac{1}{i}\Big[\Big(\frac{e^{W}}{e^{W}+1}\Big)^i - \Big(\frac{1}{e^{W}+1}\Big)^i\Big] = \frac{1}{i}.$$
$$L_2(N,i) = \frac{N-i}{N}\cdot\frac{N-i-1}{N-1}\cdot\frac{N-i-2}{N-2}\cdots\frac{1}{i} = \frac{(N-i)!}{N!}\,(i-1)!.$$
Because:
$$\int_{-W}^{+W} \frac{(e^{-g})^{N+1-i}}{(1+e^{-g})^{N+1}}\,g\,dg = \frac{1}{N}\int_{-W}^{+W} (e^{-g})^{N-i}\,g\,d\Big(\frac{1}{(1+e^{-g})^{N}}\Big),$$
which means:
$$\int_{-W}^{+W} \frac{(e^{-g})^{N+1-i}}{(1+e^{-g})^{N+1}}\,g\,dg = \frac{1}{N}\Big[\frac{(e^{-g})^{N-i}\,g}{(1+e^{-g})^{N}}\Big|_{-W}^{+W} + (N-i)\int_{-W}^{+W} \frac{(e^{-g})^{N-i}\,g}{(1+e^{-g})^{N}}\,dg - \int_{-W}^{+W} \frac{(e^{-g})^{N-i}}{(1+e^{-g})^{N}}\,dg\Big].$$
In conclusion, we have:
$$\mathbb{E}[x_{(i)}] = \frac{N!}{(i-1)!(N-i)!}\Big[B\,\frac{(N-i)!}{N!}(i-1)!\Big(\sum_{k=1}^{i-1}\frac{1}{k} - \sum_{k=1}^{N-i}\frac{1}{k}\Big) + A\,\frac{(N-i)!}{N!}(i-1)!\Big],$$
which means:
$$\mathbb{E}[x_{(i)}] = B\Big[\sum_{k=1}^{i-1}\frac{1}{k} - \sum_{k=1}^{N-i}\frac{1}{k}\Big] + A.$$
Appendix B.1. The evolving Bellman error distribution over training time
Figure B.12: The evolving distributions of Bellman error computed by (14) at different
epochs of online RL training on three environments.
Appendix B.2. The variations of Bellman error distribution in online and of-
fline environments.
In this section, we present all distribution details to confirm the characteristics of the Logistic distribution, which is slightly shorter in the tail and slightly longer in the head compared to the Gumbel distribution. In the head region of the distribution, the Logistic distribution fits much better than the Normal and Gumbel distributions, while in the tail region, the Logistic distribution is superior to the Normal distribution and generally outperforms the Gumbel distribution in most environments. These phenomena can be observed in Figures B.14 and B.13: Figure B.14 provides a detailed view of the Bellman error distribution in the offline environments, while Figure B.13 displays the detailed Bellman error distributions in the online environments.
Figure B.13: The distribution of the Bellman error for the other online environments during half of the training epochs.
Figure B.14: The distribution of the Bellman error for the other offline environments during half of the training epochs.
Appendix B.3. The complete version of the goodness-of-fit tests and KS tests for the Bellman error
In this section, we explain how to conduct the KS test. The specific procedure involves the following three steps:
Step 1: Collect the data as $x \in [x_1, x_2, \ldots, x_n]$ and compute its cumulative distribution function as $F^*(x)$. Fix the distribution to be tested as $F(x)$.
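The remaining steps, lost above, amount to evaluating $D = \sup_x |F^*(x) - F(x)|$; a minimal sketch (which also matches what `scipy.stats.kstest` reports):

```python
import numpy as np
from scipy import stats

def ks_statistic(x, cdf):
    """D = sup |F*(x) - F(x)|, checked just before and after each ECDF jump."""
    x = np.sort(np.asarray(x))
    n = len(x)
    F = cdf(x)
    return max(np.max(np.arange(1, n + 1) / n - F),   # after each jump
               np.max(F - np.arange(0, n) / n))       # before each jump

x = stats.logistic.rvs(size=5000, random_state=0)
print(ks_statistic(x, stats.logistic.cdf))            # agrees with scipy.stats.kstest
print(stats.kstest(x, stats.logistic.cdf).statistic)
```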
and final average reward for different σ values in the online setting, with the
dashed line representing the MSELoss standard for SAC.
Figure B.15: The relationship between the variation of σ and the maximum average
reward in offline training.
Figure B.16: The relationship between the variation of σ and the average reward in
offline training.
Figure B.17: The relationship between the variation of σ and the maximum average
reward in online training.
Figure B.18: The relationship between the variation of σ and the average reward in
online training.