Continuous-Time q-Learning
Jia Zhou - JMLR 2023
Abstract
We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) un-
der the entropy-regularized, exploratory diffusion process formulation introduced by Wang
et al. (2020). As the conventional (big) Q-function collapses in continuous time, we con-
sider its first-order approximation and coin the term “(little) q-function”. This function
is related to the instantaneous advantage rate function as well as the Hamiltonian. We
develop a “q-learning” theory around the q-function that is independent of time discretiza-
tion. Given a stochastic policy, we jointly characterize the associated q-function and value
function by martingale conditions of certain stochastic processes, in both on-policy and
off-policy settings. We then apply the theory to devise different actor–critic algorithms
for solving underlying RL problems, depending on whether or not the density function
of the Gibbs measure generated from the q-function can be computed explicitly. One of
our algorithms interprets the well-known Q-learning algorithm SARSA, and another re-
covers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou
(2022b). Finally, we conduct simulation experiments to compare the performance of our
algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized
conventional Q-learning algorithms.
Keywords: continuous-time reinforcement learning, policy improvement, q-function,
martingale, on-policy and off-policy
1 Introduction
A recent series of papers, Wang et al. (2020) and Jia and Zhou (2022a,b), aim at laying
an overarching theoretical foundation for reinforcement learning (RL) in continuous time
with continuous state space (diffusion processes) and possibly continuous action space.
Specifically, Wang et al. (2020) study how to explore strategically (instead of blindly) by
formulating an entropy-regularized, distribution-valued stochastic control problem for dif-
fusion processes. Jia and Zhou (2022a) solve the policy evaluation (PE) problem, namely
to learn the value function of a given stochastic policy, by characterizing it as a martingale
problem. Jia and Zhou (2022b) investigate the policy gradient (PG) problem, that is, to
determine the gradient of a learned value function with respect to the current policy, and
show that PG is mathematically a simpler PE problem and thus solvable by the martingale
approach developed in Jia and Zhou (2022a). Combining these theoretical results naturally
leads to various online and offline actor–critic (AC) algorithms for general model-free (up to
the diffusion dynamics) RL tasks, where one learns value functions and stochastic policies
simultaneously and alternatingly. Many of these algorithms recover and/or interpret some
well-known existing RL algorithms for Markov decision processes (MDPs) that were often
put forward in a heuristic and ad hoc manner.
PG updates a policy along the gradient ascent direction to improve it; so PG is an
instance of the general policy improvement (PI) approach. On the other hand, one of the
earliest and most popular methods for PI is Q-learning (Watkins, 1989; Watkins and Dayan,
1992), whose key ingredient is to learn the Q-function, a function of state and action. The
learned Q-function is then maximized over actions at each state to achieve improvement
of the current policy. The resulting Q-learning algorithms, such as the Q-table, SARSA
and DQN (deep Q-network), are widely used for their simplicity and effectiveness in many
applications.1 Most importantly though, compared with PG, one of the key advantages of
using Q-learning is that it works both on-policy and off-policy (Sutton and Barto, 2018,
Chapter 6).
The Q-function, by definition, is a function of the current state and action, assuming that
the agent takes a particular action at the current time and follows through a given control
policy starting from the next time step. Therefore, it is intrinsically a notion in discrete
time; that is why Q-learning has been predominantly studied for discrete-time MDPs. In a
continuous-time setting, the Q-function collapses to the value function that is independent
of action and hence cannot be used to rank and choose the current actions. Indeed, Tallec
et al. (2019) opine that “there is no Q-function in continuous time”. On the other hand,
one may propose discretizing time to obtain a discretized Q-function and then apply the
existing Q-learning algorithms. However, Tallec et al. (2019) show experimentally that this
approach is very sensitive to time discretization and performs poorly with small time steps.
Kim et al. (2021) take a different approach: they include the action as a state variable in the
continuous-time system by restricting the action process to be absolutely continuous in time
with a bounded growth rate. Thus Q-learning becomes a policy evaluation problem with
the state–action pair as the new state variable. However, they consider only deterministic
dynamic systems and discretize upfront the continuous-time problem. Crucially, that the
action must be absolutely continuous is impractically restrictive, because optimal actions
are often only measurable in time and have unbounded variations (such as the bang–bang
controls) even for deterministic systems.
As a matter of fact, the aforementioned series of papers, Wang et al. (2020) and Jia
and Zhou (2022a,b), are characterized by carrying out all the theoretical analysis within
the continuous setting and taking observations at discrete times (for computing the total
reward over time) only when implementing the algorithms. Some advantages of this
approach, compared with discretizing time upfront and then applying existing MDP
results, are discussed extensively in Doya (2000); Wang et al. (2020); Jia and Zhou (2022a,b).
More importantly, the pure continuous-time approach minimizes or even eliminates the impact
of time-discretization on learning, which, as discussed above, is especially critical for
continuous-time Q-learning.

1. Q-learning is typically combined with various forms of function approximation, for example, linear
function approximation (Melo et al., 2008; Zou et al., 2019), kernel-based nearest-neighbor regression
(Shah and Xie, 2018), or deep neural networks (Mnih et al., 2015; Fan et al., 2020). Nor is Q-learning
necessarily restricted to finite action spaces in the literature; for example, Q-learning with continuous
action spaces is studied in Gu et al. (2016). However, Duan et al. (2016) report that the standard
Q-learning algorithm may become less efficient for continuous action spaces.
Now, if we are to study Q-learning strictly within the continuous-time framework with-
out time-discretization or action restrictions, then the first question is what a proper Q-
function should be. When time is discretized, Baird (1993) and Mnih et al. (2016) define
the so-called advantage function that is the difference in value between taking a particular
action versus following a given policy at any state; that is, the difference between the state–
action value and the state value. Baird (1993, 1994) and Tallec et al. (2019) notice that such
an advantage function can be properly scaled with respect to the size of the time discretization,
so that its continuous-time limit, called the (instantaneous) advantage rate function, exists
and can be learned. Advantage updating performs better than conventional Q-learning,
as shown numerically in Tallec et al. (2019). However, these papers again consider only
deterministic dynamic systems and do not fully develop a theory of this rate function. Here
in the present paper, we attack the problem directly from the continuous-time perspective.
In particular, we rename the advantage rate function as the (little) q-function.2 This q-
function in the finite-time episodic setting is a function of the time–state–action triple under
a given policy, analogous to the conventional Q-function. However, it is a continuous-time
notion because it does not depend on time-discretization. This feature is a vital advantage
for learning algorithm design as it avoids the sensitivity with respect to the observation and
intervention frequency (the step size of time-discretization).
For the entropy-regularized, exploratory setting for diffusion processes (first formulated
in Wang et al. 2020) studied in the paper, the q-function of a given policy turns out to
be the Hamiltonian that includes the infinitesimal generator of the dynamic system and
the instantaneous reward function (see Yong and Zhou 1999 for details), plus the temporal
dispersion that consists of the time-derivative of the value function and the depreciation
from discounting. The paper aims to develop a comprehensive q-learning theory around
this q-function and accordingly design alternative RL algorithms other than the PG-based
AC algorithms in Jia and Zhou (2022b). There are three main questions. The first is to
characterize the q-function. For discrete-time MDPs, the PE method is used to learn both
the value function and the Q-function because both functions are characterized by similar
Bellman equations. By contrast, in the continuous-time diffusion setting, a Bellman equa-
tion is only available for the value function (also known as the Feynman–Kac formula, which
yields that the value function solves a second-order linear partial differential equation). The
second question is to establish a policy improvement theorem based on the q-function and
to design corresponding AC algorithms in a sample/data-driven, model-free fashion. The
third question is whether the capability of both on- and off-policy learning, a key advantage
of Q-learning, is intact in the continuous-time setting.
2. We use the lower case letter “q” to denote this rate function, much in the same way as the typical use
of the lower case letter f to denote a probability density function which is the first-order derivative of
the cumulative distribution function usually denoted by the upper case letter F .
We provide complete answers to all these questions. First, given a stochastic policy
along with its (learned) value function, the corresponding q-function is characterized by
the martingale condition of a certain stochastic process with respect to an enlarged filtra-
tion (information field) that includes both the original environmental noise and the action
randomization. Alternatively, the value function and the q-function can be jointly deter-
mined by the martingality of the same process. These results suggest that the martingale
perspective for PE and PG in Jia and Zhou (2022a,b) continues to work for q-learning. In
particular, we devise several algorithms to learn these functions in the same way as the
martingale conditions are employed to generate PE and PG algorithms in Jia and Zhou
(2022a,b). Interestingly, a temporal–difference (TD) learning algorithm designed from this
perspective recovers and, hence, interprets a version of the well-known SARSA algorithm in
the discrete-time Q-learning literature. Moreover, inspired by these continuous-time mar-
tingale characterizations, in Appendix A we present martingale conditions for Q-learning
of discrete-time MDPs, which, to the best of our knowledge, are new. This demonstrates that the
continuous-time RL study may offer new perspectives even in discrete time. Second, we
prove that a Gibbs sampler with a properly scaled current q-function (independent of time
discretization) as the exponent of its density function improves upon the current policy.
This PI result translates into a policy-updating algorithm when the normalizing constant
of the Gibbs measure can be explicitly computed. Otherwise, if the normalizing constant
is unavailable, we prove a stronger PI theorem in terms of the Kullback–Leibler (KL) di-
vergence, analogous to a result for discrete-time MDPs in Haarnoja et al. (2018a). One
of the algorithms derived from this theorem happens to recover a PG-based AC algorithm in Jia
and Zhou (2022b). Finally, we prove that the aforementioned martingale conditions hold
for both the on-policy and off-policy settings. More precisely, the value function and the
q-function associated with a given target policy can be learned based on a dataset generated
by either the target policy itself or an arbitrary behavior policy. So, q-learning indeed works
off-policy as well.
The rest of the paper is organized as follows. In Section 2 we review Wang et al.
(2020)’s entropy-regularized, exploratory formulation for RL in continuous time and space,
and present some useful preliminary results. In Section 3, we establish the q-learning theory,
including the motivation and definition of the q-function and its on-/off-policy martingale
characterizations. q-learning algorithms are developed, depending on whether or not the
normalizing constant of the Gibbs measure is available, in Section 4 and Section 5, respec-
tively. Extension to ergodic tasks is presented in Section 6. In Section 7, we illustrate the
proposed algorithms and compare them with PG-based algorithms and time-discretized,
conventional Q-learning algorithms on two specific examples with simulated data.3 Finally,
Section 8 concludes. The Appendix contains various supplementary materials and proofs
of statements in the main text.
Let S^k_{++} denote all the k × k symmetric and positive definite matrices. Given two matrices A and B
of the same size, denote by A ◦ B their inner product, by |A| the Euclidean/Frobenius norm
of A, by det A the determinant when A is a square matrix, and by A⊤ the transpose of any
matrix A. For A ∈ S^k_{++}, we write √A = U D^{1/2} U⊤, where A = U D U⊤ is its eigenvalue
decomposition with U an orthogonal matrix, D a diagonal matrix, and D^{1/2} the diagonal
matrix whose entries are the square roots of those of D. We use f = f (·) to denote the
function f , and f (x) to denote the function value of f at x. We denote by N (µ, Σ) the
probability density function of the multivariate normal distribution with mean vector µ and
covariance matrix Σ. Finally, for any stochastic process X = {Xs , s ≥ 0}, we denote by
{FsX }s≥0 the natural filtration generated by X.
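As a quick illustration of the matrix square-root convention above, here is a minimal NumPy sketch (the matrix A is an arbitrary illustrative choice, not from the paper):

```python
import numpy as np

# Sketch of the convention √A = U D^{1/2} Uᵀ for A ∈ S^k_{++}, using the
# eigendecomposition A = U D Uᵀ of a symmetric positive definite matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, U = np.linalg.eigh(A)                  # A = U @ diag(eigvals) @ U.T
sqrt_A = U @ np.diag(np.sqrt(eigvals)) @ U.T

# Sanity checks: √A is symmetric and squares back to A.
assert np.allclose(sqrt_A @ sqrt_A, A)
assert np.allclose(sqrt_A, sqrt_A.T)
```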
where r is an (instantaneous) reward function (i.e., the expected rate of reward conditioned
on time, state, and action), h is the lump-sum reward function applied at the end of the
planning period T , and β ≥ 0 is a constant discount factor that measures the time-value of
the payoff or the impatience level of the agent.
Note that in the above, the state process X^a = {X^a_s, t ≤ s ≤ T} also depends on (t, x).
However, to ease notation, here and henceforth we use X^a instead of X^{t,x,a} = {X^{t,x,a}_s, t ≤
s ≤ T} to denote the solution to SDE (1) with initial condition X^a_t = x whenever no
ambiguity may arise.
The (generalized) Hamiltonian is a function H : [0, T ] × ℝ^d × A × ℝ^d × ℝ^{d×d} → ℝ
associated with problem (1)–(2), defined as (Yong and Zhou, 1999):

$$H(t, x, a, p, q) = b(t, x, a) \circ p + \frac{1}{2}\,\sigma\sigma^\top(t, x, a) \circ q + r(t, x, a). \tag{3}$$
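For concreteness, a minimal numerical sketch of (3) for a hypothetical scalar (d = 1) model, in which the Frobenius inner product reduces to ordinary multiplication; the drift, volatility, and reward below are illustrative choices, not from the paper:

```python
import numpy as np

def hamiltonian(t, x, a, p, q, b, sigma, r):
    """H(t, x, a, p, q) = b∘p + (1/2) σσᵀ∘q + r, specialized to d = 1."""
    return b(t, x, a) * p + 0.5 * sigma(t, x, a) ** 2 * q + r(t, x, a)

# Hypothetical model: linear drift, constant volatility, quadratic running reward.
b = lambda t, x, a: a - x
sigma = lambda t, x, a: 0.5
r = lambda t, x, a: -(x ** 2) - 0.1 * a ** 2

H = hamiltonian(0.0, 1.0, 2.0, p=1.0, q=-2.0, b=b, sigma=sigma, r=r)
# b∘p = 1.0, (1/2)σ²q = -0.25, r = -1.4, so H = -0.65
```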
We make the same assumptions as in Jia and Zhou (2022b) to ensure theoretically the
well-posedness of the stochastic control problem (1)–(2).
Assumption 1 The following conditions for the state dynamics and reward functions hold
true:
(ii) b, σ are uniformly Lipschitz continuous in x, i.e., for φ ∈ {b, σ}, there exists a
constant C > 0 such that

$$|\varphi(t, x, a) - \varphi(t, x', a)| \le C|x - x'|, \quad \forall t \in [0, T],\ x, x' \in \mathbb{R}^d,\ a \in A;$$
(iii) b, σ have linear growth in x, i.e., for φ ∈ {b, σ}, there exists a constant C > 0 such
that
|φ(t, x, a)| ≤ C(1 + |x|), ∀(t, x, a) ∈ [0, T ] × Rd × A;
(iv) r and h have polynomial growth in (x, a) and x respectively, i.e., there exist constants
C > 0 and µ ≥ 1 such that
|r(t, x, a)| ≤ C(1 + |x|^µ + |a|^µ), |h(x)| ≤ C(1 + |x|^µ), ∀(t, x, a) ∈ [0, T ] × ℝ^d × A.
Classical model-based stochastic control theory has been well developed (e.g., Fleming
and Soner 2006 and Yong and Zhou 1999) to solve the above problem, assuming that the
functional forms of b, σ, r, h are all given and known. A centerpiece of the theory is the
Hamilton–Jacobi–Bellman (HJB) equation
$$\frac{\partial V^*}{\partial t}(t, x) + \sup_{a \in A} H\Big(t, x, a, \frac{\partial V^*}{\partial x}(t, x), \frac{\partial^2 V^*}{\partial x^2}(t, x)\Big) - \beta V^*(t, x) = 0, \quad V^*(T, x) = h(x). \tag{4}$$

Here, $\frac{\partial V^*}{\partial x} \in \mathbb{R}^d$ is the gradient and $\frac{\partial^2 V^*}{\partial x^2} \in \mathbb{R}^{d \times d}$ is the Hessian matrix.

Under proper conditions, the unique solution (possibly in the sense of viscosity solution)
to (4) is the optimal value function of the problem (1)–(2), i.e.,

$$V^*(t, x) = \sup_{\{a_s,\, t \le s \le T\}} \mathbb{E}^{P^W}\Big[\int_t^T e^{-\beta(s-t)} r(s, X^a_s, a_s)\,\mathrm{d}s + e^{-\beta(T-t)} h(X^a_T)\,\Big|\, X^a_t = x\Big].$$

Moreover,

$$a^*(t, x) = \operatorname*{arg\,sup}_{a \in A} H\Big(t, x, a, \frac{\partial V^*}{\partial x}(t, x), \frac{\partial^2 V^*}{\partial x^2}(t, x)\Big) \tag{5}$$
is the optimal (non-stationary, feedback) control policy of the problem. In view of the
definition of the Hamiltonian (3), the maximization in (5) indicates that, at any given time
and state, one ought to maximize a suitably weighted average of the (myopic) instantaneous
reward and the risk-adjusted, (long-term) positive impact on the system dynamics. See
Yong and Zhou (1999) for a detailed account of this theory and many discussions on its
economic interpretations and implications.
4. This procedure applies to both the offline and online settings. In the former, the agent can repeatedly
try different sequences of actions over the full time period [0, T ], record the corresponding state processes
and payoffs, and then learn and update the policy based on the resulting dataset. In the latter, the agent
updates the actions as she goes, based on all the up-to-date historical observations.
5. Here we assume that the action space A is continuous and randomization is restricted to those distribu-
tions that have density functions. The analysis and results of this paper can be easily extended to the
cases of discrete action spaces and/or randomization with probability mass functions.
6. Here, X^π also depends on the specific copy a^π sampled from π; however, to ease notation we write X^π
instead of X^{π,a^π}.
where EP is the expectation with respect to both the Brownian motion and the action
randomization, and γ ≥ 0 is a given weighting parameter on exploration also known as the
temperature parameter.
Wang et al. (2020) consider the following SDE
$$\mathrm{d}X_s = \tilde b\big(s, X_s, \pi(\cdot|s, X_s)\big)\,\mathrm{d}s + \tilde\sigma\big(s, X_s, \pi(\cdot|s, X_s)\big)\,\mathrm{d}W_s, \quad s \in [t, T]; \quad X_t = x, \tag{8}$$

where

$$\tilde b\big(s, x, \pi(\cdot)\big) = \int_A b(s, x, a)\pi(a)\,\mathrm{d}a, \qquad \tilde\sigma\big(s, x, \pi(\cdot)\big) = \sqrt{\int_A \sigma\sigma^\top(s, x, a)\pi(a)\,\mathrm{d}a}.$$
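In a model-free setting these policy-averaged coefficients are never computed explicitly, but it may help intuition to see them estimated by Monte Carlo over the action randomization. A minimal sketch under a hypothetical scalar model and Gaussian policy (none of these choices are from the paper); here `sig2_tilde` estimates the averaged squared volatility, whose square root gives the volatility of the averaged dynamics:

```python
import numpy as np

# Monte Carlo sketch of the policy-averaged coefficients entering (8), for a
# hypothetical scalar model b(s, x, a) = a - x, σ(s, x, a) = 1 + 0.1 a², and a
# Gaussian policy π(·|s, x) = N(0, 1). All model choices here are illustrative.
rng = np.random.default_rng(0)
s, x = 0.0, 0.5
a = rng.normal(0.0, 1.0, size=200_000)           # samples a ~ π(·|s, x)

b_tilde = np.mean(a - x)                         # ≈ ∫_A b(s, x, a) π(a) da
sig2_tilde = np.mean((1.0 + 0.1 * a**2) ** 2)    # ≈ ∫_A σσᵀ(s, x, a) π(a) da

# Closed forms under N(0, 1): E[a] - x = -0.5 and
# E[(1 + 0.1 a²)²] = 1 + 0.2·E[a²] + 0.01·E[a⁴] = 1 + 0.2 + 0.03 = 1.23.
```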
Intuitively, by the law of large numbers, the solution of (8), denoted by {X̃^π_s, t ≤ s ≤
T }, is the limit of the average of the sample trajectories X π over randomization (i.e., copies
of π). Rigorously, it follows from the property of Markovian projection due to Brunick and
Shreve (2013, Theorem 3.6) that Xsπ and X̃sπ have the same distribution for each s ∈ [t, T ].
Hence, the value function (7) is identical to
$$J(t, x; \pi) = \mathbb{E}^{P^W}\Big[\int_t^T e^{-\beta(s-t)} \tilde R\big(s, \tilde X^\pi_s, \pi(\cdot|s, \tilde X^\pi_s)\big)\,\mathrm{d}s + e^{-\beta(T-t)} h(\tilde X^\pi_T)\,\Big|\, \tilde X^\pi_t = x\Big], \tag{9}$$

where $\tilde R(s, x, \pi) = \int_A [r(s, x, a) - \gamma\log\pi(a)]\pi(a)\,\mathrm{d}a$.
The function J(·, ·; π) is called the value function of the policy π, and the task of RL is
to find
J ∗ (t, x) = max J(t, x; π), (10)
π∈Π
where Π stands for the set of admissible (stochastic) policies. The following gives the precise
definition of admissible policies.

Definition 1 A policy π = π(·|·, ·) is called admissible if

(i) π(·|t, x) ∈ P(A), supp π(·|t, x) = A for every (t, x) ∈ [0, T ] × ℝ^d, and π(a|t, x) :
(t, x, a) ∈ [0, T ] × ℝ^d × A → ℝ is measurable;
(iii) For any given α > 0, the entropy of π and its α-th moment have polynomial growth in x,
i.e., there are constants C = C(α) > 0 and µ′ = µ′(α) ≥ 1 such that
$\big|\int_A -\log\pi(a|t, x)\,\pi(a|t, x)\,\mathrm{d}a\big| \le C(1 + |x|^{\mu'})$ and
$\int_A |a|^\alpha \pi(a|t, x)\,\mathrm{d}a \le C(1 + |x|^{\mu'})$, for all (t, x).
Under Assumption 1 along with (i) and (ii) in Definition 1, Jia and Zhou (2022b, Lemma
2) show that the SDE (8) admits a unique strong solution since its coefficients are locally
Lipschitz continuous and have linear growth; see Mao (2007, Chapter 2, Theorem 3.4) for
the general result. In addition, the growth condition (iii) guarantees that the new payoff
function (9) is finite. More discussions on the conditions required in the above definition
can be found in Jia and Zhou (2022b). Moreover, Definition 1 involves conditions only on
the policy itself; hence they can be directly verified.
Note that the problem (8)–(9) is mathematically equivalent to the problem (6)–(7).
Nevertheless, they serve different purposes in our study: the former provides a framework
for theoretical analysis of the value function, while the latter directly involves observable
samples for algorithm design.
The value function J(·, ·; π) of a policy π ∈ Π satisfies

$$\frac{\partial J}{\partial t}(t, x; \pi) + \int_A \Big[H\Big(t, x, a, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\Big) - \gamma\log\pi(a|t, x)\Big]\pi(a|t, x)\,\mathrm{d}a - \beta J(t, x; \pi) = 0, \quad J(T, x; \pi) = h(x). \tag{11}$$
This is a version of the celebrated Feynman–Kac formula in the current RL setting. Under
Assumption 1 as well as the admissibility conditions in Definition 1, the PDE (11) admits a
unique viscosity solution among polynomially growing functions (Beck et al., 2021, Theorem
1.1).
On the other hand, Tang et al. (2022, Section 5) derive the following exploratory HJB
equation for the problem (8)–(9) satisfied by the optimal value function:
$$\frac{\partial J^*}{\partial t}(t, x) + \sup_{\pi \in \mathcal{P}(A)} \int_A \Big[H\Big(t, x, a, \frac{\partial J^*}{\partial x}(t, x), \frac{\partial^2 J^*}{\partial x^2}(t, x)\Big) - \gamma\log\pi(a)\Big]\pi(a)\,\mathrm{d}a - \beta J^*(t, x) = 0, \quad J^*(T, x) = h(x). \tag{12}$$

Moreover, the optimal (stochastic) policy is a Gibbs measure or Boltzmann distribution:

$$\pi^*(\cdot|t, x) \propto \exp\Big\{\frac{1}{\gamma} H\Big(t, x, \cdot, \frac{\partial J^*}{\partial x}(t, x), \frac{\partial^2 J^*}{\partial x^2}(t, x)\Big)\Big\}. \tag{13}$$
This result shows that one should use a Gibbs sampler in general to generate trial-and-error
strategies to explore the environment when the regularizer is chosen to be entropy.7
In the special case when the system dynamics are linear in the action and the payoffs are
quadratic in the action, the Hamiltonian is a quadratic function of the action, and the Gibbs
measure thus specializes to a Gaussian distribution under mild conditions.
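To illustrate: exponentiating a Hamiltonian that is quadratic in the action and completing the square yields a Gaussian policy. The following sketch verifies this numerically on a grid; the coefficients and temperature below are hypothetical, not from the paper.

```python
import numpy as np

# Sketch: when H is quadratic in the action, exp{H/γ} normalizes to a Gaussian.
# With H(a) = -0.5·k·a² + m·a + c (k > 0), completing the square gives
# π*(·) = N(m/k, γ/k). The coefficients k, m, c and γ are hypothetical.
gamma, k, m, c = 0.5, 2.0, 1.0, -3.0

a = np.linspace(-6.0, 6.0, 20001)
da = a[1] - a[0]
H = -0.5 * k * a**2 + m * a + c

density = np.exp(H / gamma)
density /= density.sum() * da                  # normalize the Gibbs measure on the grid

mean = (a * density).sum() * da                # should be m / k = 0.5
var = ((a - mean)**2 * density).sum() * da     # should be γ / k = 0.25
```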
Plugging (13) into (12) to replace the supremum operator therein leads to the following
equivalent form of the exploratory HJB equation
$$\frac{\partial J^*}{\partial t}(t, x) + \gamma\log\int_A \exp\Big\{\frac{1}{\gamma} H\Big(t, x, a, \frac{\partial J^*}{\partial x}(t, x), \frac{\partial^2 J^*}{\partial x^2}(t, x)\Big)\Big\}\,\mathrm{d}a - \beta J^*(t, x) = 0, \quad J^*(T, x) = h(x). \tag{14}$$
More theoretical properties regarding (14) and consequently the problem (6)–(7) including
its well-posedness can be found in Tang et al. (2022).
The Gibbs measure can also be used to derive a policy improvement theorem. Wang and
Zhou (2020) prove such a theorem in the context of continuous-time mean–variance port-
folio selection, which is essentially a linear–quadratic (LQ) control problem. The following
extends that result to the general case.
Theorem 2 For any given π ∈ Π, define

$$\pi'(\cdot|t, x) \propto \exp\Big\{\frac{1}{\gamma} H\Big(t, x, \cdot, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\Big)\Big\}.$$

If π′ ∈ Π, then

$$J(t, x; \pi') \ge J(t, x; \pi).$$

Moreover, if the following map

$$\mathcal{I}(\pi) = \frac{\exp\big\{\frac{1}{\gamma} H\big(t, x, \cdot, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\big)\big\}}{\int_A \exp\big\{\frac{1}{\gamma} H\big(t, x, a, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\big)\big\}\,\mathrm{d}a}, \quad \pi \in \Pi,$$

has a fixed point π*, then π* is an optimal policy of (10).
At this point, Theorem 2 remains a theoretical result that cannot be directly applied
to learning procedures, because the Hamiltonian $H\big(t, x, a, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\big)$
depends on the model parameters, which are unknown in the RL context. To develop
implementable algorithms, we turn to Q-learning.
3.1 Q-function
Tallec et al. (2019) consider a continuous-time MDP and then discretize it upfront to a
discrete-time MDP with time discretization δt. In that setting, the authors argue that “there
is no Q-function in continuous time” because the Q-function collapses to the value function
when δt is infinitesimally small. Here, we take an entirely different approach that does
not involve time discretization. Incidentally, by comparing the form of the Gibbs measure
(13) with that of the widely employed Boltzmann exploration for learning in discrete-time
MDPs, Gao et al. (2020) and Zhou (2021) conjecture that the continuous-time counterpart
of the Q-function is the Hamiltonian. Our approach will also provide a rigorous justification
of, and indeed a refinement of, this conjecture.
Given π ∈ Π and (t, x, a) ∈ [0, T ) × Rd × A, consider a “perturbed” policy of π, denoted
by π̃, as follows: It takes the action a ∈ A on [t, t + ∆t) where ∆t > 0, and then follows π
on [t + ∆t, T ]. The corresponding state process X π̃ , given Xtπ̃ = x, can be broken into two
pieces. On [t, t + ∆t), it is X^a, which is the solution to

$$\mathrm{d}X^a_s = b(s, X^a_s, a)\,\mathrm{d}s + \sigma(s, X^a_s, a)\,\mathrm{d}W_s, \quad s \in [t, t + \Delta t); \quad X^a_t = x,$$

while on [t + ∆t, T ], it is X^π following (6) but with the initial time–state pair (t + ∆t, X^a_{t+∆t}).
With ∆t > 0 fixed, we introduce the (∆t-parameterized) Q-function, denoted by Q_{∆t}(t, x, a; π),
to be the expected reward drawn from the perturbed policy π̃:8

$$\begin{aligned}
Q_{\Delta t}(t, x, a; \pi) = \mathbb{E}^{\mathbb{P}}\bigg[ &\int_t^{t+\Delta t} e^{-\beta(s-t)} r(s, X^a_s, a)\,\mathrm{d}s \\
&+ \int_{t+\Delta t}^{T} e^{-\beta(s-t)} \big[r(s, X^\pi_s, a^\pi_s) - \gamma\log\pi(a^\pi_s|s, X^\pi_s)\big]\,\mathrm{d}s + e^{-\beta(T-t)} h(X^\pi_T) \,\bigg|\, X^{\tilde\pi}_t = x\bigg].
\end{aligned}$$
Recall that our formulation (7) includes an entropy regularizer term that incentivizes
exploration using stochastic policies. However, in defining Q∆t (t, x, a; π) above we do not
include such a term on [t, t + ∆t) for π̃ because a deterministic constant action a is applied
whose entropy is excluded.
The following proposition provides an expansion of this Q-function in ∆t.
Proposition 3 We have

$$Q_{\Delta t}(t, x, a; \pi) = J(t, x; \pi) + \Big[\frac{\partial J}{\partial t}(t, x; \pi) + H\Big(t, x, a, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\Big) - \beta J(t, x; \pi)\Big]\Delta t + o(\Delta t). \tag{15}$$
So, this ∆t-based Q-function includes three terms. The leading term is the value function
J, which is independent of the action a taken. This is natural because we only apply a for
a time window of length ∆t and hence the impact of this action is only of the order ∆t.
8. In the existing literature, Q-learning with entropy regularization is often referred to as the “soft Q-
learning” with the associated “soft Q-function”. A review of and more discussions on soft Q-learning in
discrete time can be found in Appendix A as well as in Haarnoja et al. (2018b); Schulman et al. (2017).
Next, the first-order term of ∆t is the Hamiltonian plus the temporal change of the value
function (consisting of the derivative of the value function in time and the discount term,
both of which would disappear for stationary and non-discounted problems), in which only
the Hamiltonian depends on the action a. It is interesting to note that these are the same
terms in the Feynman–Kac PDE (11), less the entropy term. Finally, the higher-order
residual term, o(∆t), comes from the approximation error of the integral over [t, t + ∆t] and
can be ignored when such errors are aggregated as ∆t gets smaller.
Proposition 3 yields that this Q-function also collapses to the value function when
∆t → 0, similar to what Tallec et al. (2019) claim. However, we must emphasize that
our Q-function is not the same as the one used in Tallec et al. (2019), who discretize time
upfront and study the resulting discrete-time MDP. The corresponding Q-function and value
function therein exist based on the discrete-time RL theory, denoted respectively by Qδt , V δt
where δt is the time discretization size. Then Tallec et al. (2019) argue that the advantage
function A^{δt} = Q^{δt} − V^{δt} can be properly scaled by δt, leading to the existence of
lim_{δt→0} A^{δt}/δt. In short, the Q-function and the resulting algorithms in Tallec et al. (2019)
are still in the realm of discrete-time MDPs. By contrast, Q∆t (t, x, a; π) introduced here
is a continuous-time notion. It is the value function of a policy that applies a constant
action in [t, t + ∆t] and follows π thereafter. So ∆t here is a parameter representing the
length of period in which the constant action is taken, not the time discretization size as in
Tallec et al. (2019). Most importantly, our Q-function is just a tool used to introduce the
q-function, the centerpiece of this paper. Having said all this, we recognize that lim_{δt→0} A^{δt}/δt
that we arrive at this same object in two different ways (discretizing upfront then taking
δt → 0, versus a true continuous-time analysis).
3.2 q-function
Since the leading term in the Q-function, Q_{∆t}(t, x, a; π), coincides with the value function
of π and hence cannot be used to rank actions, we focus on the first-order approximation,
which gives an infinitesimal state–action value. Motivated by this, we define the
“q-function” as follows:
Definition 4 The q-function of the problem (6)–(7) associated with a given policy π ∈ Π
is defined as

$$q(t, x, a; \pi) = \frac{\partial J}{\partial t}(t, x; \pi) + H\Big(t, x, a, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\Big) - \beta J(t, x; \pi), \quad (t, x, a) \in [0, T] \times \mathbb{R}^d \times A. \tag{16}$$
Clearly, this function is the first-order derivative of the Q-function with respect to ∆t,
as an immediate consequence of Proposition 3:
Corollary 5 We have

$$q(t, x, a; \pi) = \lim_{\Delta t \to 0} \frac{Q_{\Delta t}(t, x, a; \pi) - J(t, x; \pi)}{\Delta t}, \quad (t, x, a) \in [0, T] \times \mathbb{R}^d \times A. \tag{17}$$
Some remarks are in order. First, the q-function is a function of the time–state–action
triple under a given policy, analogous to the conventional Q-function for MDPs; yet it is a
continuous-time notion because it does not depend on any time-discretization. This feature
is a vital advantage for learning algorithm design as Tallec et al. (2019) point out that
the performance of RL algorithms is very sensitive with respect to the time discretization.
Second, the q-function is related to the advantage function in the MDP literature (e.g., Baird
1993; Mnih et al. 2016), which describes the difference between the state–action value and
the state value. Here in this paper, the q-function reflects the instantaneous advantage rate
of a given action at a given time–state pair under a given policy.
We also notice that in (16) only the Hamiltonian depends on a; hence the improved
policy presented in Theorem 2 can also be expressed in terms of the q-function:
$$\pi'(\cdot|t, x) \propto \exp\Big\{\frac{1}{\gamma} H\Big(t, x, \cdot, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\Big)\Big\} \propto \exp\Big\{\frac{1}{\gamma}\, q(t, x, \cdot; \pi)\Big\}.$$
Consequently, if we can learn the q-function q(·, ·, ·; π) under any policy π, then it follows
from Theorem 2 that we can improve π by implementing a Gibbs measure over the q-values,
analogous to, say, ε-greedy policy in classical Q-learning (Sutton and Barto, 2018, Chapter
6). Finally, the analysis so far justifies the aforementioned conjecture about the proper
continuous-time version of the Q-function, and provides a theoretical interpretation to the
widely used Boltzmann exploration in RL.
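On a finite grid of actions, sampling from the Gibbs measure over q-values is simply a softmax with temperature γ. A minimal sketch of this improvement/exploration step, with toy q-values (not learned ones):

```python
import numpy as np

# Sketch: policy improvement via a Gibbs measure over q-values, π'(a) ∝ exp{q(a)/γ},
# on a finite grid of actions. The q-values below are toy numbers, not learned ones.
rng = np.random.default_rng(42)
gamma = 0.2                                    # temperature parameter
actions = np.array([-1.0, 0.0, 1.0, 2.0])
q_values = np.array([0.1, 0.4, 0.3, -0.2])     # q(t, x, a; π) at a fixed (t, x)

logits = q_values / gamma
probs = np.exp(logits - logits.max())          # subtract max for numerical stability
probs /= probs.sum()

sample = rng.choice(actions, p=probs)          # one randomized (exploratory) action
```

As γ → 0 the Gibbs measure concentrates on the maximizing action, recovering the greedy step of classical Q-learning.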
The main theoretical results of this paper are martingale characterizations of the q-
function, which can in turn be employed to devise algorithms for learning crucial functions
including the q-function, in the same way as in applying the martingale approach for PE
(Jia and Zhou, 2022a).9
The following first result characterizes the q-function associated with a given policy π,
assuming that its value function has already been learned and known.
Theorem 6 Let a policy π ∈ Π, its value function J and a continuous function q̂ : [0, T ] ×
Rd × A → R be given. Then
(i) q̂(t, x, a) = q(t, x, a; π) for all (t, x, a) ∈ [0, T ] × Rd × A if and only if for all (t, x) ∈
[0, T ] × Rd , the following process
$$e^{-\beta s} J(s, X^\pi_s; \pi) + \int_t^s e^{-\beta u}\big[r(u, X^\pi_u, a^\pi_u) - \hat q(u, X^\pi_u, a^\pi_u)\big]\,\mathrm{d}u \tag{18}$$

is an $(\{\mathcal{F}_s\}_{s \ge 0}, \mathbb{P})$-martingale, where $X^\pi_t = x$;
(iii) If there exists π′ ∈ Π such that for all (t, x) ∈ [0, T ] × ℝ^d, (19) is an ({F_s}_{s≥0}, P)-
martingale where X^{π′}_t = x, then q̂(t, x, a) = q(t, x, a; π) for all (t, x, a) ∈ [0, T ] × ℝ^d × A.
Jia and Zhou (2022a) unify and characterize the learning of value function (i.e., PE) by
martingale conditions of certain stochastic processes. Theorem 6 shows that learning the
q-function again boils down to maintaining the martingality of the processes (18) or (19).
However, there are subtle differences between these martingale conditions. Jia and Zhou
(2022a) consider only deterministic policies so the martingales therein are with respect to
({F^W_s}_{s≥0}, P^W). Jia and Zhou (2022b) extend the policy evaluation to include stochastic
policies, but the related martingales are in terms of the averaged state X̃^π and hence also
with respect to ({FsW }s≥0 , PW ). By contrast, the martingality in Theorem 6 is with respect
to ({Ft }t≥0 , P), where the filtration {Ft }t≥0 is the enlarged one that includes the random-
ness for generating actions. Note that the filtration determines the class of test functions
necessary to characterize a martingale through the so-called martingale orthogonality con-
ditions (Jia and Zhou, 2022a). So the above martingale conditions suggest that one ought to
choose test functions dependent on the past and current actions when designing algorithms
to learn the q-function. Moreover, the q-function can be interpreted as a compensator that
guarantees the martingality over this larger information field.
Theorem 6-(i) informs on-policy learning, namely, learning the q-function of the given
policy π, called the target policy, based on data $\{(s, X_s^\pi, a_s^\pi),\ t \le s \le T\}$ generated by π
itself. Nevertheless, one of the key advantages of classical Q-learning for MDPs is that it
also works for off-policy learning, namely, learning the q-function of a given target policy
π based on data generated by a different admissible policy π ′ , called a behavior policy.
On-policy and off-policy reflect two different learning settings depending on the availability
of data and/or the choice of a learning agent. When data generation can be controlled by
the agent, conducting on-policy learning is possible although she could still elect off-policy
learning. By contrast, when data generation is not controlled by the agent and she has to
rely on data under other policies, then it becomes off-policy learning. Theorem 6-(ii) and
-(iii) stipulate that our q-learning in continuous time can also be off-policy.
Finally, (20) is a consistency condition to uniquely identify the q-function. If we only
consider deterministic policies of the form a(·, ·) in which case γ = 0 and π(·|t, x) degenerates
into the Dirac measure concentrating on a(t, x), then (20) reduces to
Then
(i) Jˆ and q̂ are respectively the value function and the q-function associated with π if and
only if for all (t, x) ∈ [0, T ] × Rd , the following process
$$e^{-\beta s} \hat{J}(s, X_s^\pi) + \int_t^s e^{-\beta u}\big[r(u, X_u^\pi, a_u^\pi) - \hat{q}(u, X_u^\pi, a_u^\pi)\big]\,du \tag{22}$$
is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale, where $X_t^\pi = x$.
$$q^*(t,x,a) = \frac{\partial J^*}{\partial t}(t,x) + H\Big(t, x, a, \frac{\partial J^*}{\partial x}(t,x), \frac{\partial^2 J^*}{\partial x^2}(t,x)\Big) - \beta J^*(t,x), \tag{24}$$
where J ∗ is the optimal value function satisfying the exploratory HJB equation (14).
Proposition 8 We have
$$\int_A \exp\Big\{\frac{1}{\gamma} q^*(t,x,a)\Big\}\,da = 1 \tag{25}$$
for all $(t,x)$, and consequently the optimal policy $\pi^*$ is
$$\pi^*(a|t,x) = \exp\Big\{\frac{1}{\gamma} q^*(t,x,a)\Big\}. \tag{26}$$
As seen from its proof (in Appendix C), Proposition 8 is an immediate consequence of
the exploratory HJB equation (14). Indeed, (25) is equivalent to (14) when viewed as an
equation for J ∗ due to (24). However, in terms of q ∗ , satisfying (25) is only a necessary
condition or a minimum requirement for being the optimal q-function, and by no means
sufficient, in the current RL setting. This is because in the absence of knowledge about the
primitives b, σ, r, h, we are unable to determine J ∗ from q ∗ by their relationship (24) which
can be viewed as a PDE for $J^*$. To fully characterize $J^*$ as well as $q^*$, we have to resort
to martingale conditions, as stipulated in the following result.
Then
(i) If $\widehat{J}^*$ and $\widehat{q}^*$ are respectively the optimal value function and the optimal q-function, then for any $\pi \in \Pi$ and all $(t,x) \in [0,T] \times \mathbb{R}^d$, the following process
$$e^{-\beta s} \widehat{J}^*(s, X_s^\pi) + \int_t^s e^{-\beta u}\big[r(u, X_u^\pi, a_u^\pi) - \widehat{q}^*(u, X_u^\pi, a_u^\pi)\big]\,du \tag{28}$$
is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale, where $X_t^\pi = x$.
(ii) If there exists one $\pi \in \Pi$ such that for all $(t,x) \in [0,T] \times \mathbb{R}^d$, (28) is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale, then $\widehat{J}^*$ and $\widehat{q}^*$ are respectively the optimal value function and the optimal q-function.
Theorem 9 lays a foundation for off-policy learning without having to go through it-
erative policy improvement. We emphasize here that to learn the optimal policy and the
optimal q-function, it is infeasible to conduct on-policy learning because π ∗ is unknown and
hence the constraints (21) cannot be checked or enforced. The new constraints in (27)
no longer depend on policies and instead give basic requirements for the candidate value
function and q-function. Maintaining the martingality of (28) under any admissible policy
is equivalent to the optimality of the candidate value function and q-function. Moreover,
Theorem 9-(ii) suggests that we do not need to use data generated by all admissible policies. Instead, data generated by one – any one – policy suffice; e.g., we could take $\pi(a|t,x) = \exp\{\frac{1}{\gamma}\widehat{q}^*(t,x,a)\}$ as the behavior policy.
Finally, let us remark that Theorems 7 and 9 are independent of each other in the
sense that one does not imply the other. One may design different algorithms from them
depending on different cases and needs.
for all $\theta \in \Theta \subset \mathbb{R}^{L_\theta}$, $\psi \in \Psi \subset \mathbb{R}^{L_\psi}$. Then Theorem 2 suggests that the policy can be improved by
$$\pi^\psi(a|t,x) = \frac{\exp\{\frac{1}{\gamma} q^\psi(t,x,a)\}}{\int_A \exp\{\frac{1}{\gamma} q^\psi(t,x,a)\}\,da}, \tag{30}$$
while by assumption the normalizing constant $\int_A \exp\{\frac{1}{\gamma} q^\psi(t,x,a)\}\,da$ can be explicitly computed for any $\psi \in \Psi$. Next, we can use the data generated by $\pi^\psi$ and apply the martingale
condition in Theorem 7 to learn its value function and q-function, leading to an iterative
actor–critic algorithm.
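When the action space is discretized, the improvement step (30) amounts to a softmax of the q-values. A minimal sketch, with a one-dimensional action grid and a toy quadratic q-function (both our illustrative choices, not from the paper):

```python
# Sketch of the policy-improvement step (30): given a q-function approximator,
# form the Gibbs (softmax) policy over a discretized action grid.
import math

def gibbs_policy(q_values, gamma, da):
    """Normalized density pi(a) proportional to exp(q(a)/gamma) on a uniform grid."""
    m = max(q_values)                      # subtract max for numerical stability
    w = [math.exp((q - m) / gamma) for q in q_values]
    z = sum(w) * da                        # quadrature approximation of the integral
    return [wi / z for wi in w]

grid = [i * 0.01 - 2.0 for i in range(401)]   # actions in [-2, 2]
q_vals = [-(a - 0.5) ** 2 for a in grid]      # toy quadratic q-values, peaked at a = 0.5
pi = gibbs_policy(q_vals, gamma=0.2, da=0.01)
print(abs(sum(p * 0.01 for p in pi) - 1.0) < 1e-9)   # density integrates to 1
```

By construction the resulting density concentrates mass near the q-maximizing action, with the temperature γ controlling how exploratory the improved policy is.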
12. Conventional Q-learning or SARSA is often said to be value-based in which the Q-function is a state–
action value function serving as a critic. In q-learning, as Theorem 6 stipulates, the q-function is uniquely
determined by the value function J as its compensator. Hence, q plays a dual role here: it is both a
critic (as it can be derived endogenously from the value function) and an actor (as it derives an improved
policy). This is why we call the q-learning algorithms actor–critic, albeit with a slight abuse of
the term, because the “actor” here is not purely exogenous as with, say, policy gradient.
Alternatively, we may choose to directly learn the optimal value function and q-function
based on Theorem 9. In this case, the approximators J θ and q ψ should satisfy
$$J^\theta(T,x) = h(x), \qquad \int_A \exp\Big\{\frac{1}{\gamma} q^\psi(t,x,a)\Big\}\,da = 1. \tag{31}$$
Note that if the policy π in (29) is taken in the form (30), then the second equation in
(31) is automatically satisfied. Henceforth in this section, we focus on deriving algorithms
based on Theorem 9 and hence always impose the constraints (31). In this case, any approximator of the q-function directly gives that of the policy via $\pi^\psi(a|t,x) = \exp\{\frac{1}{\gamma} q^\psi(t,x,a)\}$; thereby we avoid learning the q-function and the policy separately. Moreover, making use of (31) typically results in a more specialized parametric form of the q-function approximator $q^\psi$, potentially facilitating more efficient learning. Here is an example.
Example 1 When the system dynamics are linear in the action $a$ and the reward is quadratic in $a$, the Hamiltonian, and hence the q-function, is quadratic in $a$. So we can parameterize
$$q^\psi(t,x,a) = -\frac{1}{2}\big(a - q_1^\psi(t,x)\big)^\top q_2^\psi(t,x)\big(a - q_1^\psi(t,x)\big) + q_0^\psi(t,x), \quad (t,x,a) \in [0,T]\times\mathbb{R}^d\times\mathbb{R}^m,$$
and enforcing the constraint (31) pins down the zeroth-order term, yielding
$$q^\psi(t,x,a) = -\frac{1}{2}\big(a - q_1^\psi(t,x)\big)^\top q_2^\psi(t,x)\big(a - q_1^\psi(t,x)\big) + \frac{\gamma}{2}\log\det q_2^\psi(t,x) - \frac{m\gamma}{2}\log(2\pi\gamma).$$
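The normalization in Example 1 can be checked numerically in dimension $m = 1$: with the zeroth-order term pinned to $\frac{\gamma}{2}\log q_2 - \frac{\gamma}{2}\log(2\pi\gamma)$, the Gibbs density $\exp\{q(a)/\gamma\}$ integrates to one. The parameter values below are arbitrary test numbers of ours:

```python
# Numerical check of the normalization constraint (31) for a quadratic q (m = 1).
import math

gamma, q1, q2 = 0.5, 0.3, 2.0   # temperature, mean term, curvature term (arbitrary)

def q_psi(a):
    return -0.5 * q2 * (a - q1) ** 2 + 0.5 * gamma * math.log(q2) \
           - 0.5 * gamma * math.log(2 * math.pi * gamma)

# quadrature of the integral of exp{q(a)/gamma} over a wide interval
da, lo = 1e-3, -10.0
total = sum(math.exp(q_psi(lo + i * da) / gamma) * da for i in range(20001))
print(round(total, 4))
```

The Gibbs density here is exactly a Gaussian with mean $q_1$ and variance $\gamma/q_2$, which is why the quadrature returns 1 up to discretization error.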
The next step in algorithm design is to update (θ, ψ) by enforcing the martingale con-
dition stipulated in Theorem 9 and applying the techniques developed in Jia and Zhou
(2022a). A number of algorithms can be developed based on two types of objectives: to
minimize the martingale loss function or to satisfy the martingale orthogonality conditions.
The latter calls for solving a system of equations, for which there are two further
techniques: applying stochastic approximation as in the classical temporal-difference
(TD) algorithms or minimizing a quadratic function in lieu of the system of equations as in
the generalized method of moments (GMM). For the reader’s convenience, we summarize these
methods in the q-learning context below.
This method is intrinsically offline because the loss function involves the whole horizon
[0, T]; however, we are free to choose optimization algorithms to update the parameters
(θ, ψ). For example, we can apply stochastic gradient descent and update
$$\theta \leftarrow \theta + \alpha_\theta \int_0^T \frac{\partial J^\theta}{\partial \theta}\big(t, X_t^{\pi^\psi}\big)\, G_{t:T}\, dt,$$
$$\psi \leftarrow \psi + \alpha_\psi \int_0^T \int_t^T e^{-\beta(s-t)} \frac{\partial q^\psi}{\partial \psi}\big(s, X_s^{\pi^\psi}, a_s^{\pi^\psi}\big)\,ds\; G_{t:T}\, dt,$$
where $G_{t:T} = e^{-\beta(T-t)} h\big(X_T^{\pi^\psi}\big) - J^\theta\big(t, X_t^{\pi^\psi}\big) + \int_t^T e^{-\beta(s-t)}\big[r\big(s, X_s^{\pi^\psi}, a_s^{\pi^\psi}\big) - q^\psi\big(s, X_s^{\pi^\psi}, a_s^{\pi^\psi}\big)\big]\,ds,$
and αθ and αψ are the learning rates. We present Algorithm 1 based on this updat-
ing rule. Note that this algorithm is analogous to the classical gradient Monte Carlo
method or TD(1) for MDPs (Sutton and Barto, 2018) because full sample trajectories
are used to compute gradients.
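The quantity $G_{t:T}$ can be sanity-checked on a toy problem where the exact value function and q-function are known. With $X$ a driftless Brownian motion, $r = 0$, $h(x) = x$ and $\beta = 0$, we have $J(t,x) = x$ and $q \equiv 0$, so $G_{0:T} = h(X_T) - J(0, X_0)$ should be mean-zero; this is the martingale property the loss enforces. All toy choices here are ours:

```python
# Monte Carlo check that G_{0:T} has mean zero at the true (J, q) on a toy problem.
import random, math

random.seed(0)
T, K = 1.0, 100
dt = T / K

def G0T(x0):
    x = x0
    for _ in range(K):
        x += math.sqrt(dt) * random.gauss(0.0, 1.0)   # simulate dX = dW
    # G_{0:T} = h(X_T) - J(0, X_0) + integral of (r - q) ds, which is 0 here
    return x - x0

samples = [G0T(1.0) for _ in range(20000)]
mean_G = sum(samples) / len(samples)
print(abs(mean_G) < 0.05)  # Monte Carlo mean is close to zero
```

At the true pair $(J, q)$ the martingale loss is minimized, so the empirical gradient updates above vanish on average; any systematic bias in $G_{t:T}$ signals a misspecified approximator.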
• Choose two different test functions $\xi_t$ and $\zeta_t$ that are both $\mathcal{F}_t$-adapted vector-valued
stochastic processes, and consider the following system of equations in $(\theta, \psi)$:
$$\mathbb{E}^{\mathbb{P}}\Big[\int_0^T \xi_t \big( dJ^\theta\big(t, X_t^{\pi^\psi}\big) + r\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - q^\psi\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - \beta J^\theta\big(t, X_t^{\pi^\psi}\big)dt \big)\Big] = 0,$$
and
$$\mathbb{E}^{\mathbb{P}}\Big[\int_0^T \zeta_t \big( dJ^\theta\big(t, X_t^{\pi^\psi}\big) + r\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - q^\psi\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - \beta J^\theta\big(t, X_t^{\pi^\psi}\big)dt \big)\Big] = 0.$$
These equations can be solved by stochastic approximation, either offline using entire trajectories or online by
$$\theta \leftarrow \theta + \alpha_\theta\, \xi_t \big( dJ^\theta\big(t, X_t^{\pi^\psi}\big) + r\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - q^\psi\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - \beta J^\theta\big(t, X_t^{\pi^\psi}\big)dt \big),$$
$$\psi \leftarrow \psi + \alpha_\psi\, \zeta_t \big( dJ^\theta\big(t, X_t^{\pi^\psi}\big) + r\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - q^\psi\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - \beta J^\theta\big(t, X_t^{\pi^\psi}\big)dt \big).$$
Typical choices of test functions are $\xi_t = \frac{\partial J^\theta}{\partial \theta}\big(t, X_t^{\pi^\psi}\big)$ and $\zeta_t = \frac{\partial q^\psi}{\partial \psi}\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)$, yielding
algorithms that are closest to TD-style policy evaluation and Q-learning (e.g., in
Tallec et al. 2019) and that at the same time belong to the more general semi-gradient methods
(Sutton and Barto, 2018). We present the online and offline q-learning algorithms,
Algorithms 2 and 3 respectively, based on these test functions. We must, however,
13. Here, when implementing in a computer program, dJ θ is the timestep-wise difference in J θ when time
is discretized, or indeed it is the temporal-difference (TD) of the value function approximator J θ .
stress that testing against these two specific functions is theoretically not sufficient
to guarantee the martingale condition. Using them with a richer, higher-dimensional
parametric family for J and q, however, makes the approximation to the martingale condition
increasingly accurate. Moreover, the corresponding stochastic approximation algorithms are not
guaranteed to converge in general, and the test functions have to be carefully selected
(Jia and Zhou, 2022a) depending on (J, q). Some of the different test functions leading
to different types of algorithms in the context of policy evaluation are discussed in Jia
and Zhou (2022a).
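For concreteness, a single online semi-gradient step with these test functions can be sketched as follows, using linear approximators $J^\theta(t,x) = \theta x$ and $q^\psi(t,x,a) = \psi a$ (so $\xi_t = x_t$ and $\zeta_t = a_t$); the transition data below are made up for illustration:

```python
# One online update: both parameters move by the same temporal-difference increment,
# weighted by their respective test functions.
beta, dt = 0.1, 0.01
alpha_theta, alpha_psi = 0.1, 0.1
theta, psi = 0.5, 0.2

x, x_next, a, r = 1.0, 1.02, 0.3, 0.4   # one observed transition and reward (toy)

def J(th, x): return th * x
def q(ps, x, a): return ps * a

# delta = dJ^theta + r dt - q^psi dt - beta J^theta dt, with dJ^theta the
# timestep-wise difference of the value approximator (cf. footnote 13)
delta = J(theta, x_next) - J(theta, x) + (r - q(psi, x, a) - beta * J(theta, x)) * dt
theta += alpha_theta * x * delta      # xi_t = dJ/dtheta = x_t
psi += alpha_psi * a * delta          # zeta_t = dq/dpsi = a_t
print(round(delta, 6), round(theta, 6), round(psi, 6))
```

The same increment `delta` drives both updates; only the test functions differ, which is exactly the semi-gradient structure described above.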
• Choose the same types of test functions $\xi_t$ and $\zeta_t$ as above, but now minimize the
GMM objective functions:
$$\mathbb{E}^{\mathbb{P}}\Big[\int_0^T \xi_t \big( dJ^\theta\big(t, X_t^{\pi^\psi}\big) + r\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - q^\psi\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - \beta J^\theta\big(t, X_t^{\pi^\psi}\big)dt \big)\Big]^\top A_\theta\, \mathbb{E}^{\mathbb{P}}\Big[\int_0^T \xi_t \big( dJ^\theta\big(t, X_t^{\pi^\psi}\big) + r\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - q^\psi\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - \beta J^\theta\big(t, X_t^{\pi^\psi}\big)dt \big)\Big],$$
and
$$\mathbb{E}^{\mathbb{P}}\Big[\int_0^T \zeta_t \big( dJ^\theta\big(t, X_t^{\pi^\psi}\big) + r\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - q^\psi\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - \beta J^\theta\big(t, X_t^{\pi^\psi}\big)dt \big)\Big]^\top A_\psi\, \mathbb{E}^{\mathbb{P}}\Big[\int_0^T \zeta_t \big( dJ^\theta\big(t, X_t^{\pi^\psi}\big) + r\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - q^\psi\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)dt - \beta J^\theta\big(t, X_t^{\pi^\psi}\big)dt \big)\Big],$$
where $A_\theta \in \mathbb{S}^{L_\theta}_{++}$ and $A_\psi \in \mathbb{S}^{L_\psi}_{++}$. Typical choices of these matrices are $A_\theta = I_{L_\theta}$ and
$A_\psi = I_{L_\psi}$, or $A_\theta = \big(\mathbb{E}^{\mathbb{P}}\big[\int_0^T \xi_t \xi_t^\top dt\big]\big)^{-1}$ and $A_\psi = \big(\mathbb{E}^{\mathbb{P}}\big[\int_0^T \zeta_t \zeta_t^\top dt\big]\big)^{-1}$. Again, we refer
the reader to Jia and Zhou (2022a) for discussions on these choices and the connection
with the classical GTD algorithms and the GMM method.
Our q-learning method is to learn separately the zeroth-order and first-order terms of
Q∆t (t, x, a; π) in ∆t, and the two terms are in themselves independent of ∆t. Our ap-
proach therefore has a true “continuous-time” nature without having to rely on ∆t or any
time discretization, which theoretically facilitates the analysis carried out in the previous
section and algorithmically mitigates the high sensitivity with respect to time discretiza-
tion. Our approach is similar to that introduced in Baird (1994) and Tallec et al. (2019)
where the value function and the (rescaled) advantage function are approximated separately.
Update θ and ψ by
$$\theta \leftarrow \theta + l(j)\,\alpha_\theta \sum_{k=0}^{K-1} \frac{\partial J^\theta}{\partial \theta}(t_k, x_{t_k})\, G_{t_k:T}\,\Delta t,$$
$$\psi \leftarrow \psi + l(j)\,\alpha_\psi \sum_{k=0}^{K-1} \Big[\sum_{i=k}^{K-1} e^{-\beta(t_i - t_k)}\frac{\partial q^\psi}{\partial \psi}(t_i, x_{t_i}, a_{t_i})\,\Delta t\Big] G_{t_k:T}\,\Delta t.$$
end for
$$\Delta\psi = \sum_{i=0}^{K-1} \zeta_{t_i}\big[J^\theta(t_{i+1}, x_{t_{i+1}}) - J^\theta(t_i, x_{t_i}) + r_{t_i}\Delta t - q^\psi(t_i, x_{t_i}, a_{t_i})\Delta t - \beta J^\theta(t_i, x_{t_i})\Delta t\big].$$
Update θ and ψ by
θ ← θ + l(j)αθ ∆θ.
ψ ← ψ + l(j)αψ ∆ψ.
end for
$$\delta = J^\theta(t_{k+1}, x_{t_{k+1}}) - J^\theta(t_k, x_{t_k}) + r_{t_k}\Delta t - q^\psi(t_k, x_{t_k}, a_{t_k})\Delta t - \beta J^\theta(t_k, x_{t_k})\Delta t,$$
$$\Delta\theta = \xi_{t_k}\,\delta, \qquad \Delta\psi = \zeta_{t_k}\,\delta.$$
Update θ and ψ by
θ ← θ + l(j)αθ ∆θ.
ψ ← ψ + l(j)αψ ∆ψ.
Update k ← k + 1
end while
end for
However, as pointed out already, as much as the two approaches may lead to certain similar
algorithms, they are different conceptually. The ones in Baird (1994) and Tallec et al. (2019)
are still based on discrete-time MDPs and hence depend on the size of time-discretization.
By contrast, the value function and q-function in q-learning are well-defined quantities in
continuous time and independent of any time-discretization.
On the other hand, approximating the value function by $J^\theta$ and the q-function by $q^\psi$
separately yields a specific approximator of the Q-function:
$$Q^{\theta,\psi}_{\Delta t}(t,x,a) \approx J^\theta(t,x) + q^\psi(t,x,a)\,\Delta t. \tag{32}$$
With this Q-function approximator $Q^{\theta,\psi}_{\Delta t}$, one of our q-learning based algorithms actually
recovers (a modification of) SARSA, one of the best-known conventional Q-learning
algorithms.
To see this, consider state $X_t$ at time $t$. Then the conventional SARSA with the approximator $Q^{\theta,\psi}_{\Delta t}$ updates the parameters $(\theta, \psi)$ by
$$\begin{aligned}
\theta \leftarrow{}& \theta + \alpha_\theta \Big[Q^{\theta,\psi}_{\Delta t}\big(t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}, a_{t+\Delta t}^{\pi^\psi}\big) - \gamma \log \pi^\psi\big(a_{t+\Delta t}^{\pi^\psi}\,\big|\,t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}\big)\Delta t \\
&\quad - Q^{\theta,\psi}_{\Delta t}\big(t, X_t, a_t^{\pi^\psi}\big) + r\big(t, X_t, a_t^{\pi^\psi}\big)\Delta t - \beta Q^{\theta,\psi}_{\Delta t}\big(t, X_t, a_t^{\pi^\psi}\big)\Delta t\Big] \frac{\partial Q^{\theta,\psi}_{\Delta t}}{\partial \theta}\big(t, X_t, a_t^{\pi^\psi}\big), \\
\psi \leftarrow{}& \psi + \alpha_\psi \Big[Q^{\theta,\psi}_{\Delta t}\big(t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}, a_{t+\Delta t}^{\pi^\psi}\big) - \gamma \log \pi^\psi\big(a_{t+\Delta t}^{\pi^\psi}\,\big|\,t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}\big)\Delta t \\
&\quad - Q^{\theta,\psi}_{\Delta t}\big(t, X_t, a_t^{\pi^\psi}\big) + r\big(t, X_t, a_t^{\pi^\psi}\big)\Delta t - \beta Q^{\theta,\psi}_{\Delta t}\big(t, X_t, a_t^{\pi^\psi}\big)\Delta t\Big] \frac{\partial Q^{\theta,\psi}_{\Delta t}}{\partial \psi}\big(t, X_t, a_t^{\pi^\psi}\big),
\end{aligned} \tag{33}$$
where $a_t^{\pi^\psi} \sim \pi^\psi(\cdot|t, X_t)$ and $a_{t+\Delta t}^{\pi^\psi} \sim \pi^\psi\big(\cdot\big|t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}\big)$. Note that
$$\frac{\partial Q^{\theta,\psi}_{\Delta t}}{\partial \theta} \approx \frac{\partial J^\theta}{\partial \theta}, \qquad \frac{\partial Q^{\theta,\psi}_{\Delta t}}{\partial \psi} \approx \frac{\partial q^\psi}{\partial \psi}\,\Delta t.$$
This means that $\frac{\partial Q^{\theta,\psi}_{\Delta t}}{\partial \psi}$ carries an additional order of $\Delta t$ compared with $\frac{\partial Q^{\theta,\psi}_{\Delta t}}{\partial \theta}$, resulting
in a much slower update of $\psi$ than of $\theta$. This observation is also made by Tallec et al.
(2019), who suggest as a remedy modifying the update rule by replacing the
partial derivative of the Q-function with that of the advantage function. In our setting, this
means we change $\frac{\partial Q^{\theta,\psi}_{\Delta t}}{\partial \psi}$ to $\frac{\partial q^\psi}{\partial \psi}$ in the second update rule of (33).
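The scale mismatch between the two gradients can be made concrete with finite differences on the approximator $Q^{\theta,\psi}_{\Delta t} = J^\theta + q^\psi \Delta t$; the toy parameterizations $J^\theta = \theta x$ and $q^\psi = \psi x a$ below are our own illustrative choices:

```python
# The gradient in psi carries an extra factor of dt relative to the gradient in
# theta, so naive SARSA updates psi far more slowly as dt shrinks.
def Q(theta, psi, x, a, dt):
    J = theta * x            # toy value approximator
    q = psi * x * a          # toy q approximator
    return J + q * dt

x, a, eps = 1.5, 0.7, 1e-6
for dt in (0.1, 0.01, 0.001):
    dQ_dtheta = (Q(1 + eps, 1, x, a, dt) - Q(1, 1, x, a, dt)) / eps
    dQ_dpsi = (Q(1, 1 + eps, x, a, dt) - Q(1, 1, x, a, dt)) / eps
    print(dt, round(dQ_dpsi / dQ_dtheta, 6))   # ratio shrinks proportionally to dt
```

Dividing the ψ-gradient by Δt, i.e., using ∂q/∂ψ instead of ∂Q/∂ψ, removes this vanishing factor, which is precisely the remedy described above.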
Now, the terms in the brackets in (33) can be further rewritten as
$$\begin{aligned}
& Q^{\theta,\psi}_{\Delta t}\big(t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}, a_{t+\Delta t}^{\pi^\psi}\big) - \gamma \log \pi^\psi\big(a_{t+\Delta t}^{\pi^\psi}\,\big|\,t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}\big)\Delta t \\
&\qquad - Q^{\theta,\psi}_{\Delta t}\big(t, X_t, a_t^{\pi^\psi}\big) + r\big(t, X_t, a_t^{\pi^\psi}\big)\Delta t - \beta Q^{\theta,\psi}_{\Delta t}\big(t, X_t, a_t^{\pi^\psi}\big)\Delta t \\
={}& J^\theta\big(t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}\big) - J^\theta(t, X_t) + \big[q^\psi\big(t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}, a_{t+\Delta t}^{\pi^\psi}\big) - q^\psi\big(t, X_t, a_t^{\pi^\psi}\big)\big]\Delta t \\
&\qquad + r\big(t, X_t, a_t^{\pi^\psi}\big)\Delta t - \beta J^\theta(t, X_t)\Delta t - \beta q^\psi\big(t, X_t, a_t^{\pi^\psi}\big)(\Delta t)^2 - \gamma \log \pi^\psi\big(a_{t+\Delta t}^{\pi^\psi}\,\big|\,t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}\big)\Delta t \\
={}& J^\theta\big(t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}\big) - J^\theta(t, X_t) - q^\psi\big(t, X_t, a_t^{\pi^\psi}\big)\Delta t + r\big(t, X_t, a_t^{\pi^\psi}\big)\Delta t - \beta J^\theta(t, X_t)\Delta t \\
&\qquad + \big[q^\psi\big(t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}, a_{t+\Delta t}^{\pi^\psi}\big) - \gamma \log \pi^\psi\big(a_{t+\Delta t}^{\pi^\psi}\,\big|\,t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}\big)\big]\Delta t - \beta q^\psi\big(t, X_t, a_t^{\pi^\psi}\big)(\Delta t)^2 \\
\approx{}& dJ^\theta(t, X_t) - q^\psi\big(t, X_t, a_t^{\pi^\psi}\big)dt + r\big(t, X_t, a_t^{\pi^\psi}\big)dt - \beta J^\theta(t, X_t)dt \\
&\qquad + \underbrace{\big[q^\psi\big(t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}, a_{t+\Delta t}^{\pi^\psi}\big) - \gamma \log \pi^\psi\big(a_{t+\Delta t}^{\pi^\psi}\,\big|\,t{+}\Delta t, X_{t+\Delta t}^{a_t^{\pi^\psi}}\big)\big]}_{\text{mean-0 term due to (29)}}\,dt,
\end{aligned}$$
where in the last step we drop the higher-order term $(\Delta t)^2$ while replacing the difference terms with differentials.
Comparing (33) with the updating rule in Algorithm 3 using test functions $\xi_t = \frac{\partial J^\theta}{\partial \theta}\big(t, X_t^{\pi^\psi}\big)$
and $\zeta_t = \frac{\partial q^\psi}{\partial \psi}\big(t, X_t^{\pi^\psi}, a_t^{\pi^\psi}\big)$, we find that the modified SARSA in Tallec et al. (2019) and
Algorithm 3 differ only by a term whose mean is 0 due to the constraint (29) on the q-function.
This term is solely driven by the action randomization at time $t + \Delta t$. Hence, the
algorithm in Tallec et al. (2019) is noisier and may converge more slowly than
the q-learning one. In Section 7, we numerically compare the results of the above
Q-learning SARSA and q-learning.
$$\min_{\phi' \in \Phi} D_{\mathrm{KL}}\left(\pi^{\phi'}(\cdot|t,x)\,\Bigg\|\, \frac{\exp\{\frac{1}{\gamma} q(t,x,\cdot;\pi^\phi)\}}{\int_A \exp\{\frac{1}{\gamma} q(t,x,a;\pi^\phi)\}\,da}\right) \equiv \min_{\phi' \in \Phi} D_{\mathrm{KL}}\left(\pi^{\phi'}(\cdot|t,x)\,\Big\|\, \exp\Big\{\frac{1}{\gamma} q(t,x,\cdot;\pi^\phi)\Big\}\right),$$
where $D_{\mathrm{KL}}\big(f \,\|\, g\big) := \int_A \log\frac{f(a)}{g(a)}\, f(a)\,da$ is the Kullback–Leibler (KL) divergence between two
positive functions $f, g$ with the same support on $A$, and $f \in \mathcal{P}(A)$ is a probability density
function on $A$.
This procedure may not eventually lead to the desired target policy, but the newly
defined policy still provably improves the previous policy, as indicated in the following
stronger version of the policy improvement theorem that generalizes Theorem 2.
$$D_{\mathrm{KL}}\left(\pi'(\cdot|t,x)\,\Big\|\, \exp\Big\{\frac{1}{\gamma} H\Big(t,x,\cdot,\frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\Big)\Big\}\right) \le D_{\mathrm{KL}}\left(\pi(\cdot|t,x)\,\Big\|\, \exp\Big\{\frac{1}{\gamma} H\Big(t,x,\cdot,\frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\Big)\Big\}\right),$$
Theorem 10 in itself is a general result comparing two given policies not necessarily
within any tractable family of densities. However, an implication of the result is that, given
a current tractable policy $\pi^\phi$, as long as we update it to a new tractable policy $\pi^{\phi'}$ with
$$D_{\mathrm{KL}}\left(\pi^{\phi'}(\cdot|t,x)\,\Big\|\, \exp\Big\{\frac{1}{\gamma} q(t,x,\cdot;\pi^\phi)\Big\}\right) \le D_{\mathrm{KL}}\left(\pi^{\phi}(\cdot|t,x)\,\Big\|\, \exp\Big\{\frac{1}{\gamma} q(t,x,\cdot;\pi^\phi)\Big\}\right),$$
then $\pi^{\phi'}$ improves upon $\pi^\phi$. Thus, to update $\pi^\phi$ it suffices to solve the optimization problem
$$\min_{\phi' \in \Phi} D_{\mathrm{KL}}\left(\pi^{\phi'}(\cdot|t,x)\,\Big\|\, \exp\Big\{\frac{1}{\gamma} q(t,x,\cdot;\pi^\phi)\Big\}\right) = \min_{\phi' \in \Phi} \int_A \Big[\log \pi^{\phi'}(a|t,x) - \frac{1}{\gamma} q(t,x,a;\pi^\phi)\Big]\pi^{\phi'}(a|t,x)\,da.$$
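For a one-dimensional Gaussian family and a quadratic q, this KL objective has a closed-form minimizer: $\exp\{q/\gamma\}$ is proportional to a Gaussian density $N(\mu^*, v^*)$, and the best $\phi'$ matches it. A small grid search illustrates this; the relation $q(a)/\gamma = -(a-\mu^*)^2/(2v^*)$ up to a constant, and all numbers below, are our own toy setup:

```python
# Grid search confirming the closed-form minimizer of the KL projection step.
import math

gamma, mu_star, v_star = 0.5, 0.4, 0.3

def objective(m, v):
    # E_{N(m,v)}[ log N(m,v)(a) - q(a)/gamma ], with expectations in closed form:
    # negative entropy of N(m,v) plus E[(a - mu*)^2]/(2 v*)
    entropy_term = -0.5 * math.log(2 * math.pi * v) - 0.5
    quad_term = ((m - mu_star) ** 2 + v) / (2 * v_star)
    return entropy_term + quad_term

best = min(((objective(m / 10, v / 10), m / 10, v / 10)
            for m in range(-20, 21) for v in range(1, 21)), key=lambda t: t[0])
print(best[1], best[2])   # grid minimizer coincides with (mu*, v*)
```

The objective is strictly convex in the mean and in the variance, so the grid minimizer lands exactly on $(\mu^*, v^*)$, consistent with the claim that the KL projection recovers the Gibbs target when the family contains it.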
$$q(t, X_t^\pi, a_t^\pi; \pi)\,dt = dJ(t, X_t^\pi; \pi) + r(t, X_t^\pi, a_t^\pi)\,dt - \beta J(t, X_t^\pi; \pi)\,dt + \{\cdots\}\,dW_t. \tag{35}$$
Plugging this into (34) and ignoring the martingale difference term $\{\cdots\}dW_t$, whose
mean is 0, the gradient-based updating rule for $\phi$ becomes
$$\phi \leftarrow \phi + \alpha_\phi \Big[-\gamma \log \pi^\phi\big(a_t^{\pi^\phi}\big|t, X_t^{\pi^\phi}\big)dt + dJ\big(t, X_t^{\pi^\phi}; \pi^\phi\big) + r\big(t, X_t^{\pi^\phi}, a_t^{\pi^\phi}\big)dt - \beta J\big(t, X_t^{\pi^\phi}; \pi^\phi\big)dt\Big] \frac{\partial}{\partial \phi} \log \pi^\phi\big(a_t^{\pi^\phi}\big|t, X_t^{\pi^\phi}\big). \tag{36}$$
In this updating rule, we only need to approximate J(·, ·; π ϕ ), which is a problem of PE.
Specifically, denote by J θ (·, ·) the function approximator of J(·, ·; π ϕ ), where θ ∈ Θ, and
we may apply any PE methods developed in Jia and Zhou (2022a) to learn J θ . Given this
value function approximation, the updating rule (36) further specializes to
$$\phi \leftarrow \phi + \alpha_\phi \Big[-\gamma \log \pi^\phi\big(a_t^{\pi^\phi}\big|t, X_t^{\pi^\phi}\big)dt + dJ^\theta\big(t, X_t^{\pi^\phi}\big) + r\big(t, X_t^{\pi^\phi}, a_t^{\pi^\phi}\big)dt - \beta J^\theta\big(t, X_t^{\pi^\phi}\big)dt\Big] \frac{\partial}{\partial \phi} \log \pi^\phi\big(a_t^{\pi^\phi}\big|t, X_t^{\pi^\phi}\big). \tag{37}$$
Expression (37) coincides with the updating rule in the policy gradient method established
in Jia and Zhou (2022b) when the regularizer therein is taken to be the entropy. Moreover,
since we learn the value function and the policy simultaneously, the updating rule leads to
actor–critic type algorithms. Finally, with this method, while learning the value function
may involve martingale conditions as in Jia and Zhou (2022a), learning the policy does not.
Schulman et al. (2017) note the equivalence between soft Q-learning and PG in discrete
time. Here we present the continuous-time counterpart of this equivalence, with
a theoretical justification, which recovers the PG-based algorithms in Jia and Zhou (2022b)
when combined with suitable PE methods.
$$\int_A \Big[H\Big(x, a, \frac{\partial J}{\partial x}(x;\pi), \frac{\partial^2 J}{\partial x^2}(x;\pi)\Big) - \gamma \log \pi(a|x)\Big]\pi(a|x)\,da - V(\pi) = 0, \quad x \in \mathbb{R}^d. \tag{38}$$
$$D_{\mathrm{KL}}\left(\pi'(\cdot|x)\,\Big\|\, \exp\Big\{\frac{1}{\gamma} H\Big(x,\cdot,\frac{\partial J}{\partial x}(x;\pi), \frac{\partial^2 J}{\partial x^2}(x;\pi)\Big)\Big\}\right) \le D_{\mathrm{KL}}\left(\pi(\cdot|x)\,\Big\|\, \exp\Big\{\frac{1}{\gamma} H\Big(x,\cdot,\frac{\partial J}{\partial x}(x;\pi), \frac{\partial^2 J}{\partial x^2}(x;\pi)\Big)\Big\}\right),$$
then $V(\pi') \ge V(\pi)$.
value function. Lastly, since the value does not depend on the initial time, we will fix the
latter as 0 in the following discussions and applications of ergodic learning tasks.
As with the episodic case, for any admissible policy π, we can define the ∆t-parameterized
Q-function as
$$\begin{aligned}
Q_{\Delta t}(x, a; \pi) ={}& \mathbb{E}^{\mathbb{P}^W}\bigg[\int_t^{t+\Delta t}\big[r(X_s^a, a) - V(\pi)\big]ds \\
&\qquad + \lim_{T\to\infty} \mathbb{E}^{\mathbb{P}}\Big[\int_{t+\Delta t}^T \big[r(X_s^\pi, a_s^\pi) - \gamma\log\pi(a_s^\pi|X_s^\pi) - V(\pi)\big]ds \,\Big|\, X_{t+\Delta t}^a\Big] \,\bigg|\, X_t^a = x\bigg] \\
={}& \mathbb{E}^{\mathbb{P}^W}\bigg[\int_t^{t+\Delta t}\big[r(X_s^a, a) - V(\pi)\big]ds + J\big(X_{t+\Delta t}^a; \pi\big) \,\bigg|\, X_t^a = x\bigg] \\
&\qquad - \mathbb{E}^{\mathbb{P}^W}\Big[\lim_{T\to\infty}\mathbb{E}^{\mathbb{P}}\big[J(X_T^\pi;\pi)\,\big|\,X_{t+\Delta t}^a\big] \,\Big|\, X_t^a = x\Big] \\
={}& \mathbb{E}^{\mathbb{P}^W}\bigg[\int_t^{t+\Delta t}\big[r(X_s^a, a) - V(\pi)\big]ds + J\big(X_{t+\Delta t}^a; \pi\big) - J(x;\pi) \,\bigg|\, X_t^a = x\bigg] + J(x;\pi) \\
&\qquad - \mathbb{E}^{\mathbb{P}^W}\Big[\lim_{T\to\infty}\mathbb{E}^{\mathbb{P}}\big[J(X_T^\pi;\pi)\,\big|\,X_{t+\Delta t}^a\big] \,\Big|\, X_t^a = x\Big] \\
={}& \mathbb{E}^{\mathbb{P}^W}\bigg[\int_t^{t+\Delta t}\Big[H\Big(X_s^a, a, \frac{\partial J}{\partial x}(X_s^a;\pi), \frac{\partial^2 J}{\partial x^2}(X_s^a;\pi)\Big) - V(\pi)\Big]ds \,\bigg|\, X_t^a = x\bigg] + J(x;\pi) \\
&\qquad - \mathbb{E}^{\mathbb{P}^W}\Big[\lim_{T\to\infty}\mathbb{E}^{\mathbb{P}}\big[J(X_T^\pi;\pi)\,\big|\,X_{t+\Delta t}^a\big] \,\Big|\, X_t^a = x\Big] \\
={}& J(x;\pi) + \Big[H\Big(x, a, \frac{\partial J}{\partial x}(x;\pi), \frac{\partial^2 J}{\partial x^2}(x;\pi)\Big) - V(\pi)\Big]\Delta t - \bar{J} + O\big((\Delta t)^2\big),
\end{aligned}$$
where we assumed that $\lim_{T\to\infty}\mathbb{E}^{\mathbb{P}}\big[J(X_T^\pi;\pi)\,\big|\,X_t^\pi = x\big] = \bar{J}$ is a constant for any $(t,x) \in [0,+\infty)\times\mathbb{R}^d$.
The corresponding q-function is then defined as
$$q(x, a; \pi) = H\Big(x, a, \frac{\partial J}{\partial x}(x;\pi), \frac{\partial^2 J}{\partial x^2}(x;\pi)\Big) - V(\pi).$$
Then
(i) V̂ , Jˆ and q̂ are respectively the value, the value function and the q-function associated
with π if and only if for all x ∈ Rd , the following process
$$\hat{J}(X_t^\pi) + \int_0^t \big[r(X_s^\pi, a_s^\pi) - \hat{q}(X_s^\pi, a_s^\pi) - \hat{V}\big]ds \tag{41}$$
is an ({Ft }t≥0 , P)-martingale, where {Xtπ , 0 ≤ t < ∞} is the solution to (6) with
X0π = x.
(ii) If V̂ , Jˆ and q̂ are respectively the value, value function and the q-function associated
with π, then for any admissible π ′ and any x ∈ Rd , the following process
$$\hat{J}\big(X_t^{\pi'}\big) + \int_0^t \big[r\big(X_u^{\pi'}, a_u^{\pi'}\big) - \hat{q}\big(X_u^{\pi'}, a_u^{\pi'}\big) - \hat{V}\big]du \tag{42}$$
is an $(\{\mathcal{F}_t\}_{t\ge 0}, \mathbb{P})$-martingale, where $\{X_t^{\pi'}, 0 \le t < \infty\}$ is the solution to (6) under $\pi'$
with initial condition $X_0^{\pi'} = x$.
(iii) If there exists an admissible $\pi'$ such that for all $x \in \mathbb{R}^d$, (42) is an $(\{\mathcal{F}_t\}_{t\ge 0}, \mathbb{P})$-martingale
where $X_0^{\pi'} = x$, then $\hat{V}$, $\hat{J}$ and $\hat{q}$ are respectively the value, value function
and the q-function associated with $\pi$.
Moreover, in any of the three cases above, if it holds further that $\pi(a|x) = \frac{\exp\{\frac{1}{\gamma}\hat{q}(x,a)\}}{\int_A \exp\{\frac{1}{\gamma}\hat{q}(x,a)\}\,da}$, then $\pi$ is the optimal policy.
Based on Theorem 12, we can design corresponding q-learning algorithms for ergodic
problems, in which we learn V, J θ and q ψ at the same time. These algorithms have natural
connections with SARSA as well as the PG-based algorithms developed in Jia and Zhou
(2022b). The discussions are similar to those in Subsections 4.2 and 5.2 and hence are omit-
ted here. An online algorithm for ergodic tasks is presented as an example; see Algorithm
4.
7 Applications
7.1 Mean–variance portfolio selection
We first review the formulation of the exploratory mean–variance portfolio selection prob-
lem, originally proposed by Wang and Zhou (2020) and later revisited by Jia and Zhou
(2022b).15 The investment universe consists of one risky asset (e.g., a stock index) and one
risk-free asset (e.g., a saving account) whose risk-free interest rate is r. The price of the
risky asset {St , 0 ≤ t ≤ T } is governed by a geometric Brownian motion with drift µ and
volatility σ > 0, defined on a filtered probability space (Ω, F, PW ; {FtW }0≤t≤T ). An agent
has a fixed investment horizon 0 < T < ∞ and an initial endowment x0 . A self-financing
portfolio is represented by the real-valued adapted process a = {at , 0 ≤ t ≤ T }, where at
15. There is a vast literature on classical model-based (non-exploratory) continuous-time mean–variance
models formulated as stochastic control problems; see e.g. Zhou and Li (2000); Lim and Zhou (2002);
Zhou and Yin (2003) and the references therein.
is the discounted dollar value invested in the risky asset at time $t$. Then her discounted
wealth process follows
$$dX_t = a_t\big[(\mu - r)dt + \sigma dW_t\big] = a_t\,\frac{d(e^{-rt}S_t)}{e^{-rt}S_t}, \quad X_0 = x_0.$$
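A minimal simulation of these wealth dynamics under a constant dollar allocation $a_t = a$ can serve as a sanity check: by the dynamics above, the terminal mean should be $x_0 + a(\mu - r)T$. The market parameters below are our own illustrative choices:

```python
# Euler simulation of dX = a[(mu - r) dt + sigma dW] under a constant allocation.
import random, math

random.seed(1)
mu, r, sigma, T, K = 0.08, 0.02, 0.2, 1.0, 50
a, x0 = 2.0, 1.0
dt = T / K

def terminal_wealth():
    x = x0
    for _ in range(K):
        x += a * ((mu - r) * dt + sigma * math.sqrt(dt) * random.gauss(0, 1))
    return x

xs = [terminal_wealth() for _ in range(20000)]
mean_xT = sum(xs) / len(xs)
print(abs(mean_xT - (x0 + a * (mu - r) * T)) < 0.02)   # matches x0 + a(mu - r)T
```

The same simulator, with $a_t$ drawn from a stochastic policy instead of held fixed, is the rollout engine that the learning algorithms in this section operate on.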
The agent aims to minimize the variance of the discounted value of the portfolio at time
T while maintaining the expected return at a given level; that is,
$$\min_{a} \mathrm{Var}(X_T) \quad \text{subject to} \quad \mathbb{E}[X_T] = z,$$
where z is the target expected return, and the variance and expectation throughout this
subsection are with respect to the probability measure PW .
The exploratory formulation of this problem with the entropy regularizer is equivalent to
$$J(t, x; w) = \max_{\pi} \mathbb{E}^{\mathbb{P}}\Big[-\big(\tilde{X}_T^\pi - w\big)^2 - \int_t^T \gamma \log \pi\big(a_s^\pi\big|s, \tilde{X}_s^\pi\big)ds \,\Big|\, \tilde{X}_t^\pi = x\Big] + (w - z)^2, \tag{44}$$
subject to
$$d\tilde{X}_s^\pi = (\mu - r)\int_{\mathbb{R}} a\,\pi\big(a|s, \tilde{X}_s^\pi\big)da\,ds + \sigma\sqrt{\int_{\mathbb{R}} a^2\,\pi\big(a|s, \tilde{X}_s^\pi\big)da}\,dW_s; \quad \tilde{X}_t^\pi = x,$$
where w is the Lagrange multiplier introduced to relax the expected return constraint; see
Wang and Zhou (2020) for a derivation of this formulation. Note here we artificially add
the minus sign to transform the variance minimization to a maximization problem to be
consistent with our general formulation.
We now follow Wang and Zhou (2020) and Jia and Zhou (2022b) to parameterize the
value function by
$$\Delta\psi = \sum_{i=0}^{K-1} \zeta_{t_i}\big[J^\theta(t_{i+1}, x_{t_{i+1}}; w) - J^\theta(t_i, x_{t_i}; w) - q^\psi(t_i, x_{t_i}, a_{t_i}; w)\,\Delta t\big].$$
Update θ and ψ by
θ ← θ + l(j)αθ ∆θ.
ψ ← ψ + l(j)αψ ∆ψ.
Update w (Lagrange multiplier) every m episodes:
if j ≡ 0 mod m then
$$w \leftarrow w - \alpha_w\Big(\frac{1}{m}\sum_{i=j-m+1}^{j} X_T^{(i)} - z\Big).$$
end if
end for
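The Lagrange-multiplier update in the algorithm above moves w against the gap between the realized average terminal wealth and the target z. To see that this iteration enforces the expected-return constraint, here is a toy sketch in which the learned policy is abstracted into an assumed map from w to the mean terminal wealth (that map, and all numbers, are our own illustration):

```python
# Toy fixed-point iteration for the multiplier w: every batch of m episodes,
# w <- w - alpha_w * (sample mean of X_T - z).
import random

random.seed(2)
z, alpha_w, m, w = 1.4, 0.5, 10, 0.0

def sample_XT(w):
    # stand-in for "roll out the learned policy given w": assumed E[X_T] = 0.8 w + 0.1
    return 0.8 * w + 0.1 + random.gauss(0, 0.05)

for _ in range(400):                      # 400 batches of m episodes
    batch = [sample_XT(w) for _ in range(m)]
    w -= alpha_w * (sum(batch) / m - z)
print(abs(0.8 * w + 0.1 - z) < 0.05)      # implied E[X_T] has converged to z
```

The iteration is a stochastic approximation to the dual ascent on w; it settles at the w whose induced expected terminal wealth equals the target z.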
be learned for N times. After training completes, for each market configuration we apply
the learned policy out-of-sample repeatedly for 100 times and compute the average mean,
variance and Sharpe ratio for each algorithm. We are particularly interested in the im-
pact of time discretization; so we experiment with three different time-discretization steps:
$\Delta t \in \{\frac{1}{25}, \frac{1}{250}, \frac{1}{2500}\}$. Note that all the algorithms rely on time discretization at the implementation stage, where ∆t is the sampling frequency required for execution or computation
of numerical integral for PG and q-learning. For the Q-learning, ∆t plays a dual role: it is
both the parameter in its definition and the time discretization size in implementation.
The numerical results are reported in Tables 1–3, each corresponding to a different
value of ∆t. We observe that for any market configuration, Q-learning is almost always
the worst performer in terms of all the metrics, and even diverges in certain high-volatility,
low-return environments (when keeping the same learning rate as in the other cases).
The other two algorithms have very close overall performances, although q-learning
tends to outperform in highly volatile market environments. On the other hand, notably, the
results change only slightly with different time discretizations, for all the three algorithms.
This observation does not align with the claim in Tallec et al. (2019) that the classical Q-
learning is very sensitive to time-discretization, at least for this specific application problem.
In addition, we notice that q-learning and PG-based algorithm produce comparable results
under most market scenarios, while the terminal variance produced by q-learning tends to
be significantly higher than that by PG when |µ| is small or σ is large. In such cases, the
ground truth solution requires higher leverage in risky assets in order to reach the high
target expected return (40% annual return). However, it appears that overall q-learning
strives to learn to keep up with the high target return and as a result ends up with high
terminal variance in those cases. By contrast, the PG tends to maintain lower variance at
the cost of significantly underachieving the target return.
and the goal is to maximize the long-term average quadratic payoff
$$\liminf_{T\to\infty} \frac{1}{T}\,\mathbb{E}\Big[\int_0^T r(X_t, a_t)\,dt \,\Big|\, X_0 = x_0\Big],$$
with $r(x,a) = -\big(\frac{M}{2}x^2 + Rxa + \frac{N}{2}a^2 + Px + Qa\big)$.
The exploratory formulation of this problem with the entropy regularizer is equivalent to
$$\liminf_{T\to\infty} \frac{1}{T}\,\mathbb{E}\Big[\int_0^T \big[r(X_t^\pi, a_t^\pi) - \gamma\log\pi(a_t^\pi|X_t^\pi)\big]dt \,\Big|\, X_0^\pi = x_0\Big].$$
This problem falls into the general formulation of an ergodic task studied in Section
6; so we directly implement Algorithm 4 in our simulation. In particular, our function
approximators for J and q are quadratic functions:
$$J^\theta(x) = \theta_1 x^2 + \theta_2 x, \qquad q^\psi(x,a) = -\frac{e^{-\psi_3}}{2}\big(a - \psi_1 x - \psi_2\big)^2 - \frac{\gamma}{2}\big(\log 2\pi\gamma + \psi_3\big),$$
where the form of the q-function is derived to satisfy the constraint (29), as discussed in
Example 1. The policy associated with this parameterized q-function is π ψ (·|x) = N (ψ1 x +
ψ2 , γeψ3 ). In addition, we have another parameter V that stands for the long-term average.
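The identification of $\exp\{q^\psi/\gamma\}$ with the Gaussian density $N(\psi_1 x + \psi_2, \gamma e^{\psi_3})$ can be verified pointwise; the parameter values below are arbitrary:

```python
# Check that exp{q_psi(x, a)/gamma} equals the N(psi1*x + psi2, gamma*e^{psi3}) pdf.
import math

gamma, p1, p2, p3 = 0.1, 0.5, -0.2, 0.3

def q_psi(x, a):
    return -0.5 * math.exp(-p3) * (a - p1 * x - p2) ** 2 \
           - 0.5 * gamma * (math.log(2 * math.pi * gamma) + p3)

def gauss_pdf(a, mean, var):
    return math.exp(-(a - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x, max_err = 1.3, 0.0
for i in range(-50, 51):
    a = i * 0.1
    lhs = math.exp(q_psi(x, a) / gamma)
    rhs = gauss_pdf(a, p1 * x + p2, gamma * math.exp(p3))
    max_err = max(max_err, abs(lhs - rhs))
print(max_err < 1e-12)
```

This is the ergodic instance of the constraint (29): the zeroth-order term of $q^\psi$ is exactly the Gaussian log-normalizer, so the policy is obtained from the q-function with no extra normalization step.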
We then compare our online q-learning algorithm against two theoretical benchmarks
and two other online algorithms, in terms of the running average reward during the learning
process. The first benchmark is the omniscient optimal level, namely, the maximum long
term average reward that can be achieved with perfect knowledge about the environment
and reward (and hence exploration is unnecessary and only deterministic policies are con-
sidered). The second benchmark is the omniscient optimal level less the exploration cost,
which is the maximum long term average reward that can be achieved by the agent who
is however forced to explore under entropy regularization. The other two algorithms are
the PG-based ergodic algorithm proposed in (Jia and Zhou, 2022b, Algorithm 3) and the
∆t-based Q-learning algorithm presented in Appendix B.
In our simulation, to ensure the stationarity of the controlled state process, we set
A = −1, B = C = 0 and D = 1. Moreover, we set x0 = 0, M = N = Q = 2, R = P = 1,
and γ = 0.1. Learning rates for all algorithms are initialized as αψ = 0.001, and decay
Figure 2: (a) The path of learned ψ1; (b) the path of learned ψ2; (c) the path of learned $e^{\psi_3}$.
according to $l(t) = \frac{1}{\max\{1, \sqrt{\log t}\}}$. We also repeat the experiment 100 times under different
seeds to generate samples.
First, we fix ∆t = 0.1 and run the three learning algorithms for sufficiently long time
T = 106 . All the parameters to be learned are initialized as 0 except for the initial variance
of the policies which is set to be 1. Figure 1 plots the running average reward trajectories,
along with their standard deviations (which are all too small to be visible in the figure),
as the learning process proceeds. Among the three methods, both the PG and q-learning
algorithms perform significantly better and converge to the optimal level much faster than
the Q-learning. The q-learning algorithm is indeed slightly better than the PG one. Figure
2 shows the iterated values of the learned parameters for the three algorithms. Evidently,
the Q-learning has the slowest convergence. The other two algorithms, while eventually converging at
similar speeds, exhibit quite different behaviors on their way to convergence in
learning ψ1 and ψ2. The PG seems to learn these parameters faster than the q-learning
at the initial stage, but then quickly overshoots the optimal level and takes a long time to
correct it, while the q-learning appears much smoother and does not have such an issue.
At last, we vary the value of ∆t to examine its impact on the performance of the three
algorithms. We take ∆t ∈ {1, 0.1, 0.01} and set T = 10^5 for each experiment. The learning
rates and parameter initializations are fixed and the same for all three methods. We
repeat the experiment 100 times and present the results in terms of the average running
reward in Figure 3. We use the shaded area to denote the standard deviation. It is
clear that the Q-learning algorithm is very sensitive to ∆t. In particular, its performance
worsens significantly as ∆t becomes smaller. When ∆t = 0.01, the algorithm almost has
no improvement at all over time. This sensitivity to ∆t is consistent with the
observations made in Tallec et al. (2019). On the other hand, the other two algorithms
show remarkable robustness across different discretization sizes.
functions are not well approximated by finite sums with coarse time discretization (∆t = 1);
hence the corresponding martingale conditions under the learned parameters are not close
to the theoretical continuous-time martingale conditions.
8 Conclusion
The previous trilogy on continuous-time RL under the entropy-regularized exploratory diffusion
process framework (Wang et al., 2020; Jia and Zhou, 2022a,b) has respectively
studied three key elements: optimal samplers for exploration, PE, and PG. PG is a special
Jia and Zhou
instance of policy improvement, and in the discrete-time MDP literature it can be carried out
through the Q-function. However, there was no satisfactory Q-learning theory in continuous time,
so a different route was taken in Jia and Zhou (2022b): turning the PG task into a PE
problem.
This paper fills this gap and adds yet another essential building block to the theoretical
foundation of continuous-time RL, by developing a q-learning theory commensurate with
the continuous-time setting, attacking the general policy improvement problem, and cov-
ering both on- and off-policy learning. Although the conventional Q-function provides no
information about actions in continuous time, its first-order approximation, which we call
the q-function, does. Moreover, it turns out that the essential component of the q-function
relevant to policy updating is the Hamiltonian associated with the problem, the latter hav-
ing already played a vital role in the classical model-based stochastic control theory. This
fact, together with the expression of the optimal stochastic policies explicitly derived in
Wang et al. (2020), in turn, explains and justifies the widely used Boltzmann exploration
in classical RL for MDPs.
We characterize the q-function as the compensator to maintain the martingality of a
process consisting of the value function and cumulative reward, both in the on-policy and off-
policy settings. This characterization enables us to use the martingale-based PE techniques
developed in Jia and Zhou (2022a) to simultaneously learn the q-function and the value
function. Interestingly, the martingality is on the same process as in PE (Jia and Zhou,
2022a) but with respect to an enlarged filtration containing the policy randomization. The
TD version of the resulting algorithm links to the well-known SARSA algorithm in classical
Q-learning.
A significant and outstanding research question is the analysis of the convergence rates
of q-learning algorithms. The existing convergence rate results for discrete-time Q-learning
cannot be extended to continuous-time q-learning due to the continuous state space and
general nonlinear function approximations, along with the fact that the behaviors of the Q-
function and the q-function can be fundamentally different. A possible remedy is to carry
out the convergence analysis within the general framework of stochastic approximation,
which is poised to be the subject of a substantial future study.
We make a final observation to conclude this paper. The classical model-based approach
typically separates “estimation” and “optimization”, à la Wonham’s separation theorem
(Wonham, 1968) or the “plug-in” method. This approach first learns a model (including
formulating and estimating its coefficients/parameters) and then optimizes over the learned
model. Model-free (up to some basic dynamic structure such as diffusion processes or
Markov chains) RL takes a different route: it skips estimating a model and learns optimal
policies directly via PG or Q/q-learning. The q-learning theory in this paper suggests
that, to learn policies, one needs to learn the q-function or the Hamiltonian. In other
words, it is the Hamiltonian, a specific aggregation of the model coefficients, rather
than each and every individual coefficient, that needs to be learned/estimated for
optimization. Clearly, from a pure computational standpoint, estimating a single function
– the Hamiltonian – is much more efficient and robust than estimating multiple functions
(b, σ, r, h in the present paper) in terms of avoiding or reducing over-parameterization,
sensitivity to errors and accumulation of errors. More importantly, however, (35) hints
that the Hamiltonian/q-function can be learned through temporal differences of the value
function, so the task of learning and optimizing can be accomplished in a data-driven way.
This would not be the case if we chose to learn individual model coefficients separately. This
observation, we believe, highlights the fundamental differences between the model-based and
model-free approaches.
Acknowledgments

Zhou is supported by a start-up grant and the Nie Center for Intelligent Asset Management at Columbia University. His work is also part of a Columbia-CityU/HK collaborative
project that is supported by the InnoHK Initiative, The Government of the HKSAR, and
the AIFT Lab. We are grateful for comments from seminar and conference participants at
Chinese University of Hong Kong, University of Iowa, Columbia University, Seoul National
University, POSTECH, Ritsumeikan University, Shanghai University of Finance and Eco-
nomics, Fudan University, The 2022 INFORMS Annual Meeting in Indianapolis, The 11th
Annual Meeting of FE Branch in OR Society of China in Shijiazhuang, Conference in Mem-
ory of Tomas Björk in Stockholm, and Post-Bachelier Congress Workshop in Hong Kong.
In particular, we benefited from discussions with Jiro Akahori, Xuefeng Gao, Bong-Gyu Jang,
Hyeng Keun Koo, Lingfei Li, Hideo Nagai, Jun Sekine, Wenpin Tang, and David Yao. We
are especially indebted to the three anonymous referees for their constructive and detailed
comments that have led to an improved version of the paper.
Adding the entropy term −γ log π(a|x) to both sides of (46), integrating over a and noting
(45), we obtain a relation between the Q-function and the value function:

J(x; π) = E_{a∼π(·|x)}[ Q(x, a; π) − γ log π(a|x) ].    (47)

Moreover, substituting J(X_1^a; π) with (47) into the right hand side of (46), we further obtain

Q(x, a; π) = r(x, a) + βE[ Q(X_1^a, a_1^π; π) − γ log π(a_1^π | X_1^a) | X_0^π = x, a_0^π = a ].    (48)
Applying suitable algorithms (e.g., stochastic approximation) to solve the equation (48)
leads to the classical SARSA algorithm.
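Equation (48) is precisely the fixed-point relation that tabular SARSA solves by stochastic approximation, with π taken as the Boltzmann policy generated from the current Q-estimate. A minimal sketch on a small randomly generated MDP (the state/action counts, discount β, temperature γ, and step size below are all hypothetical, not from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, beta, gamma_T, alpha = 3, 2, 0.9, 0.5, 0.05   # gamma_T = temperature
P = rng.dirichlet(np.ones(nS), size=(nS, nA))          # transition kernel P(x'|x,a)
R = rng.uniform(0.0, 1.0, size=(nS, nA))               # reward table r(x,a)
Q = np.zeros((nS, nA))

def policy(x):
    # Boltzmann (Gibbs) policy generated from the current Q estimate
    z = np.exp((Q[x] - Q[x].max()) / gamma_T)
    return z / z.sum()

x = 0
a = rng.choice(nA, p=policy(x))
for _ in range(20000):
    xn = rng.choice(nS, p=P[x, a])
    an = rng.choice(nA, p=policy(xn))
    # stochastic-approximation step for (48):
    # Q(x,a) <- Q(x,a) + alpha * [ r + beta*(Q(x',a') - gamma*log pi(a'|x')) - Q(x,a) ]
    target = R[x, a] + beta * (Q[xn, an] - gamma_T * np.log(policy(xn)[an]))
    Q[x, a] += alpha * (target - Q[x, a])
    x, a = xn, an

print(np.round(Q, 3))
```

The entropy term −γ log π(a'|x') inside the target is what distinguishes this entropy-regularized SARSA from the vanilla version.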
On the other hand, we can rewrite (46) as

J(x; π) = r(x, a) − [Q(x, a; π) − J(x; π)] + βE[ J(X_1^a; π) | X_0^π = x, a_0^π = a ].    (49)
Recall that A(x, a; π) := Q(x, a; π) − J(x; π) is called the advantage function for MDPs.
The q-function in the main text (see (17)) is hence an advantage rate function in the
continuous-time setting. In particular, (47) is analogous to the second constraint on the q-function in (29).
Furthermore, it follows from (49) that

M_s^{π′} := β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ]

is an ({F_s}_{s≥0}, P)-martingale under any policy π′, where F_s is the σ-algebra generated by
X_0^{π′}, a_0^{π′}, · · · , X_s^{π′}, a_s^{π′}. To see this, we compute

E[ β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ] | F_{s−1} ]
= β^s E[ J(X_s^{π′}; π) | X_{s−1}^{π′}, a_{s−1}^{π′} ] + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ]
= β^{s−1} [ J(X_{s−1}^{π′}; π) − r(X_{s−1}^{π′}, a_{s−1}^{π′}) + A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) ] + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ]
= β^{s−1} J(X_{s−1}^{π′}; π) + Σ_{t=0}^{s−2} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ].    (50)
Note that here F_s is larger than the usual “historical information set” up to time s,
denoted by H_s, which is the σ-algebra generated by X_0^{π′}, a_0^{π′}, · · · , X_s^{π′} without observing
the last a_s^{π′} (i.e., H_s is F_s excluding a_s^{π′}). Because M_s^{π′} is measurable with respect to the smaller
σ-algebra H_s, it is automatically an ({H_s}_{s≥0}, P)-martingale.
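The computation in (50) rests on the one-step relation βE[J(X_1^a; π) | X_0 = x, a_0 = a] = J(x; π) − r(x, a) + A(x, a; π), obtained by rearranging (49). This relation can be checked exactly on a discrete toy MDP: the sketch below solves (46)–(47) for (J, Q) by fixed-point iteration, with all model data (sizes, kernel, rewards, target policy) randomly generated and hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, beta, gamma = 3, 2, 0.9, 0.5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # transition kernel P(x'|x,a)
R = rng.uniform(0.0, 1.0, size=(nS, nA))          # reward table r(x,a)
pi = rng.dirichlet(np.ones(nA), size=nS)          # an arbitrary fixed target policy

# Solve (46)-(47) for (J, Q) by fixed-point iteration (a beta-contraction).
Q = np.zeros((nS, nA))
for _ in range(2000):
    J = (pi * (Q - gamma * np.log(pi))).sum(axis=1)   # (47)
    Q = R + beta * P @ J                              # (46)
A = Q - J[:, None]                                    # advantage function

# One-step identity behind (50): beta*E[J(X1)|x,a] = J(x) - r(x,a) + A(x,a;pi)
lhs = beta * P @ J
rhs = J[:, None] - R + A
assert np.allclose(lhs, rhs, atol=1e-8)
print("one-step identity holds for every (x, a)")
```

Since the identity holds for every state-action pair, the compensated sum M_s^{π′} has mean-zero increments under any behavior policy, which is the off-policy robustness discussed above.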
The above analysis shows that M_s^{π′} is a martingale for any π′ – whether the target policy
π or a behavior one – so long as the advantage function A is taken as the compensator in
defining M_s^{π′}. This is the essential reason why Q-learning works for both on- and off-policy
learning. However, the conclusion is not necessarily true for other types of learning
methods. For example, let us take a different compensator and investigate the process
β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′}|X_t^{π′}) ]. The continuous-time counterpart
of this process is required to be a martingale in the policy gradient method; see (Jia and Zhou,
2022b, Theorem 4).
Now, conditioned on H_{s−1} and when π′ = π, we have

E[ β^s J(X_s^π; π) + Σ_{t=0}^{s−1} β^t [ r(X_t^π, a_t^π) − γ log π(a_t^π|X_t^π) ] | H_{s−1} ]
= β^s E[ E[ J(X_s^π; π) | X_{s−1}^π, a_{s−1}^π ] | X_{s−1}^π ] + Σ_{t=0}^{s−1} β^t E[ r(X_t^π, a_t^π) − γ log π(a_t^π|X_t^π) | H_{s−1} ]
= β^{s−1} E[ J(X_{s−1}^π; π) − r(X_{s−1}^π, a_{s−1}^π) + A(X_{s−1}^π, a_{s−1}^π; π) + r(X_{s−1}^π, a_{s−1}^π) − γ log π(a_{s−1}^π|X_{s−1}^π) | X_{s−1}^π ] + Σ_{t=0}^{s−2} β^t [ r(X_t^π, a_t^π) − γ log π(a_t^π|X_t^π) ]
= β^{s−1} J(X_{s−1}^π; π) + Σ_{t=0}^{s−2} β^t [ r(X_t^π, a_t^π) − γ log π(a_t^π|X_t^π) ],
where the last equality is due to (47), which can be applied because a_{s−1}^π is generated under
the policy π. This shows that the process is an ({H_s}_{s≥0}, P)-martingale when π′ = π,
which in turn underpins the on-policy learning.
However, when conditioned on F_{s−1}, we have

E[ β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′}|X_t^{π′}) ] | F_{s−1} ]
= β^s E[ J(X_s^{π′}; π) | X_{s−1}^{π′}, a_{s−1}^{π′} ] + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′}|X_t^{π′}) ]
= β^{s−1} [ J(X_{s−1}^{π′}; π) − r(X_{s−1}^{π′}, a_{s−1}^{π′}) + A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) ] + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′}|X_t^{π′}) ]
= β^{s−1} J(X_{s−1}^{π′}; π) + Σ_{t=0}^{s−2} β^t [ r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′}|X_t^{π′}) ]
  + β^{s−1} [ A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) − γ log π(a_{s−1}^{π′}|X_{s−1}^{π′}) ].
So the same process is not an ({F_s}_{s≥0}, P)-martingale in general, unless A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) =
γ log π(a_{s−1}^{π′}|X_{s−1}^{π′}) for any X_{s−1}^{π′}, a_{s−1}^{π′}. Two conclusions can be drawn from this example: on
the one hand, policy gradient works for on-policy learning but not for off-policy learning; on the other hand, it
is important to choose the correct filtration to ensure martingality and hence the correct
test functions in designing learning algorithms.
The analysis in (50) yields a joint martingale characterization of (J, A), analogous to
Theorem 7. It is curious that, to the best of our knowledge, such a martingale condition, while
not hard to derive in the MDP setting, has not been explicitly introduced or utilized to
study Q-learning in the existing RL literature.
Define

softγmax_a Q*(x, a) := E_{a∼π*}[ Q*(x, a) − γ log π*(a|x) ] ≡ γ log Σ_{a∈A} exp{ (1/γ) Q*(x, a) }.
Then the Bellman equation becomes J*(x) = softγmax_a Q*(x, a) and (52) reduces to

Q*(x, a) = r(x, a) + βE[ softγmax_{a′} Q*(X_1^a, a′) | X_0 = x, a_0 = a ],    (53)
which is the foundation for (off-policy) Q-learning algorithms; see e.g. (Sutton and Barto,
2018, p. 131).
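The defining identity of softγmax, namely that the entropy-adjusted expectation under the Boltzmann policy equals γ log Σ_a exp{Q*(x, a)/γ}, can be verified numerically; the Q-values and temperature below are hypothetical:

```python
import numpy as np

gamma = 0.7
Q = np.array([0.2, -1.0, 0.5, 0.1])             # hypothetical Q*(x, .) values

pi = np.exp(Q / gamma); pi /= pi.sum()          # Boltzmann policy from Q*
lhs = (pi * (Q - gamma * np.log(pi))).sum()     # E_{a~pi*}[Q* - gamma*log pi*]
rhs = gamma * np.log(np.exp(Q / gamma).sum())   # gamma * log sum_a exp{Q*/gamma}
assert abs(lhs - rhs) < 1e-10
print(round(float(lhs), 6))
```

The equality is exact because log π* = Q*/γ − log Σ exp{Q*/γ}, so the action-dependent terms cancel inside the expectation.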
On the other hand, because J*(x) = softγmax_a Q*(x, a) = γ log Σ_{a∈A} exp{ (1/γ) Q*(x, a) },
we have

Σ_{a∈A} exp{ (1/γ) [ Q*(x, a) − J*(x) ] } = 1.    (54)
Recall the advantage function A∗ (x, a) = Q∗ (x, a) − J ∗ (x). Thus (54) is analogous to (25).
Moreover, rearranging (52) yields

J*(x) = r(x, a) − [ Q*(x, a) − J*(x) ] + βE[ J*(X_1^a) | X_0 = x, a_0 = a ].
Similar to (50), the process

β^s J*(X_s^π) + Σ_{t=0}^{s−1} β^t [ r(X_t^π, a_t^π) − A*(X_t^π, a_t^π) ]

is an ({F_s}_{s≥0}, P)-martingale for any policy π; hence it is analogous to the martingale
characterization of the optimal q-function in Theorem 9.
Suppose the policy is the Gibbs policy generated from a parameterized Q-function Q^ψ_∆t:

π^ψ(a|t, x) = exp{ (1/(γ∆t)) Q^ψ_∆t(t, x, a) } / ∫_A exp{ (1/(γ∆t)) Q^ψ_∆t(t, x, a) } da.
Then, the conventional Q-learning method (e.g., SARSA; see Sutton and Barto 2018, Chapter
10) leads to the following updating rule for the Q-function parameters ψ:

ψ ← ψ + α_ψ [ Q^ψ_∆t(t+∆t, X_{t+∆t}^{a_t}, a_{t+∆t}) − γ log π^ψ(a_{t+∆t} | t+∆t, X_{t+∆t}^{a_t}) ∆t
  − Q^ψ_∆t(t, X_t, a_t) + r(t, X_t, a_t) ∆t − β Q^ψ_∆t(t, X_t, a_t) ∆t ] ∂Q^ψ_∆t/∂ψ (t, X_t, a_t).    (55)
Thus, following the same lines of argument as in Subsection 5.1, we derive the updating
rule for ϕ at every step as

ϕ ← ϕ − α_ϕ [ log π^ϕ(a_t | t, X_t) − (1/(γ∆t)) Q^ψ_∆t(t, X_t, a_t) ] ∂/∂ϕ log π^ϕ(a_t | t, X_t),

where a_t ∼ π^ϕ(·|t, X_t).
Q_∆t(x, a; π) = J(x; π) + [ H(x, a, ∂J/∂x(x; π), ∂²J/∂x²(x; π)) − V(π) ] ∆t + o(∆t).    (56)
Then we have all the parallel results regarding the relation among the Q-function, the value
function and the intertemporal increment of the Q-function, and can devise algorithms accordingly.
For illustration, we only present the results when

π^ψ(a|x) = exp{ (1/(γ∆t)) Q^ψ_∆t(x, a) } / ∫_A exp{ (1/(γ∆t)) Q^ψ_∆t(x, a) } da.
Q^ψ_∆t(t, x, a; w) = −e^{−ψ_3(T−t)} [ (x−w)² + ψ_2 a(x−w) + (1/2) e^{−ψ_1} a² ] + ψ_4(t² − T²) + ψ_5(t − T) + (w − z)².

Here we use −e^{−ψ_1} < 0 to guarantee the concavity of the Q-function in a so that the
function can be maximized.

This Q-function in turn gives rise to an explicit form of policy, which is Gaussian:

π^ψ(·|t, x; w) = N( −ψ_2 e^{ψ_1}(x − w), γ∆t e^{ψ_3(T−t)+ψ_1} ) ∝ exp{ (1/(γ∆t)) Q^ψ_∆t(t, x, ·) }.
The parameters in the Q-function can then be updated based on (55) and the Lagrange
multiplier can be updated similarly as in Algorithm 5.
Q^ψ_∆t(x, a) = −(e^{−ψ_3}/2) (a − ψ_1 x − ψ_2)² + ψ_4 x² + ψ_5 x.
Here we omit the constant term since the value function is unique only up to a constant.
We also use −e−ψ3 < 0 to ensure that the Q-function is concave in a. The optimal value V
is an extra parameter.
The Q-function leads to

π^ψ(·|x) = N( ψ_1 x + ψ_2, γ∆t e^{ψ_3} ) ∝ exp{ (1/(γ∆t)) Q^ψ_∆t(x, ·) }.
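One can check numerically that normalizing exp{Q^ψ_∆t(x, ·)/(γ∆t)} over a indeed reproduces this Gaussian density; the parameter values below are hypothetical:

```python
import numpy as np

# Hypothetical parameter values for the parameterization above.
psi1, psi2, psi3 = 0.4, -0.2, 0.3
gamma, dt, x = 1.0, 0.1, 0.8

a = np.linspace(-4.0, 4.0, 4001)
da = a[1] - a[0]
Q = -np.exp(-psi3) / 2 * (a - psi1 * x - psi2) ** 2   # a-dependent part of Q

g = np.exp((Q - Q.max()) / (gamma * dt))              # unnormalized Gibbs weights
g /= g.sum() * da                                     # normalize on the grid

mean, var = psi1 * x + psi2, gamma * dt * np.exp(psi3)
normal = np.exp(-(a - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
assert np.max(np.abs(g - normal)) < 1e-4
print("Gibbs density of Q matches N(psi1*x + psi2, gamma*dt*exp(psi3))")
```

The constant terms ψ_4 x² + ψ_5 x drop out of the normalization, which is why only the quadratic-in-a part of Q^ψ_∆t matters for the policy.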
Finally, we compute the following derivative for use in Q-learning algorithms:

∂Q^ψ_∆t/∂ψ (x, a) = ( e^{−ψ_3}(a − ψ_1 x − ψ_2) x,  e^{−ψ_3}(a − ψ_1 x − ψ_2),  (e^{−ψ_3}/2)(a − ψ_1 x − ψ_2)²,  x²,  x )^⊤.
Then the parameters can be updated based on (57).
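The closed-form gradient above can be sanity-checked against central finite differences; a sketch with hypothetical parameter values:

```python
import numpy as np

def Q(psi, x, a):
    # Q^psi_dt(x, a) from the parameterization above
    p1, p2, p3, p4, p5 = psi
    return -np.exp(-p3) / 2 * (a - p1 * x - p2) ** 2 + p4 * x ** 2 + p5 * x

def grad_Q(psi, x, a):
    # closed-form dQ/dpsi
    p1, p2, p3, _, _ = psi
    d = a - p1 * x - p2
    return np.array([np.exp(-p3) * d * x,
                     np.exp(-p3) * d,
                     np.exp(-p3) / 2 * d ** 2,
                     x ** 2,
                     x])

psi = np.array([0.4, -0.2, 0.3, 0.1, -0.5])   # hypothetical parameter values
x, a, eps = 0.8, 1.1, 1e-6
fd = np.array([(Q(psi + eps * e, x, a) - Q(psi - eps * e, x, a)) / (2 * eps)
               for e in np.eye(5)])
assert np.allclose(fd, grad_Q(psi, x, a), atol=1e-6)
print("closed-form gradient matches finite differences")
```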
Lemma 13 Let a function q(·) on A with ∫_A exp{ (1/γ) q(a) } da < ∞ be given. Then

π*(a) = exp{ (1/γ) q(a) } / ∫_A exp{ (1/γ) q(a) } da ∈ P(A)

is the unique maximizer of the following problem:

max_{π(·)∈P(A)} ∫_A [ q(a) − γ log π(a) ] π(a) da.    (58)
Proof Set ν = 1 − log ∫_A exp{ (1/γ) q(a) } da and consider a new problem:

max_{π(·)∈P(A)} ∫_A ( [ q(a) − γ log π(a) ] π(a) + γν π(a) ) da − γν.    (59)

Noting ∫_A π(a) da = 1 due to π(·) ∈ P(A), problems (58) and (59) are equivalent; hence
we focus on problem (59). Relaxing the constraint of being a density function, we deduce

max_{π(·)∈P(A)} ∫_A ( [ q(a) − γ log π(a) ] π(a) + γν π(a) ) da − γν
≤ max_{π(·)>0} ∫_A ( [ q(a) − γ log π(a) ] π(a) + γν π(a) ) da − γν    (60)
≤ ∫_A max_{π>0} ( [ q(a) − γ log π ] π + γν π ) da − γν.

The unique maximizer of the inner optimization on the right hand side of the above is

exp{ (1/γ) q(a) + ν − 1 } = exp{ (1/γ) q(a) } / ∫_A exp{ (1/γ) q(a) } da =: π*(a).

As π* ∈ P(A), it is the optimal solution to (59) or, equivalently, (58). The uniqueness
follows from that of the inner optimization in (60).
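The conclusion of the lemma can be probed numerically on a discretized action space: the Gibbs density achieves a value of (58) at least as large as that of any competing density. A sketch with a hypothetical q(·) on A = [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.7
a = np.linspace(0.0, 1.0, 1000)            # discretized action space A = [0, 1]
da = a[1] - a[0]
q = np.sin(3 * a) - a ** 2                  # hypothetical function q(a)

def objective(pi):
    # integral of [q(a) - gamma*log pi(a)] * pi(a) over A on the grid
    return ((q - gamma * np.log(pi)) * pi).sum() * da

pi_star = np.exp(q / gamma)
pi_star /= pi_star.sum() * da               # the Gibbs density of the lemma

for _ in range(100):                        # random competing densities
    pi = rng.uniform(0.1, 1.0, size=a.size)
    pi /= pi.sum() * da
    assert objective(pi) <= objective(pi_star) + 1e-9
print("Gibbs density attains the largest entropy-regularized value")
```

The inequality holds exactly on the grid as well, since objective(π*) − objective(π) is γ times a (discrete) KL divergence, which is nonnegative.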
Now we are ready to prove Theorem 2. To ease notation we use the q-function q even
though it had not been introduced prior to Theorem 2 in the main text. We also employ
the results of Theorem 6 whose proof does not rely on Theorem 2 nor its proof here.
Recall from (16) that q(t, x, a; π) is defined as

q(t, x, a; π) = ∂J/∂t(t, x; π) + H(t, x, a, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) − βJ(t, x; π),

with the Hamiltonian H defined in (3).
Because π ′ = Iπ, it follows from Lemma 13 that for any (s, y) ∈ [0, T ] × Rd , we have
∫_A [ q(s, y, a; π) − γ log π′(a|s, y) ] π′(a|s, y) da ≥ ∫_A [ q(s, y, a; π) − γ log π(a|s, y) ] π(a|s, y) da = 0,
It follows from Assumption 1-(iv), Definition 1-(iii) and the moment estimate in Jia and
Zhou (2022b, Lemma 2) that there exist constants µ, C > 0 such that

E^{P^W}[ ∫_t^{T∧u_n} ∫_A e^{−β(s−t)} [ r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds | X̃_t^{π′} = x ]
≤ E^{P^W}[ ∫_t^T ∫_A | r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) | π′(a|s, X̃_s^{π′}) da ds | X̃_t^{π′} = x ]
≤ C ∫_t^T E^{P^W}[ 1 + |X̃_s^{π′}|^µ | X̃_t^{π′} = x ] ds ≤ C(1 + |x|^µ).

In addition,

E^{P^W}[ e^{−β(T∧u_n−t)} h(X̃_{T∧u_n}^{π′}) | X̃_t^{π′} = x ]
≤ E^{P^W}[ h(X̃_T^{π′}) 1_{{max_{t≤s≤T} |X̃_s^{π′}| < n}} | X̃_t^{π′} = x ] + E^{P^W}[ h(X̃_{u_n}^{π′}) 1_{{max_{t≤s≤T} |X̃_s^{π′}| ≥ n}} | X̃_t^{π′} = x ]
≤ C E^{P^W}[ 1 + |X̃_T^{π′}|^µ | X̃_t^{π′} = x ] + C(1 + n^µ) E^{P^W}[ 1_{{max_{t≤s≤T} |X̃_s^{π′}| ≥ n}} | X̃_t^{π′} = x ]
≤ C(1 + |x|^µ) + C(1 + n^µ) E^{P^W}[ max_{t≤s≤T} |X̃_s^{π′}|^{µ+1} | X̃_t^{π′} = x ] / n^{µ+1}
≤ C(1 + |x|^µ) + C(1 + n^µ)(1 + |x|^{µ+1}) / n^{µ+1} < ∞.

Again, by the dominated convergence theorem, we have, as n → ∞,

E^{P^W}[ e^{−β(T∧u_n−t)} h(X̃_{T∧u_n}^{π′}) | X̃_t^{π′} = x ] → E^{P^W}[ e^{−β(T−t)} h(X̃_T^{π′}) | X̃_t^{π′} = x ].
Hence, sending n → ∞, we conclude from (61) that J(t, x; π) ≤ J(t, x; π ′ ). This proves
that π ′ = Iπ improves upon π. Moreover, if Iπ ≡ π ′ = π, then J(t, x; π) ≡ J(t, x; π ′ ) =:
J ∗ (t, x), which satisfies the PDE (11). However, Lemma 13 shows that
∫_A [ H(t, x, a, ∂J*/∂x(t, x), ∂²J*/∂x²(t, x)) − γ log π′(a|t, x) ] π′(a|t, x) da
= sup_{π∈P(A)} ∫_A [ H(t, x, a, ∂J*/∂x(t, x), ∂²J*/∂x²(t, x)) − γ log π(a) ] π(a) da.
This means that J ∗ also satisfies the HJB equation (12), implying that J ∗ is the optimal
value function and hence π is the optimal policy.
Proof of Proposition 3
Proof Breaking the full time period [t, T ] into subperiods [t, t + ∆t) and [t + ∆t, T ], while
conditioning on the state at t + ∆t for the second subperiod, we obtain:
Q_∆t(t, x, a; π)
= E^P[ ∫_t^{t+∆t} e^{−β(s−t)} r(s, X_s^a, a) ds
  + E^P[ ∫_{t+∆t}^T e^{−β(s−t)} [ r(s, X_s^π, a_s^π) − γ log π(a_s^π|s, X_s^π) ] ds + e^{−β(T−t)} h(X_T^π) | X_{t+∆t}^a ] | X_t^{π̃} = x ]
= E^{P^W}[ ∫_t^{t+∆t} e^{−β(s−t)} r(s, X_s^a, a) ds + e^{−β∆t} J(t+∆t, X_{t+∆t}^a; π) | X_t^{π̃} = x ]
= E^{P^W}[ ∫_t^{t+∆t} e^{−β(s−t)} r(s, X_s^a, a) ds + e^{−β∆t} J(t+∆t, X_{t+∆t}^a; π) − J(t, x; π) | X_t^{π̃} = x ] + J(t, x; π)
= E^{P^W}[ ∫_t^{t+∆t} e^{−β(s−t)} ( ∂J/∂t(s, X_s^a; π) + H(s, X_s^a, a, ∂J/∂x(s, X_s^a; π), ∂²J/∂x²(s, X_s^a; π)) − βJ(s, X_s^a; π) ) ds | X_t^{π̃} = x ] + J(t, x; π)
= J(t, x; π) + [ ∂J/∂t(t, x; π) + H(t, x, a, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) − βJ(t, x; π) ] ∆t + o(∆t),
    (62)

where the second-to-last equality follows from Itô's lemma, and the last equality and the
error order are due to the approximation of the integral involved.
Proof of Corollary 5
Proof of Theorem 6
Proof First, (20) follows readily from its definition in Definition 4, the Feynman–Kac
formula (11), and the fact that ∂J/∂t(t, x; π) and βJ(t, x; π) both do not depend on the action a.
(i) We now focus on (18). Applying Itô's lemma to the process e^{−βs} J(s, X_s^π; π), we obtain
for 0 ≤ t < s ≤ T:

e^{−βs} J(s, X_s^π; π) − e^{−βt} J(t, x; π) + ∫_t^s e^{−βu} [ r(u, X_u^π, a_u^π) − q̂(u, X_u^π, a_u^π) ] du
= ∫_t^s e^{−βu} [ ∂J/∂t(u, X_u^π; π) + H(u, X_u^π, a_u^π, ∂J/∂x(u, X_u^π; π), ∂²J/∂x²(u, X_u^π; π)) − βJ(u, X_u^π; π)
  − q̂(u, X_u^π, a_u^π) ] du + ∫_t^s e^{−βu} ∂J/∂x(u, X_u^π; π) ◦ σ(u, X_u^π, a_u^π) dW_u
= ∫_t^s e^{−βu} [ q(u, X_u^π, a_u^π; π) − q̂(u, X_u^π, a_u^π) ] du + ∫_t^s e^{−βu} ∂J/∂x(u, X_u^π; π) ◦ σ(u, X_u^π, a_u^π) dW_u.
The continuity of X^π implies that u > t*, P-almost surely. Here “∧” means taking
the smaller one, i.e., u ∧ v = min{u, v}.

We have already proved that there exists Ω_0 ∈ F with P(Ω_0) = 0 such that for all
ω ∈ Ω \ Ω_0, ∫_{t*}^s e^{−βu} f(u, X_u^π(ω), a_u^π(ω)) du = 0 for all s ∈ [t*, T]. It follows from
Lebesgue's differentiation theorem that for any ω ∈ Ω \ Ω_0,

f(s, X_s^π(ω), a_s^π(ω)) = 0,  a.e. s ∈ [t*, τ(ω)].

when s ∈ Z(ω), we conclude that Z(ω) has Lebesgue measure zero for any ω ∈ Ω \ Ω_0.
That is,

∫_{[t*, T]} 1_{{s ≤ τ(ω)}} 1_{{a_s^π(ω) ∈ B_δ(a*)}} ds = 0.
Proof of Theorem 7
Proof
(i) We only prove the “only if” part because the “if” part is straightforward following the
same argument as in the proof of Theorem 6.

So, we assume that e^{−βt} Ĵ(t, X_t^π) + ∫_0^t e^{−βs} [ r(s, X_s^π, a_s^π) − q̂(s, X_s^π, a_s^π) ] ds is an ({F_s}_{s≥0}, P)-martingale.
Hence for any initial state (t, x), we have

E^P[ e^{−βT} Ĵ(T, X_T^π) + ∫_t^T e^{−βs} [ r(s, X_s^π, a_s^π) − q̂(s, X_s^π, a_s^π) ] ds | F_t ] = e^{−βt} Ĵ(t, x).
We integrate over the action randomization with respect to the policy π, and then
obtain

E^{P^W}[ e^{−βT} Ĵ(T, X̃_T^π) + ∫_t^T ∫_A e^{−βs} [ r(s, X̃_s^π, a) − q̂(s, X̃_s^π, a) ] π(a|s, X̃_s^π) da ds | F_t^W ] = e^{−βt} Ĵ(t, x).

This, together with the terminal condition Ĵ(T, x) = h(x) and the constraint (21), yields
that

Ĵ(t, x) = E^{P^W}[ e^{−β(T−t)} h(X̃_T^π) + ∫_t^T ∫_A e^{−β(s−t)} [ r(s, X̃_s^π, a) − γ log π(a|s, X̃_s^π) ] π(a|s, X̃_s^π) da ds | F_t^W ].

Hence Ĵ(t, x) = J(t, x; π) for all (t, x) ∈ [0, T] × R^d. Furthermore, based on Theorem 6,
the martingale condition implies that q̂(t, x, a) = q(t, x, a; π) for all (t, x, a) ∈ [0, T] ×
R^d × A.
(ii) This follows immediately from Theorem 6-(ii).
(iii) Let π ′ ∈ Π be given satisfying the assumption in this part. Define
r̂(t, x, a) := ∂Ĵ/∂t(t, x) + b(t, x, a) ◦ ∂Ĵ/∂x(t, x) + (1/2) σσ^⊤(t, x, a) ◦ ∂²Ĵ/∂x²(t, x) − βĴ(t, x).

Then

e^{−βs} Ĵ(s, X_s^{π′}) − ∫_t^s e^{−βu} r̂(u, X_u^{π′}, a_u^{π′}) du

is an ({F_s}_{s≥0}, P)-local martingale, which follows from applying Itô's lemma to the
above process and arguing similarly to the proof of Theorem 6-(ii). As a result,
∫_t^s e^{−βu} [ r(u, X_u^{π′}, a_u^{π′}) − q̂(u, X_u^{π′}, a_u^{π′}) + r̂(u, X_u^{π′}, a_u^{π′}) ] du is an ({F_s}_{s≥0}, P)-martingale.
The same argument as in the proof of Theorem 6 applies and we conclude

q̂(t, x, a) = r(t, x, a) + r̂(t, x, a)
= ∂Ĵ/∂t(t, x) + b(t, x, a) ◦ ∂Ĵ/∂x(t, x) + (1/2) σσ^⊤(t, x, a) ◦ ∂²Ĵ/∂x²(t, x) − βĴ(t, x) + r(t, x, a)
= ∂Ĵ/∂t(t, x) + H(t, x, a, ∂Ĵ/∂x(t, x), ∂²Ĵ/∂x²(t, x)) − βĴ(t, x)

for every (t, x, a). Now the constraint (21) reads

∫_A [ ∂Ĵ/∂t(t, x) + H(t, x, a, ∂Ĵ/∂x(t, x), ∂²Ĵ/∂x²(t, x)) − βĴ(t, x) − γ log π(a|t, x) ] π(a|t, x) da = 0,

for all (t, x), which, together with the terminal condition Ĵ(T, x) = h(x), is the
Feynman–Kac PDE (11) for Ĵ. Therefore, it follows from the uniqueness of the
solution to (11) that Ĵ ≡ J(·, ·; π). Moreover, it follows from Theorem 6-(iii) that
q̂ ≡ q(·, ·, ·; π).

Finally, if π(a|t, x) = exp{ (1/γ) q̂(t, x, a) } / ∫_A exp{ (1/γ) q̂(t, x, a) } da, then π = Iπ, where I is the map defined in
Theorem 2. This in turn implies that π is the optimal policy and, hence, Ĵ is the optimal value
function.
Proof of Proposition 8
Proof Divide both sides of (24) by γ, take the exponential, and then integrate over A. Comparing
the resulting equation with the exploratory HJB equation (14), we get

γ log ∫_A exp{ (1/γ) q*(t, x, a) } da = 0.

This yields (25). The expression (26) then follows from (13).
Proof of Theorem 9
Proof
where the ∫ (· · ·) du term vanishes due to the definition of q* in (24). Hence (28) is a
martingale.
(ii) The second constraint in (27) implies that π̂*(a|t, x) := exp{ (1/γ) q̂*(t, x, a) } is a probability
density function, and q̂*(t, x, a) = γ log π̂*(a|t, x). So q̂*(t, x, a) satisfies the
second constraint in (21) with respect to the policy π̂*. When (28) is an ({F_s}_{s≥0}, P)-martingale
under the given admissible policy π, it follows from Theorem 7-(iii) that
Ĵ* and q̂* are respectively the value function and the q-function associated with π̂*.
Then the improved policy is

Iπ̂*(a|t, x) := exp{ (1/γ) q̂*(t, x, a) } / ∫_A exp{ (1/γ) q̂*(t, x, a) } da = exp{ (1/γ) q̂*(t, x, a) } = π̂*(a|t, x).

However, Theorem 2 then yields that π̂* is optimal, completing the proof.
Proof of Theorem 10
Proof Let us compute the KL-divergence:

∫_A [ log π′(a|t, x) − (1/γ) H(t, x, a, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) ] π′(a|t, x) da
= D_KL( π′(·|t, x) ∥ exp{ (1/γ) H(t, x, ·, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) } )
≤ D_KL( π(·|t, x) ∥ exp{ (1/γ) H(t, x, ·, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) } )
= ∫_A [ log π(a|t, x) − (1/γ) H(t, x, a, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) ] π(a|t, x) da.

Hence ∫_A [ q(t, x, a; π) − γ log π′(a|t, x) ] π′(a|t, x) da ≥ ∫_A [ q(t, x, a; π) − γ log π(a|t, x) ] π(a|t, x) da.
Following the same argument as in the proof of Theorem 2, we conclude J(t, x; π′) ≥
J(t, x; π).
Proof of Lemma 11
Proof The first statement has been proved in Jia and Zhou (2022b, Lemma 7). We focus
on the proof of the second statement, which is similar to that of Theorem 2.
For the two admissible policies π and π′, we apply Itô's lemma to the value function
under π along the process X̃^{π′} to obtain

J(X̃_u^{π′}; π) − J(X̃_t^{π′}; π) + ∫_t^u ∫_A [ r(X̃_s^{π′}, a) − γ log π′(a|X̃_s^{π′}) ] π′(a|X̃_s^{π′}) da ds
= ∫_t^u ∫_A [ H(X̃_s^{π′}, a, ∂J/∂x(X̃_s^{π′}; π), ∂²J/∂x²(X̃_s^{π′}; π)) − γ log π′(a|X̃_s^{π′}) ] π′(a|X̃_s^{π′}) da ds
  + ∫_t^u ∂J/∂x(X̃_s^{π′}; π) ◦ σ̃( X̃_s^{π′}, π′(·|X̃_s^{π′}) ) dW_s
≥ ∫_t^u ∫_A [ H(X̃_s^{π′}, a, ∂J/∂x(X̃_s^{π′}; π), ∂²J/∂x²(X̃_s^{π′}; π)) − γ log π(a|X̃_s^{π′}) ] π(a|X̃_s^{π′}) da ds
  + ∫_t^u ∂J/∂x(X̃_s^{π′}; π) ◦ σ̃( X̃_s^{π′}, π′(·|X̃_s^{π′}) ) dW_s
= V(π)(u − t) + ∫_t^u ∂J/∂x(X̃_s^{π′}; π) ◦ σ̃( X̃_s^{π′}, π′(·|X̃_s^{π′}) ) dW_s,

where the inequality is due to the same argument as in the proof of Theorem 10, and the
last equality is due to the Feynman–Kac formula (38).
Therefore, after a localization argument, we have

(1/T) E[ J(X̃_T^{π′}; π) − J(x; π) + ∫_0^T ∫_A [ r(X̃_s^{π′}, a) − γ log π′(a|X̃_s^{π′}) ] π′(a|X̃_s^{π′}) da ds | X̃_0^{π′} = x ] ≥ V(π).

Taking the limit T → ∞, we obtain V(π′) ≥ V(π).
Proof of Theorem 12
Proof The proof is parallel to that of Theorems 6 and 7, and is thus omitted.
References
L. C. Baird. Advantage updating. Technical report, Wright Laboratory, Wright-Patterson
Air Force Base, OH 45433-7301, USA, 1993.
J. Fan, Z. Wang, Y. Xie, and Z. Yang. A theoretical analysis of deep Q-learning. In Learning
for Dynamics and Control, pages 486–489. PMLR, 2020.
S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-
based acceleration. In International Conference on Machine Learning, pages 2829–2838.
PMLR, 2016.
X. Han, R. Wang, and X. Y. Zhou. Choquet regularization for reinforcement learning. arXiv
preprint arXiv:2208.08497, 2022.
Y. Jia and X. Y. Zhou. Policy evaluation and temporal-difference learning in continuous
time and space: A martingale approach. Journal of Machine Learning Research, 23(198):
1–55, 2022a.
Y. Jia and X. Y. Zhou. Policy gradient and actor-critic learning in continuous time and
space: Theory and algorithms. Journal of Machine Learning Research, 23(275):1–50,
2022b.
I. Karatzas and S. Shreve. Brownian Motion and Stochastic Calculus, volume 113. Springer,
2014.
J. Kim, J. Shin, and I. Yang. Hamilton-Jacobi deep Q-learning for deterministic continuous-time
systems with Lipschitz continuous controls. Journal of Machine Learning Research,
22(206), 2021.
A. E. Lim and X. Y. Zhou. Mean-variance portfolio selection with random parameters in a
complete market. Mathematics of Operations Research, 27(1):101–120, 2002.
X. Mao. Stochastic Differential Equations and Applications. Woodhead, 2 edition, 2007.
ISBN 978-1-904275-34-3.
F. S. Melo, S. P. Meyn, and M. I. Ribeiro. An analysis of reinforcement learning with
function approximation. In Proceedings of the 25th International Conference on Machine
Learning, pages 664–671, 2008.
S. P. Meyn and R. L. Tweedie. Stability of Markovian processes III: Foster–Lyapunov
criteria for continuous-time processes. Advances in Applied Probability, 25(3):518–548,
1993.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep
reinforcement learning. Nature, 518(7540):529–533, 2015.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and
K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Interna-
tional Conference on Machine Learning, pages 1928–1937. PMLR, 2016.
B. O’Donoghue. Variational Bayesian reinforcement learning with regret bounds. Advances
in Neural Information Processing Systems, 34:28208–28221, 2021.
I. Osband, B. Van Roy, D. J. Russo, and Z. Wen. Deep exploration via randomized value
functions. Journal of Machine Learning Research, 20(124):1–62, 2019.
J. Schulman, X. Chen, and P. Abbeel. Equivalence between policy gradients and soft Q-
learning. arXiv preprint arXiv:1704.06440, 2017.
D. Shah and Q. Xie. Q-learning with nearest neighbors. Advances in Neural Information
Processing Systems, 31, 2018.
Y. Sun. The exact law of large numbers via Fubini extension and characterization of
insurable risks. Journal of Economic Theory, 126(1):31–69, 2006.
C. Tallec, L. Blier, and Y. Ollivier. Making deep Q-learning methods robust to time dis-
cretization. In International Conference on Machine Learning, pages 6096–6104. PMLR,
2019.
W. Tang, Y. P. Zhang, and X. Y. Zhou. Exploratory HJB equations and their convergence.
SIAM Journal on Control and Optimization, 60(6):3191–3216, 2022.
J. Yong and X. Y. Zhou. Stochastic Controls: Hamiltonian Systems and HJB Equations.
Springer, New York, NY, 1999.
X. Y. Zhou. Curse of optimality, and how do we break it. SSRN preprint 3845462, 2021.
X. Y. Zhou and G. Yin. Markowitz’s mean-variance portfolio selection with regime switch-
ing: A continuous-time model. SIAM Journal on Control and Optimization, 42(4):1466–
1482, 2003.
S. Zou, T. Xu, and Y. Liang. Finite-sample analysis for SARSA with linear function
approximation. Advances in Neural Information Processing Systems, 32, 2019.