
Journal of Machine Learning Research 24 (2023) 1-61 Submitted 7/22; Revised 4/23; Published 5/23

q-Learning in Continuous Time

Yanwei Jia jiayw1993@gmail.com


Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong
Shatin, NT, Hong Kong
Xun Yu Zhou xz2574@columbia.edu
Department of Industrial Engineering and Operations Research &
The Data Science Institute
Columbia University
New York, NY 10027, USA

Editor: Marc Bellemare

Abstract
We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) un-
der the entropy-regularized, exploratory diffusion process formulation introduced by Wang
et al. (2020). As the conventional (big) Q-function collapses in continuous time, we con-
sider its first-order approximation and coin the term “(little) q-function”. This function
is related to the instantaneous advantage rate function as well as the Hamiltonian. We
develop a “q-learning” theory around the q-function that is independent of time discretiza-
tion. Given a stochastic policy, we jointly characterize the associated q-function and value
function by martingale conditions of certain stochastic processes, in both on-policy and
off-policy settings. We then apply the theory to devise different actor–critic algorithms
for solving underlying RL problems, depending on whether or not the density function
of the Gibbs measure generated from the q-function can be computed explicitly. One of
our algorithms interprets the well-known Q-learning algorithm SARSA, and another re-
covers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou
(2022b). Finally, we conduct simulation experiments to compare the performance of our
algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized
conventional Q-learning algorithms.
Keywords: continuous-time reinforcement learning, policy improvement, q-function,
martingale, on-policy and off-policy

1 Introduction

A recent series of papers, Wang et al. (2020) and Jia and Zhou (2022a,b), aim at laying
an overarching theoretical foundation for reinforcement learning (RL) in continuous time
with continuous state space (diffusion processes) and possibly continuous action space.
Specifically, Wang et al. (2020) study how to explore strategically (instead of blindly) by
formulating an entropy-regularized, distribution-valued stochastic control problem for dif-
fusion processes. Jia and Zhou (2022a) solve the policy evaluation (PE) problem, namely
to learn the value function of a given stochastic policy, by characterizing it as a martingale
problem. Jia and Zhou (2022b) investigate the policy gradient (PG) problem, that is, to


determine the gradient of a learned value function with respect to the current policy, and
show that PG is mathematically a simpler PE problem and thus solvable by the martingale
approach developed in Jia and Zhou (2022a). Combining these theoretical results naturally
leads to various online and offline actor–critic (AC) algorithms for general model-free (up to
the diffusion dynamics) RL tasks, where one learns value functions and stochastic policies
simultaneously and alternatingly. Many of these algorithms recover and/or interpret some
well-known existing RL algorithms for Markov decision processes (MDPs) that were often
put forward in a heuristic and ad hoc manner.
PG updates a policy along the gradient ascent direction to improve it; so PG is an
instance of the general policy improvement (PI) approach. On the other hand, one of the
earliest and most popular methods for PI is Q-learning (Watkins, 1989; Watkins and Dayan,
1992), whose key ingredient is to learn the Q-function, a function of state and action. The
learned Q-function is then maximized over actions at each state to achieve improvement
of the current policy. The resulting Q-learning algorithms, such as the Q-table, SARSA
and DQN (deep Q-network), are widely used for their simplicity and effectiveness in many
applications.1 Most importantly though, compared with PG, one of the key advantages of
using Q-learning is that it works both on-policy and off-policy (Sutton and Barto, 2018,
Chapter 6).
The Q-function, by definition, is a function of the current state and action, assuming that
the agent takes a particular action at the current time and follows through a given control
policy starting from the next time step. Therefore, it is intrinsically a notion in discrete
time; that is why Q-learning has been predominantly studied for discrete-time MDPs. In a
continuous-time setting, the Q-function collapses to the value function that is independent
of action and hence cannot be used to rank and choose the current actions. Indeed, Tallec
et al. (2019) opine that “there is no Q-function in continuous time”. On the other hand,
one may propose discretizing time to obtain a discretized Q-function and then apply the
existing Q-learning algorithms. However, Tallec et al. (2019) show experimentally that this
approach is very sensitive to time discretization and performs poorly with small time steps.
Kim et al. (2021) take a different approach: they include the action as a state variable in the
continuous-time system by restricting the action process to be absolutely continuous in time
with a bounded growth rate. Thus Q-learning becomes a policy evaluation problem with
the state–action pair as the new state variable. However, they consider only deterministic
dynamic systems and discretize upfront the continuous-time problem. Crucially, that the
action must be absolutely continuous is unpractically restrictive because optimal actions
are often only measurable in time and have unbounded variations (such as the bang–bang
controls) even for deterministic systems.
As a matter of fact, the aforementioned series of papers, Wang et al. (2020) and Jia
and Zhou (2022a,b), are characterized by carrying out all the theoretical analysis within
the continuous setting and taking observations at discrete times (for computing the to-

1. Q-learning is typically combined with various forms of function approximations. For example, linear
function approximations (Melo et al., 2008; Zou et al., 2019), kernel-based nearest neighbor regression
(Shah and Xie, 2018), or deep neural networks (Mnih et al., 2015; Fan et al., 2020). Q-learning is not
necessarily restricted to finite action spaces in the literature; for example, Q-learning with continuous action
spaces is studied in Gu et al. (2016). However, Duan et al. (2016) report that the standard
Q-learning algorithm may become less efficient for continuous action spaces.


tal reward over time) only when implementing the algorithms. Some advantages of this
approach, compared with discretizing time upfront and then applying existing MDP re-
sults, are discussed extensively in Doya (2000); Wang et al. (2020); Jia and Zhou (2022a,b).
More importantly, the pure continuous-time approach minimizes or eliminates the impact
of time-discretization on learning which, as discussed above, becomes critical especially for
continuous-time Q-learning.
Now, if we are to study Q-learning strictly within the continuous-time framework with-
out time-discretization or action restrictions, then the first question is what a proper Q-
function should be. When time is discretized, Baird (1993) and Mnih et al. (2016) define
the so-called advantage function that is the difference in value between taking a particular
action versus following a given policy at any state; that is, the difference between the state–
action value and the state value. Baird (1993, 1994) and Tallec et al. (2019) notice that such
an advantage function can be properly scaled with respect to the size of time discretization
whose continuous-time limit exists, called the (instantaneous) advantage rate function and
can be learned. The advantage updating performs better than conventional Q-learning,
as shown numerically in Tallec et al. (2019). However, these papers again consider only
deterministic dynamic systems and do not fully develop a theory of this rate function. Here
in the present paper, we attack the problem directly from the continuous-time perspective.
In particular, we rename the advantage rate function as the (little) q-function.2 This q-
function in the finite-time episodic setting is a function of the time–state–action triple under
a given policy, analogous to the conventional Q-function. However, it is a continuous-time
notion because it does not depend on time-discretization. This feature is a vital advantage
for learning algorithm design as it avoids the sensitivity with respect to the observation and
intervention frequency (the step size of time-discretization).
For the entropy-regularized, exploratory setting for diffusion processes (first formulated
in Wang et al. 2020) studied in the paper, the q-function of a given policy turns out to
be the Hamiltonian that includes the infinitesimal generator of the dynamic system and
the instantaneous reward function (see Yong and Zhou 1999 for details), plus the temporal
dispersion that consists of the time-derivative of the value function and the depreciation
from discounting. The paper aims to develop a comprehensive q-learning theory around
this q-function and accordingly design alternative RL algorithms other than the PG-based
AC algorithms in Jia and Zhou (2022b). There are three main questions. The first is to
characterize the q-function. For discrete-time MDPs, the PE method is used to learn both
the value function and the Q-function because both functions are characterized by similar
Bellman equations. By contrast, in the continuous-time diffusion setting, a Bellman equa-
tion is only available for the value function (also known as the Feynman–Kac formula, which
yields that the value function solves a second-order linear partial differential equation). The
second question is to establish a policy improvement theorem based on the q-function and
to design corresponding AC algorithms in a sample/data-driven, model-free fashion. The
third question is whether the capability of both on- and off-policy learning, a key advantage
of Q-learning, is intact in the continuous-time setting.

2. We use the lower case letter “q” to denote this rate function, much in the same way as the typical use
of the lower case letter f to denote a probability density function which is the first-order derivative of
the cumulative distribution function usually denoted by the upper case letter F .


We provide complete answers to all these questions. First, given a stochastic policy
along with its (learned) value function, the corresponding q-function is characterized by
the martingale condition of a certain stochastic process with respect to an enlarged filtra-
tion (information field) that includes both the original environmental noise and the action
randomization. Alternatively, the value function and the q-function can be jointly deter-
mined by the martingality of the same process. These results suggest that the martingale
perspective for PE and PG in Jia and Zhou (2022a,b) continues to work for q-learning. In
particular, we devise several algorithms to learn these functions in the same way as the
martingale conditions are employed to generate PE and PG algorithms in Jia and Zhou
(2022a,b). Interestingly, a temporal–difference (TD) learning algorithm designed from this
perspective recovers and, hence, interprets a version of the well-known SARSA algorithm in
the discrete-time Q-learning literature. Moreover, inspired by these continuous-time mar-
tingale characterizations, in Appendix A we present martingale conditions for Q-learning
of discrete-time MDPs, which are new to our best knowledge. This demonstrates that the
continuous-time RL study may offer new perspectives even in discrete time. Second, we
prove that a Gibbs sampler with a properly scaled current q-function (independent of time
discretization) as the exponent of its density function improves upon the current policy.
This PI result translates into a policy-updating algorithm when the normalizing constant
of the Gibbs measure can be explicitly computed. Otherwise, if the normalizing constant
is unavailable, we prove a stronger PI theorem in terms of the Kullback–Leibler (KL) di-
vergence, analogous to a result for discrete-time MDPs in Haarnoja et al. (2018a). One
of the algorithms out of this theorem happens to recover a PG-based AC algorithm in Jia
and Zhou (2022b). Finally, we prove that the aforementioned martingale conditions hold
for both the on-policy and off-policy settings. More precisely, the value function and the
q-function associated with a given target policy can be learned based on a dataset generated
by either the target policy itself or an arbitrary behavior policy. So, q-learning indeed works
off-policy as well.
The rest of the paper is organized as follows. In Section 2 we review Wang et al.
(2020)’s entropy-regularized, exploratory formulation for RL in continuous time and space,
and present some useful preliminary results. In Section 3, we establish the q-learning theory,
including the motivation and definition of the q-function and its on-/off-policy martingale
characterizations. q-learning algorithms are developed, depending on whether or not the
normalizing constant of the Gibbs measure is available, in Section 4 and Section 5, respec-
tively. Extension to ergodic tasks is presented in Section 6. In Section 7, we illustrate the
proposed algorithms and compare them with PG-based algorithms and time-discretized,
conventional Q-learning algorithms on two specific examples with simulated data.3 Finally,
Section 8 concludes. The Appendix contains various supplementary materials and proofs
of statements in the main text.

2 Problem Formulation and Preliminaries


Throughout this paper, by convention all vectors are column vectors unless otherwise speci-
fied, and Rk is the space of all k-dimensional vectors (hence k × 1 matrices). We use Sk++ to

3. The code to reproduce our simulation studies is publicly available at https://www.dropbox.com/sh/34cgnupnuaix15l/AAAj2yQYfNCOtPUc1_7VhbkIa?dl=0.


denote all the k × k symmetric and positive definite matrices. Given two matrices A and B
of the same size, denote by A ◦ B their inner product, by |A| the Euclidean/Frobenius norm
of A, by det A the determinant when A is a square matrix, and by A⊤ the transpose of any
matrix A. For A ∈ Sk++, we write √A = U D1/2 U⊤, where A = U DU⊤ is its eigenvalue
decomposition with U an orthogonal matrix, D a diagonal matrix, and D1/2 the diagonal
matrix whose entries are the square root of those of D. We use f = f (·) to denote the
function f , and f (x) to denote the function value of f at x. We denote by N (µ, Σ) the
probability density function of the multivariate normal distribution with mean vector µ and
covariance matrix Σ. Finally, for any stochastic process X = {Xs , s ≥ 0}, we denote by
{FsX }s≥0 the natural filtration generated by X.

2.1 Classical model-based formulation


For readers’ convenience, we first recall the classical, model-based stochastic control formu-
lation.
Let d, n be given positive integers, T > 0, and b : [0, T ] × Rd × A 7→ Rd and σ :
[0, T ] × Rd × A 7→ Rd×n be given functions, where A ⊂ Rm is the action/control space.
The classical stochastic control problem is to control the state (or feature) dynamics gov-
erned by a stochastic differential equation (SDE), defined on a filtered probability space
(Ω, F, P^W; {F_s^W}_{s≥0}) along with a standard n-dimensional Brownian motion W = {W_s, s ≥ 0}:

    dX_s^a = b(s, X_s^a, a_s)ds + σ(s, X_s^a, a_s)dW_s,  s ∈ [0, T],    (1)
where as stands for the agent’s action at time s.
The goal of a stochastic control problem is, for each initial time–state pair (t, x) ∈
[0, T ) × Rd of (1), to find the optimal {FsW }s≥0 -progressively measurable (continuous-time)
sequence of actions a = {as , t ≤ s ≤ T } – also called the optimal control or optimal strategy
– that maximizes the expected total discounted reward:
    E^{P^W}[ ∫_t^T e^{-β(s-t)} r(s, X_s^a, a_s)ds + e^{-β(T-t)} h(X_T^a) | X_t^a = x ],    (2)

where r is an (instantaneous) reward function (i.e., the expected rate of reward conditioned
on time, state, and action), h is the lump-sum reward function applied at the end of the
planning period T , and β ≥ 0 is a constant discount factor that measures the time-value of
the payoff or the impatience level of the agent.
Note in the above, the state process X a = {Xsa , t ≤ s ≤ T } also depends on (t, x).
However, to ease notation, here and henceforth we use X a instead of X t,x,a = {Xst,x,a , t ≤
s ≤ T } to denote the solution to SDE (1) with initial condition Xta = x whenever no
ambiguity may arise.
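For intuition, the controlled dynamics (1) and the criterion (2) can be approximated numerically by an Euler–Maruyama discretization plus Monte Carlo averaging. The following minimal Python sketch assumes a hypothetical one-dimensional model and a simple feedback rule purely for illustration (the particular b, σ, r, h below are not from the paper):

```python
import numpy as np

# Hypothetical one-dimensional model primitives (illustrative assumptions only).
b = lambda t, x, a: -0.5 * x + a            # drift b(t, x, a)
sigma = lambda t, x, a: 0.2                  # diffusion coefficient sigma(t, x, a)
r = lambda t, x, a: -(x**2 + 0.1 * a**2)     # running reward rate r(t, x, a)
h = lambda x: -x**2                          # lump-sum terminal reward h(x)
beta, T = 0.1, 1.0                           # discount factor and horizon

def mc_value(policy, t0=0.0, x0=1.0, n_paths=10_000, n_steps=100, seed=0):
    """Euler--Maruyama simulation of (1) and Monte Carlo estimate of (2)."""
    rng = np.random.default_rng(seed)
    dt = (T - t0) / n_steps
    x = np.full(n_paths, x0)
    total = np.zeros(n_paths)
    for k in range(n_steps):
        t = t0 + k * dt
        a = policy(t, x)
        total += np.exp(-beta * (t - t0)) * r(t, x, a) * dt
        x = x + b(t, x, a) * dt + sigma(t, x, a) * rng.normal(0.0, np.sqrt(dt), n_paths)
    total += np.exp(-beta * (T - t0)) * h(x)
    return total.mean()

print(mc_value(lambda t, x: -0.5 * x))       # value of a simple linear feedback rule
```

In the RL formulation below, such a simulator plays the role of the environment, whose primitives are unknown to the learning agent.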
The (generalized) Hamiltonian is a function H : [0, T ] × Rd × A × Rd × Rd×d → R
associated with problem (1)–(2) defined as (Yong and Zhou, 1999):
    H(t, x, a, p, q) = b(t, x, a) ◦ p + (1/2) σσ^⊤(t, x, a) ◦ q + r(t, x, a).    (3)
We make the same assumptions as in Jia and Zhou (2022b) to ensure theoretically the
well-posedness of the stochastic control problem (1)–(2).


Assumption 1 The following conditions for the state dynamics and reward functions hold
true:

(i) b, σ, r, h are all continuous functions in their respective arguments;

(ii) b, σ are uniformly Lipschitz continuous in x, i.e., for φ ∈ {b, σ}, there exists a
constant C > 0 such that

|φ(t, x, a) − φ(t, x′ , a)| ≤ C|x − x′ |, ∀(t, a) ∈ [0, T ] × A, ∀x, x′ ∈ Rd ;

(iii) b, σ have linear growth in x, i.e., for φ ∈ {b, σ}, there exists a constant C > 0 such
that
|φ(t, x, a)| ≤ C(1 + |x|), ∀(t, x, a) ∈ [0, T ] × Rd × A;

(iv) r and h have polynomial growth in (x, a) and x respectively, i.e., there exist constants
C > 0 and µ ≥ 1 such that

|r(t, x, a)| ≤ C(1 + |x|µ + |a|µ ), |h(x)| ≤ C(1 + |x|µ ), ∀(t, x, a) ∈ [0, T ] × Rd × A.

Classical model-based stochastic control theory has been well developed (e.g., Fleming
and Soner 2006 and Yong and Zhou 1999) to solve the above problem, assuming that the
functional forms of b, σ, r, h are all given and known. A centerpiece of the theory is the
Hamilton–Jacobi–Bellman (HJB) equation

    ∂V*/∂t(t, x) + sup_{a∈A} H(t, x, a, ∂V*/∂x(t, x), ∂²V*/∂x²(t, x)) − βV*(t, x) = 0,  V*(T, x) = h(x).    (4)

Here, ∂V*/∂x ∈ R^d is the gradient and ∂²V*/∂x² ∈ R^{d×d} is the Hessian matrix.

Under proper conditions, the unique solution (possibly in the sense of viscosity solution)
to (4) is the optimal value function to the problem (1)–(2), i.e.,
    V*(t, x) = sup_{a_s, t≤s≤T} E^{P^W}[ ∫_t^T e^{-β(s-t)} r(s, X_s^a, a_s)ds + e^{-β(T-t)} h(X_T^a) | X_t^a = x ].

Moreover, the following function, which maps a time–state pair to an action:

    a*(t, x) = arg sup_{a∈A} H(t, x, a, ∂V*/∂x(t, x), ∂²V*/∂x²(t, x))    (5)

is the optimal (non-stationary, feedback) control policy of the problem. In view of the
definition of the Hamiltonian (3), the maximization in (5) indicates that, at any given time
and state, one ought to maximize a suitably weighted average of the (myopic) instantaneous
reward and the risk-adjusted, (long-term) positive impact on the system dynamics. See
Yong and Zhou (1999) for a detailed account of this theory and many discussions on its
economic interpretations and implications.
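As a simple illustration of the maximization in (5), consider a hypothetical linear–quadratic specification (not a case analyzed in this paper): b(t, x, a) = Ax + Ba, σ(t, x, a) ≡ σ constant, and r(t, x, a) = −(1/2)(x^⊤Qx + a^⊤Ra) with R ∈ S^m_{++}. Then

    H(t, x, a, p, q) = (Ax + Ba)^⊤ p + (1/2) σσ^⊤ ◦ q − (1/2)(x^⊤Qx + a^⊤Ra),

and the first-order condition B^⊤p − Ra = 0 gives

    a*(t, x) = R^{-1} B^⊤ ∂V*/∂x(t, x),

so the optimal feedback is linear in the gradient of the value function, as is familiar from LQ control.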


2.2 Exploratory formulation in reinforcement learning


We now present the RL formulation of the problem to be studied in this paper. In the RL
setting, the agent has partial or simply no knowledge about the environment (i.e. the func-
tions b, σ, r, h). What she can do is “trial and error” – to try a (continuous-time) sequence of
actions a = {as , t ≤ s ≤ T }, observe the corresponding state process X a = {Xsa , t ≤ s ≤ T }
along with both a stream of discounted running rewards {e−β(s−t) r(s, Xsa , as ), t ≤ s ≤ T }
and a discounted, end-of-period lump-sum reward e−β(T −t) h(XTa ) where β is a given, known
discount factor, and continuously update and improve her actions based on these observa-
tions.4
A critical question is how to strategically generate these trial-and-error sequences of
actions. The idea is randomization; namely, the agent devises and employs a stochastic
policy, which is a probability distribution on the action space, to generate actions according
to the current time–state pair. It is important to note that such randomization itself is
independent of the underlying Brownian motion W , the random source of the original control
problem that stands for the environmental noises. Specifically, assume the probability
space is rich enough to support uniformly distributed random variables on [0, 1] that are
independent of W, and then such a uniform random variable can be used to generate other
random variables with density functions. Let {Zt , 0 ≤ t ≤ T } be a process of mutually
independent copies of a uniform random variable on [0, 1], the construction of which requires
a suitable extension of probability space; see Sun (2006) for details. We then further expand
the filtered probability space to (Ω, F, P; {Fs }s≥0 ) where Fs = FsW ∨ σ(Zt , 0 ≤ t ≤ s) and
the probability measure P, now defined on FT , is an extension from PW (i.e. the two
probability measures coincide when restricted to FTW ).
Let π : (t, x) ∈ [0, T ] × Rd 7→ π(·|t, x) ∈ P(A) be a given (feedback) policy, where P(A)
is a suitable collection of probability density functions.5 At each time s, an action as is
generated or sampled from the distribution π(·|s, Xs ).
Fix a stochastic policy π and an initial time–state pair (t, x). Consider the following
SDE
    dX_s^π = b(s, X_s^π, a_s^π)ds + σ(s, X_s^π, a_s^π)dW_s,  s ∈ [t, T];  X_t^π = x    (6)

defined on (Ω, F, P; {F_s}_{s≥0}), where a^π = {a_s^π, t ≤ s ≤ T} is an {F_s}_{s≥0}-progressively
measurable action process generated from π. (6) is an SDE with random coefficients,
whose well-posedness (i.e., existence and uniqueness of solution) is established in Yong and
Zhou (1999, Chapter 1, Theorem 6.16) under Assumption 1. For a fixed a^π, the unique solution
to (6), X^π = {X_s^π, t ≤ s ≤ T}, is the sample state process corresponding to a^π that solves
(1).6 Moreover, following Wang et al. (2020), we add an entropy regularizer to the reward

4. This procedure applies to both the offline and online settings. In the former, the agent can repeatedly
try different sequences of actions over the full time period [0, T ], record the corresponding state processes
and payoffs, and then learn and update the policy based on the resulting dataset. In the latter, the agent
updates the actions as she goes, based on all the up-to-date historical observations.
5. Here we assume that the action space A is continuous and randomization is restricted to those distribu-
tions that have density functions. The analysis and results of this paper can be easily extended to the
cases of discrete action spaces and/or randomization with probability mass functions.
6. Here, X^π also depends on the specific copy a^π sampled from π; however, to ease notation we write X^π
instead of X^{π,a^π}.


function to encourage exploration (represented by the stochastic policy), leading to


    J(t, x; π) = E^P[ ∫_t^T e^{-β(s-t)} [r(s, X_s^π, a_s^π) − γ log π(a_s^π|s, X_s^π)] ds
                      + e^{-β(T-t)} h(X_T^π) | X_t^π = x ],    (7)

where EP is the expectation with respect to both the Brownian motion and the action
randomization, and γ ≥ 0 is a given weighting parameter on exploration also known as the
temperature parameter.
Wang et al. (2020) consider the following SDE
 
    dX_s = b̃(s, X_s, π(·|s, X_s))ds + σ̃(s, X_s, π(·|s, X_s))dW_s,  s ∈ [t, T];  X_t = x,    (8)

where

    b̃(s, x, π(·)) = ∫_A b(s, x, a)π(a)da,   σ̃(s, x, π(·)) = ( ∫_A σσ^⊤(s, x, a)π(a)da )^{1/2}.

Intuitively, based on the law of large numbers, the solution of (8), denoted by {X̃_s^π, t ≤ s ≤
T }, is the limit of the average of the sample trajectories X π over randomization (i.e., copies
of π). Rigorously, it follows from the property of Markovian projection due to Brunick and
Shreve (2013, Theorem 3.6) that Xsπ and X̃sπ have the same distribution for each s ∈ [t, T ].
Hence, the value function (7) is identical to
    J(t, x; π) = E^{P^W}[ ∫_t^T e^{-β(s-t)} R̃(s, X̃_s^π, π(·|s, X̃_s^π))ds + e^{-β(T-t)} h(X̃_T^π) | X̃_t^π = x ],    (9)

where R̃(s, x, π) = ∫_A [r(s, x, a) − γ log π(a)]π(a)da.
The function J(·, ·; π) is called the value function of the policy π, and the task of RL is
to find
    J*(t, x) = max_{π∈Π} J(t, x; π),    (10)

where Π stands for the set of admissible (stochastic) policies. The following gives the precise
definition of admissible policies.

Definition 1 A policy π = π(·|·, ·) is called admissible if

(i) π(·|t, x) ∈ P(A), supp π(·|t, x) = A for every (t, x) ∈ [0, T ] × Rd , and π(a|t, x) :
(t, x, a) ∈ [0, T ] × Rd × A → R is measurable;

(ii) π(a|t, x) is continuous in (t, x) and uniformly Lipschitz continuous in x in the total
variation distance, i.e., ∫_A |π(a|t, x) − π(a|u, x′)|da → 0 as (u, x′) → (t, x), and there
is a constant C > 0 independent of (t, a) such that

    ∫_A |π(a|t, x) − π(a|t, x′)|da ≤ C|x − x′|,  ∀x, x′ ∈ R^d;


(iii) For any given α > 0, the entropy of π and its α-moment have polynomial growth in x,
i.e., there are constants C = C(α) > 0 and µ′ = µ′(α) ≥ 1 such that
|∫_A −log π(a|t, x) π(a|t, x)da| ≤ C(1 + |x|^{µ′}), and ∫_A |a|^α π(a|t, x)da ≤ C(1 + |x|^{µ′}), ∀(t, x).

Under Assumption 1 along with (i) and (ii) in Definition 1, Jia and Zhou (2022b, Lemma
2) show that the SDE (8) admits a unique strong solution since its coefficients are locally
Lipschitz continuous and have linear growth; see Mao (2007, Chapter 2, Theorem 3.4) for
the general result. In addition, the growth condition (iii) guarantees that the new payoff
function (9) is finite. More discussions on the conditions required in the above definition
can be found in Jia and Zhou (2022b). Moreover, Definition 1 only imposes conditions
on the policy itself; hence they can be verified directly.
Note that the problem (8)–(9) is mathematically equivalent to the problem (6)–(7).
Nevertheless, they serve different purposes in our study: the former provides a framework
for theoretical analysis of the value function, while the latter directly involves observable
samples for algorithm design.

2.3 Some useful preliminary results


Lemma 2 in Jia and Zhou (2022b) yields that the value function of a given admissible policy
π = π(·|·, ·) ∈ Π satisfies the following PDE

    ∂J/∂t(t, x; π) + ∫_A [ H(t, x, a, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) − γ log π(a|t, x) ] π(a|t, x)da
        − βJ(t, x; π) = 0,    (11)
    J(T, x; π) = h(x).

This is a version of the celebrated Feynman–Kac formula in the current RL setting. Under
Assumption 1 as well as the admissibility conditions in Definition 1, the PDE (11) admits a
unique viscosity solution among polynomially growing functions (Beck et al., 2021, Theorem
1.1).
On the other hand, Tang et al. (2022, Section 5) derive the following exploratory HJB
equation for the problem (8)–(9) satisfied by the optimal value function:

    ∂J*/∂t(t, x) + sup_{π∈P(A)} ∫_A [ H(t, x, a, ∂J*/∂x(t, x), ∂²J*/∂x²(t, x)) − γ log π(a) ] π(a)da − βJ*(t, x) = 0,
    J*(T, x) = h(x).    (12)
Moreover, the optimal (stochastic) policy is a Gibbs measure or Boltzmann distribution:

    π*(·|t, x) ∝ exp{ (1/γ) H(t, x, ·, ∂J*/∂x(t, x), ∂²J*/∂x²(t, x)) }

or, after normalization,

    π*(a|t, x) = exp{ (1/γ) H(t, x, a, ∂J*/∂x(t, x), ∂²J*/∂x²(t, x)) } / ∫_A exp{ (1/γ) H(t, x, a, ∂J*/∂x(t, x), ∂²J*/∂x²(t, x)) }da.    (13)


This result shows that one should use a Gibbs sampler in general to generate trial-and-error
strategies to explore the environment when the regularizer is chosen to be entropy.7
In the special case when the system dynamics are linear in action and payoffs are
quadratic in action, the Hamiltonian is a quadratic function of action and Gibbs thus
specializes to Gaussian under mild conditions.
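To see this concretely, suppose (as a generic sketch, with M, m, c denoting model-dependent coefficients introduced only for this illustration) that the Hamiltonian is quadratic in the action:

    H(t, x, a, ∂J*/∂x(t, x), ∂²J*/∂x²(t, x)) = −(1/2) a^⊤ M(t, x) a + a^⊤ m(t, x) + c(t, x),   M(t, x) ∈ S^m_{++}.

Completing the square in (13) then gives

    π*(·|t, x) = N( M(t, x)^{-1} m(t, x), γ M(t, x)^{-1} ),

a Gaussian whose mean is the Hamiltonian maximizer and whose covariance is proportional to the temperature parameter γ.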
Plugging (13) into (12) to replace the supremum operator therein leads to the following
equivalent form of the exploratory HJB equation

    ∂J*/∂t(t, x) + γ log ∫_A exp{ (1/γ) H(t, x, a, ∂J*/∂x(t, x), ∂²J*/∂x²(t, x)) }da − βJ*(t, x) = 0,    (14)
    J*(T, x) = h(x).

More theoretical properties regarding (14) and consequently the problem (6)–(7) including
its well-posedness can be found in Tang et al. (2022).
The Gibbs measure can also be used to derive a policy improvement theorem. Wang and
Zhou (2020) prove such a theorem in the context of continuous-time mean–variance port-
folio selection, which is essentially a linear–quadratic (LQ) control problem. The following
extends that result to the general case.
Theorem 2 For any given π ∈ Π, define π′(·|t, x) ∝ exp{ (1/γ) H(t, x, ·, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) }.
If π′ ∈ Π, then

    J(t, x; π′) ≥ J(t, x; π).

Moreover, if the following map

    I(π) = exp{ (1/γ) H(t, x, ·, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) } / ∫_A exp{ (1/γ) H(t, x, a, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) }da,   π ∈ Π,

has a fixed point π*, then π* is the optimal policy.

At this point, Theorem 2 remains a theoretical result that cannot be directly applied
to learning procedures, because the Hamiltonian H(t, x, a, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) depends
on the knowledge of the model parameters which we do not have in the RL context. To
develop implementable algorithms, we turn to Q-learning.

3 q-Function in Continuous Time: The Theory


This section is the theoretical foundation of the paper, with the analysis entirely in con-
tinuous time. We start with defining a Q-function parameterized by a time step ∆t > 0,
and then motivate the notion of q-function that is independent of ∆t. We further provide
martingale characterizations of the q-function.
7. We stress here that the Gibbs sampler is due to the entropy-regularizer, and it is possible to have other
types of distribution for exploration. For example, Han et al. (2022) propose a family of regularizers
that lead to exponential, uniform, and ε-greedy exploratory policies. O’Donoghue (2021); Gao et al.
(2020) study how to choose time and state dependent temperature parameters. One may also carry out
exploration via randomizing value function instead of policies; see e.g. Osband et al. (2019).


3.1 Q-function
Tallec et al. (2019) consider a continuous-time MDP and then discretize it upfront to a
discrete-time MDP with time discretization δt. In that setting, the authors argue that “there
is no Q-function in continuous time” because the Q-function collapses to the value function
when δt is infinitesimally small. Here, we take an entirely different approach that does
not involve time discretization. Incidentally, by comparing the form of the Gibbs measure
(13) with that of the widely employed Boltzmann exploration for learning in discrete-time
MDPs, Gao et al. (2020) and Zhou (2021) conjecture that the continuous-time counterpart
of the Q-function is the Hamiltonian. Our approach will also provide a rigorous justification
of, and indeed a refinement of, this conjecture.
Given π ∈ Π and (t, x, a) ∈ [0, T ) × Rd × A, consider a “perturbed” policy of π, denoted
by π̃, as follows: It takes the action a ∈ A on [t, t + ∆t) where ∆t > 0, and then follows π
on [t + ∆t, T ]. The corresponding state process X π̃ , given Xtπ̃ = x, can be broken into two
pieces. On [t, t + ∆t), it is X a which is the solution to

    dX_s^a = b(s, X_s^a, a)ds + σ(s, X_s^a, a)dW_s,  s ∈ [t, t + ∆t);  X_t^a = x,

while on [t+∆t, T], it is X^π following (6) but with the initial time–state pair (t+∆t, X_{t+∆t}^a).
With ∆t > 0 fixed, we introduce the (∆t-parameterized) Q-function, denoted by Q∆t (t, x, a; π),
to be the expected reward drawn from the perturbed policy, π̃:8

    Q_∆t(t, x, a; π)
    = E^P[ ∫_t^{t+∆t} e^{-β(s-t)} r(s, X_s^a, a)ds
           + ∫_{t+∆t}^T e^{-β(s-t)} [r(s, X_s^π, a_s^π) − γ log π(a_s^π|s, X_s^π)]ds + e^{-β(T-t)} h(X_T^π) | X_t^π̃ = x ].

Recall that our formulation (7) includes an entropy regularizer term that incentivizes
exploration using stochastic policies. However, in defining Q∆t (t, x, a; π) above we do not
include such a term on [t, t + ∆t) for π̃ because a deterministic constant action a is applied
whose entropy is excluded.
The following proposition provides an expansion of this Q-function in ∆t.

Proposition 3 We have

    Q_∆t(t, x, a; π)
    = J(t, x; π) + [ ∂J/∂t(t, x; π) + H(t, x, a, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) − βJ(t, x; π) ] ∆t + o(∆t).    (15)

So, this ∆t-based Q-function includes three terms. The leading term is the value function
J, which is independent of the action a taken. This is natural because we only apply a for
a time window of length ∆t and hence the impact of this action is only of the order ∆t.
8. In the existing literature, Q-learning with entropy regularization is often referred to as the “soft Q-
learning” with the associated “soft Q-function”. A review of and more discussions on soft Q-learning in
discrete time can be found in Appendix A as well as in Haarnoja et al. (2018b); Schulman et al. (2017).


Next, the first-order term of ∆t is the Hamiltonian plus the temporal change of the value
function (consisting of the derivative of the value function in time and the discount term,
both of which would disappear for stationary and non-discounted problems), in which only
the Hamiltonian depends on the action a. It is interesting to note that these are the same
terms in the Feynman–Kac PDE (11), less the entropy term. Finally, the higher-order
residual term, o(∆t), comes from the approximation error of the integral over [t, t + ∆t] and
can be ignored when such errors are aggregated as ∆t gets smaller.
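For intuition, here is a heuristic sketch of where (15) comes from (the rigorous proof is given in the Appendix). By the tower property, conditioning on the evolution after t + ∆t,

    Q_∆t(t, x, a; π) = E^P[ ∫_t^{t+∆t} e^{-β(s-t)} r(s, X_s^a, a)ds + e^{-β∆t} J(t+∆t, X_{t+∆t}^a; π) | X_t^a = x ].

Applying Itô's formula to J(·, ·; π) along X^a on [t, t+∆t] and using e^{-β∆t} = 1 − β∆t + o(∆t) yields

    Q_∆t(t, x, a; π) = J(t, x; π) + [ ∂J/∂t(t, x; π) + b(t, x, a) ◦ ∂J/∂x(t, x; π) + (1/2) σσ^⊤(t, x, a) ◦ ∂²J/∂x²(t, x; π) + r(t, x, a) − βJ(t, x; π) ] ∆t + o(∆t),

which is exactly (15) once the Hamiltonian (3) is recognized.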
Proposition 3 yields that this Q-function also collapses to the value function when
∆t → 0, similar to what Tallec et al. (2019) claim. However, we must emphasize that
our Q-function is not the same as the one used in Tallec et al. (2019), who discretize time
upfront and study the resulting discrete-time MDP. The corresponding Q-function and value
function therein exist based on the discrete-time RL theory, denoted respectively by Qδt , V δt
where δt is the time discretization size. Then Tallec et al. (2019) argue that the advan-
tage function A^δt = Q^δt − V^δt can be properly scaled by δt, leading to the existence of
lim_{δt→0} A^δt/δt. In short, the Q-function and the resulting algorithms in Tallec et al. (2019)
are still in the realm of discrete-time MDPs. By contrast, Q∆t (t, x, a; π) introduced here
is a continuous-time notion. It is the value function of a policy that applies a constant
action in [t, t + ∆t] and follows π thereafter. So ∆t here is a parameter representing the
length of period in which the constant action is taken, not the time discretization size as in
Tallec et al. (2019). Most importantly, our Q-function is just a tool used to introduce the q-
δt
function, the centerpiece of this paper. Having said all these, we recognize that limδt→0 Aδt
as identified by Tallec et al. (2019) is the counterpart of our q-function in their setting, and
that we arrive at this same object in two different ways (discretizing upfront then taking
δt → 0, versus a true continuous-time analysis).

3.2 q-function
Since the leading term in the Q-function, Q∆t (t, x, a; π), coincides with the value function
of π and hence cannot be used to rank actions, we focus on the first-order approxi-
mation, which gives an infinitesimal state–action value. Motivated by this, we define the
“q-function” as follows:

Definition 4 The q-function of the problem (6)–(7) associated with a given policy π ∈ Π
is defined as

    q(t, x, a; π) = ∂J/∂t(t, x; π) + H(t, x, a, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) − βJ(t, x; π),    (16)
        (t, x, a) ∈ [0, T] × R^d × A.

Clearly, this function is the first-order derivative of the Q-function with respect to ∆t,
as an immediate consequence of Proposition 3:

Corollary 5 We have

    q(t, x, a; π) = lim_{∆t→0} [ Q_∆t(t, x, a; π) − J(t, x; π) ] / ∆t.    (17)


Some remarks are in order. First, the q-function is a function of the time–state–action
triple under a given policy, analogous to the conventional Q-function for MDPs; yet it is a
continuous-time notion because it does not depend on any time-discretization. This feature
is a vital advantage for learning algorithm design as Tallec et al. (2019) point out that
the performance of RL algorithms is very sensitive with respect to the time discretization.
Second, the q-function is related to the advantage function in the MDP literature (e.g., Baird
1993; Mnih et al. 2016), which describes the difference between the state–action value and
the state value. Here in this paper, the q-function reflects the instantaneous advantage rate
of a given action at a given time–state pair under a given policy.
We also notice that in (16) only the Hamiltonian depends on a; hence the improved
policy presented in Theorem 2 can also be expressed in terms of the q-function:
    π′(·|t, x) ∝ exp{ (1/γ) H(t, x, ·, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π)) } ∝ exp{ (1/γ) q(t, x, ·; π) }.
Consequently, if we can learn the q-function q(·, ·, ·; π) under any policy π, then it follows
from Theorem 2 that we can improve π by implementing a Gibbs measure over the q-values,
analogous to, say, ε-greedy policy in classical Q-learning (Sutton and Barto, 2018, Chapter
6). Finally, the analysis so far justifies the aforementioned conjecture about the proper
continuous-time version of the Q-function, and provides a theoretical interpretation to the
widely used Boltzmann exploration in RL.
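To make this policy-improvement step concrete: given any estimate of q(t, x, ·; π), the improved policy is the Gibbs measure with density proportional to exp{q/γ}. A minimal Python sketch, assuming a one-dimensional action space discretized to a finite grid for illustration (both the grid and the particular q̂ below are hypothetical):

```python
import numpy as np

def sample_gibbs_action(q_values, actions, gamma, rng):
    """Sample an action from the Gibbs (Boltzmann) measure pi(a) proportional to exp{q(a)/gamma}."""
    logits = q_values / gamma
    logits -= logits.max()                      # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(actions, p=probs)

rng = np.random.default_rng(0)
actions = np.linspace(-1.0, 1.0, 41)            # hypothetical discretized action grid
q_hat = -(actions - 0.3) ** 2                   # hypothetical q(t, x, .) at a fixed (t, x)
a = sample_gibbs_action(q_hat, actions, gamma=0.1, rng=rng)
```

A smaller temperature γ concentrates the sampling around the maximizer of q̂, while a larger γ explores more uniformly.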
The main theoretical results of this paper are martingale characterizations of the q-
function, which can in turn be employed to devise algorithms for learning crucial functions
including the q-function, in the same way as in applying the martingale approach for PE
(Jia and Zhou, 2022a).9
The following first result characterizes the q-function associated with a given policy π,
assuming that its value function has already been learned and known.

Theorem 6 Let a policy π ∈ Π, its value function J and a continuous function q̂ : [0, T ] ×
Rd × A → R be given. Then
(i) q̂(t, x, a) = q(t, x, a; π) for all (t, x, a) ∈ [0, T ] × Rd × A if and only if for all (t, x) ∈
[0, T ] × Rd , the following process
        e^{-βs} J(s, X_s^π; π) + ∫_t^s e^{-βu} [r(u, X_u^π, a_u^π) − q̂(u, X_u^π, a_u^π)]du    (18)

    is an ({F_s}_{s≥0}, P)-martingale, where {X_s^π, t ≤ s ≤ T} is the solution to (6) under π
    with X_t^π = x.

(ii) If q̂(t, x, a) = q(t, x, a; π) for all (t, x, a) ∈ [0, T] × R^d × A, then for any π′ ∈ Π and
    for all (t, x) ∈ [0, T] × R^d, the following process

        e^{-βs} J(s, X_s^{π′}; π) + ∫_t^s e^{-βu} [r(u, X_u^{π′}, a_u^{π′}) − q̂(u, X_u^{π′}, a_u^{π′})]du    (19)

    is an ({F_s}_{s≥0}, P)-martingale, where {X_s^{π′}, t ≤ s ≤ T} is the solution to (6) under π′
    with initial condition X_t^{π′} = x.
9. Some discrete-time counterparts of these results are presented in Appendix A.


(iii) If there exists π′ ∈ Π such that for all (t, x) ∈ [0, T] × R^d, (19) is an ({F_s}_{s≥0}, P)-
    martingale where X_t^{π′} = x, then q̂(t, x, a) = q(t, x, a; π) for all (t, x, a) ∈ [0, T] × R^d × A.

Moreover, in any of the three cases above, the q-function satisfies


    ∫_A [ q(t, x, a; π) − γ log π(a|t, x) ] π(a|t, x)da = 0,  ∀(t, x) ∈ [0, T] × R^d.    (20)

Jia and Zhou (2022a) unify and characterize the learning of value function (i.e., PE) by
martingale conditions of certain stochastic processes. Theorem 6 shows that learning the
q-function again boils down to maintaining the martingality of the processes (18) or (19).
However, there are subtle differences between these martingale conditions. Jia and Zhou
(2022a) consider only deterministic policies so the martingales therein are with respect to
({F_s^W}_{s≥0}, P^W). Jia and Zhou (2022b) extend the policy evaluation to include stochastic
policies but the related martingales are in terms of the averaged state X̃ π and hence also
with respect to ({FsW }s≥0 , PW ). By contrast, the martingality in Theorem 6 is with respect
to ({Ft }t≥0 , P), where the filtration {Ft }t≥0 is the enlarged one that includes the random-
ness for generating actions. Note that the filtration determines the class of test functions
necessary to characterize a martingale through the so-called martingale orthogonality con-
ditions (Jia and Zhou, 2022a). So the above martingale conditions suggest that one ought to
choose test functions dependent on the past and current actions when designing algorithms
to learn the q-function. Moreover, the q-function can be interpreted as a compensator to
guarantee the martingality over this larger information field.10
Theorem 6-(i) informs on-policy learning, namely, learning the q-function of the given
policy π, called the target policy, based on data {(s, Xsπ , aπ s ), t ≤ s ≤ T } generated by π
itself. Nevertheless, one of the key advantages of classical Q-learning for MDPs is that it
also works for off-policy learning, namely, learning the q-function of a given target policy
π based on data generated by a different admissible policy π ′ , called a behavior policy.
On-policy and off-policy reflect two different learning settings depending on the availability
of data and/or the choice of a learning agent. When data generation can be controlled by
the agent, conducting on-policy learning is possible although she could still elect off-policy
learning. By contrast, when data generation is not controlled by the agent and she has to
rely on data under other policies, then it becomes off-policy learning. Theorem 6-(ii) and
-(iii) stipulate that our q-learning in continuous time can also be off-policy.
Finally, (20) is a consistency condition to uniquely identify the q-function. If we only
consider deterministic policies of the form a(·, ·) in which case γ = 0 and π(·|t, x) degenerates
into the Dirac measure concentrating on a(t, x), then (20) reduces to

q(t, x, a(t, x); a) = 0,

which corresponds to equation (23) in Tallec et al. (2019).


The following result strengthens Theorem 6, in the sense that it characterizes the value
function and the q-function under a given policy simultaneously.
10. More precisely, ∫_0^t e^{-βs} q(s, X_s^π, a_s^π; π)ds is the compensator of e^{-βt} J(t, X_t^π; π) + ∫_0^t e^{-βs} r(s, X_s^π, a_s^π)ds.
Recall that a compensator of an adapted stochastic process Yt is a predictable process At such that
Yt − At is a local martingale. Intuitively speaking, the compensator is the drift part of a diffusion
process, extracting the trend of the process.


Theorem 7 Let a policy π ∈ Π, a function Ĵ ∈ C^{1,2}([0, T) × R^d) ∩ C([0, T] × R^d) with
polynomial growth, and a continuous function q̂ : [0, T] × R^d × A → R be given satisfying

    Ĵ(T, x) = h(x),   ∫_A [ q̂(t, x, a) − γ log π(a|t, x) ] π(a|t, x)da = 0,  ∀(t, x) ∈ [0, T] × R^d.    (21)

Then
(i) Ĵ and q̂ are respectively the value function and the q-function associated with π if and
    only if for all (t, x) ∈ [0, T] × R^d, the following process

        e^{-βs} Ĵ(s, X_s^π) + ∫_t^s e^{-βu} [r(u, X_u^π, a_u^π) − q̂(u, X_u^π, a_u^π)]du    (22)

    is an ({F_s}_{s≥0}, P)-martingale, where {X_s^π, t ≤ s ≤ T} is the solution to (6) with
    X_t^π = x.
(ii) If Ĵ and q̂ are respectively the value function and the q-function associated with π,
    then for any π′ ∈ Π and all (t, x) ∈ [0, T] × R^d, the following process

        e^{-βs} Ĵ(s, X_s^{π′}) + ∫_t^s e^{-βu} [r(u, X_u^{π′}, a_u^{π′}) − q̂(u, X_u^{π′}, a_u^{π′})]du    (23)

    is an ({F_s}_{s≥0}, P)-martingale, where {X_s^{π′}, t ≤ s ≤ T} is the solution to (6) under π′
    with initial condition X_t^{π′} = x.
(iii) If there exists π′ ∈ Π such that for all (t, x) ∈ [0, T] × R^d, (23) is an ({F_s}_{s≥0}, P)-
    martingale where X_t^{π′} = x, then Ĵ and q̂ are respectively the value function and the
    q-function associated with π.

Moreover, in any of the three cases above, if it holds further that
π(a|t, x) = exp{ (1/γ) q̂(t, x, a) } / ∫_A exp{ (1/γ) q̂(t, x, a) }da,
then π is the optimal policy and Ĵ is the optimal value function.


Theorem 7 characterizes the value function and the q-function in terms of a single
martingale condition, in each of the on-policy and off-policy settings, which will be the
foundation for designing learning algorithms in this paper. The conditions in (21) ensure
that Ĵ corresponds to the correct terminal payoff function and q̂ corresponds to the soft
q-function with the entropy regularizer. In particular, if the policy is the Gibbs measure
generated by (1/γ) q̂, then Ĵ and q̂ are respectively the value function and the q-function under
the optimal policy. Finally, note the (subtle) difference between Theorem 7 and Theorem
6. There may be multiple (Ĵ, q̂) pairs satisfying the martingale conditions of (22) or (23) in
Theorem 7; so the conditions in (21) are required for identifying the correct value function
and q-function.11 By contrast, these conditions are implied if the correct value function has
already been known and given, as in Theorem 6.
11. If the terminal condition Ĵ(T, x) = h(x) is changed to Ĵ(T, x) = ĥ(x), then (Ĵ, q̂) satisfying the martingale
condition (22) or (23) would be the value function and q-function corresponding to a different learning
task with the pair of running and terminal reward functions being (r, ĥ). If the normalization condition
on q̂ is missing, say ∫_A [ q̂(t, x, a) − γ log π(a|t, x) ] π(a|t, x)da = r̂(t, x) holds instead, then (Ĵ, q̂) would
be the value function and q-function corresponding to the learning task with the running and terminal
reward functions (r − r̂, h).


3.3 Optimal q-function


We now focus on the q-function associated with the optimal policy π ∗ in (13). Based on
Definition 4, it is defined as

    q*(t, x, a) = ∂J*/∂t(t, x) + H(t, x, a, ∂J*/∂x(t, x), ∂²J*/∂x²(t, x)) − βJ*(t, x),    (24)

where J ∗ is the optimal value function satisfying the exploratory HJB equation (14).

Proposition 8 We have

    ∫_A exp{ (1/γ) q*(t, x, a) }da = 1,    (25)

for all (t, x), and consequently the optimal policy π* is

    π*(a|t, x) = exp{ (1/γ) q*(t, x, a) }.    (26)

As seen from its proof (in Appendix C), Proposition 8 is an immediate consequence of
the exploratory HJB equation (14). Indeed, (25) is equivalent to (14) when viewed as an
equation for J ∗ due to (24). However, in terms of q ∗ , satisfying (25) is only a necessary
condition or a minimum requirement for being the optimal q-function, and by no means
sufficient, in the current RL setting. This is because in the absence of knowledge about the
primitives b, σ, r, h, we are unable to determine J ∗ from q ∗ by their relationship (24) which
can be viewed as a PDE for J ∗ . To fully characterize J ∗ as well as q ∗ , we will have to resort
to a martingale condition, as stipulated in the following result.

Theorem 9 Let a function Ĵ* ∈ C^{1,2}([0, T) × R^d) ∩ C([0, T] × R^d) with polynomial growth
and a continuous function q̂* : [0, T] × R^d × A → R be given satisfying

    Ĵ*(T, x) = h(x),   ∫_A exp{ (1/γ) q̂*(t, x, a) }da = 1,  ∀(t, x) ∈ [0, T] × R^d.    (27)

Then
(i) If Ĵ* and q̂* are respectively the optimal value function and the optimal q-function,
    then for any π ∈ Π and all (t, x) ∈ [0, T] × R^d, the following process

        e^{-βs} Ĵ*(s, X_s^π) + ∫_t^s e^{-βu} [r(u, X_u^π, a_u^π) − q̂*(u, X_u^π, a_u^π)]du    (28)

    is an ({F_s}_{s≥0}, P)-martingale, where {X_s^π, t ≤ s ≤ T} is the solution to (6) under
    π with X_t^π = x. Moreover, in this case, π̂*(a|t, x) = exp{ (1/γ) q̂*(t, x, a) } is the optimal
    policy.

(ii) If there exists one π ∈ Π such that for all (t, x) ∈ [0, T] × R^d, (28) is an ({F_s}_{s≥0}, P)-
    martingale, then Ĵ* and q̂* are respectively the optimal value function and the optimal
    q-function.


Theorem 9 lays a foundation for off-policy learning without having to go through it-
erative policy improvement. We emphasize here that to learn the optimal policy and the
optimal q-function, it is infeasible to conduct on-policy learning because π ∗ is unknown and
hence the constraints (21) cannot be checked or enforced. The new constraints in (27)
no longer depend on policies and instead give basic requirements for the candidate value
function and q-function. Maintaining the martingality of (28) under any admissible policy
is equivalent to the optimality of the candidate value function and q-function. Moreover,
Theorem 9-(ii) suggests that we do not need to use data generated by all admissible poli-
cies. Instead, those generated by one – any one – policy should suffice; e.g. we could take
π(a|t, x) = exp{ (1/γ) q̂*(t, x, a) } as the behavior policy.
Finally, let us remark that Theorems 7 and 9 are independent of each other in the
sense that one does not imply the other. One may design different algorithms from them
depending on different cases and needs.

4 q-Learning Algorithms When Normalizing Constant Is Available


This section and the next discuss algorithm design based on the theoretical results in the
previous section. Theorems 7 and 9 provide the theoretical foundation for designing both
on-policy and off-policy algorithms to simultaneously learn and update the value function
(critic) and the policy (actor) with proper function approximations of J and q. In this
section, we present these actor–critic, q-learning algorithms when the normalizing constant
involved in the Gibbs measure is available or computable.12

4.1 q-Learning algorithms


Given a policy π, to learn its associated value function and q-function, we denote by J θ
and q ψ the parameterized function approximators that satisfy the two constraints:
    J^θ(T, x) = h(x),   ∫_A [ q^ψ(t, x, a) − γ log π(a|t, x) ] π(a|t, x)da = 0,    (29)

for all θ ∈ Θ ⊂ R^{L_θ}, ψ ∈ Ψ ⊂ R^{L_ψ}. Then Theorem 2 suggests that the policy can be
improved by

    π^ψ(a|t, x) = exp{ (1/γ) q^ψ(t, x, a) } / ∫_A exp{ (1/γ) q^ψ(t, x, a) }da,    (30)

while by assumption the normalizing constant ∫_A exp{ (1/γ) q^ψ(t, x, a) }da for any ψ ∈ Ψ can be
explicitly computed. Next, we can use the data generated by π^ψ and apply the martingale
condition in Theorem 7 to learn its value function and q-function, leading to an iterative
actor–critic algorithm.

12. Conventional Q-learning or SARSA is often said to be value-based in which the Q-function is a state–
action value function serving as a critic. In q-learning, as Theorem 6 stipulates, the q-function is uniquely
determined by the value function J as its compensator. Hence, q plays a dual role here: it is both a
critic (as it can be derived endogenously from the value function) and an actor (as it derives an improved
policy). This is why we call the q-learning algorithms actor–critic, if with a slight abuse of the term
because the “actor” here is not purely exogenous as with, say, policy gradient.


Alternatively, we may choose to directly learn the optimal value function and q-function
based on Theorem 9. In this case, the approximators J θ and q ψ should satisfy
    J^θ(T, x) = h(x),   ∫_A exp{ (1/γ) q^ψ(t, x, a) }da = 1.    (31)

Note that if the policy π in (29) is taken in the form (30), then the second equation in
(31) is automatically satisfied. Henceforth in this section, we focus on deriving algorithms
based on Theorem 9 and hence always impose the constraints (31). In this case, any approx-
imator of the q-function directly gives that of the policy via π^ψ(a|t, x) = exp{ (1/γ) q^ψ(t, x, a)};
thereby we avoid learning the q-function and the policy separately. Moreover, making use
of (31) typically results in a more specific parametric form of the q-function approximator q^ψ,
potentially facilitating more efficient learning. Here is an example.

Example 1 When the system dynamics are linear in the action a and the reward is quadratic in a,
the Hamiltonian, and hence the q-function, is quadratic in a. So we can parameterize

    q^ψ(t, x, a) = −(1/2) q_2^ψ(t, x) ◦ (a − q_1^ψ(t, x))(a − q_1^ψ(t, x))^⊤ + q_0^ψ(t, x),   (t, x, a) ∈ [0, T] × R^d × R^m,

with q_1^ψ(t, x) ∈ R^m, q_2^ψ(t, x) ∈ S^m_{++} and q_0^ψ(t, x) ∈ R. The corresponding policy is a
multivariate normal distribution for which the normalizing constant can be computed:

    π^ψ(·|t, x) = N( q_1^ψ(t, x), γ (q_2^ψ(t, x))^{-1} ),

with its entropy value being −(1/2) log det q_2^ψ(t, x) + (m/2) log 2πeγ. The second constraint on q^ψ
in (31) then yields q_0^ψ(t, x) = (γ/2) log det q_2^ψ(t, x) − (mγ/2) log 2πγ. This in turn gives rise to a
more specific parametric form

    q^ψ(t, x, a) = −(1/2) q_2^ψ(t, x) ◦ (a − q_1^ψ(t, x))(a − q_1^ψ(t, x))^⊤ + (γ/2) log det q_2^ψ(t, x) − (mγ/2) log 2πγ.
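In code, Example 1 amounts to maintaining the two maps q_1^ψ and q_2^ψ; the constant q_0^ψ is pinned down by the normalization in (31), and both q^ψ and the Gaussian policy are recovered from them. A minimal sketch for a scalar action (m = 1), where the particular parameterizations of q_1^ψ and q_2^ψ are hypothetical placeholders:

```python
import numpy as np

gamma = 0.1                       # temperature parameter

def q1(t, x, psi):                # mean of the Gaussian policy (placeholder parameterization)
    return psi[0] + psi[1] * x

def q2(t, x, psi):                # positive "precision" coefficient (placeholder parameterization)
    return np.exp(psi[2])

def q_func(t, x, a, psi, m=1):
    """q^psi(t, x, a) with q0^psi fixed by the normalization constraint (31)."""
    q0 = 0.5 * gamma * np.log(q2(t, x, psi)) - 0.5 * m * gamma * np.log(2 * np.pi * gamma)
    return -0.5 * q2(t, x, psi) * (a - q1(t, x, psi)) ** 2 + q0

def sample_action(t, x, psi, rng):
    """Sample from pi^psi(.|t, x) = N(q1, gamma * q2^{-1})."""
    return rng.normal(q1(t, x, psi), np.sqrt(gamma / q2(t, x, psi)))

rng = np.random.default_rng(0)
psi = np.array([0.0, -0.5, 0.0])
print(q_func(0.0, 1.0, sample_action(0.0, 1.0, psi, rng), psi))
```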

The next step in algorithm design is to update (θ, ψ) by enforcing the martingale con-
dition stipulated in Theorem 9 and applying the techniques developed in Jia and Zhou
(2022a). A number of algorithms can be developed based on two types of objectives: to
minimize the martingale loss function or to satisfy the martingale orthogonality conditions.
The latter calls for solving a system of equations, for which there are two further differ-
ent techniques: applying stochastic approximation as in the classical temporal–difference
(TD) algorithms, or minimizing a quadratic function in lieu of the system of equations as in
the generalized method of moments (GMM). For the reader's convenience, we summarize these
methods in the q-learning context below.

• Minimize the martingale loss function:


 #2 
Z "
1 P  T −β(T −t)
Z T
ψ ψ ψ ψ ψ ψ
E e h(XTπ ) − J θ (t, Xtπ ) + e−β(s−t) [r(s, Xsπ , aπ ψ π π
s ) − q (s, Xs , as )]ds dt .
2 0 t


This method is intrinsically offline because the loss function involves the whole horizon
[0, T ]; however we are free to choose optimization algorithms to update the parameters
(θ, ψ). For example, we can apply stochastic gradient descent to update

          θ ← θ + α_θ ∫_0^T ∂J^θ/∂θ(t, X_t^{π^ψ}) G_{t:T} dt,

          ψ ← ψ + α_ψ ∫_0^T ∫_t^T e^{-β(s-t)} ∂q^ψ/∂ψ(s, X_s^{π^ψ}, a_s^{π^ψ})ds G_{t:T} dt,

      where G_{t:T} = e^{-β(T-t)} h(X_T^{π^ψ}) − J^θ(t, X_t^{π^ψ}) + ∫_t^T e^{-β(s-t)} [r(s, X_s^{π^ψ}, a_s^{π^ψ}) − q^ψ(s, X_s^{π^ψ}, a_s^{π^ψ})]ds,
and αθ and αψ are the learning rates. We present Algorithm 1 based on this updat-
ing rule. Note that this algorithm is analogous to the classical gradient Monte Carlo
method or TD(1) for MDPs (Sutton and Barto, 2018) because full sample trajectories
are used to compute gradients.

• Choose two different test functions ξt and ζt that are both Ft -adapted vector-valued
stochastic processes, and consider the following system of equations in (θ, ψ):
      E^P[ ∫_0^T ξ_t ( dJ^θ(t, X_t^{π^ψ}) + r(t, X_t^{π^ψ}, a_t^{π^ψ})dt − q^ψ(t, X_t^{π^ψ}, a_t^{π^ψ})dt − βJ^θ(t, X_t^{π^ψ})dt ) ] = 0,

      and

      E^P[ ∫_0^T ζ_t ( dJ^θ(t, X_t^{π^ψ}) + r(t, X_t^{π^ψ}, a_t^{π^ψ})dt − q^ψ(t, X_t^{π^ψ}, a_t^{π^ψ})dt − βJ^θ(t, X_t^{π^ψ})dt ) ] = 0.

      To solve these equations iteratively, we use stochastic approximation to update (θ, ψ)
      either offline by^13

          θ ← θ + α_θ ∫_0^T ξ_t [ dJ^θ(t, X_t^{π^ψ}) + r(t, X_t^{π^ψ}, a_t^{π^ψ})dt − q^ψ(t, X_t^{π^ψ}, a_t^{π^ψ})dt − βJ^θ(t, X_t^{π^ψ})dt ],

          ψ ← ψ + α_ψ ∫_0^T ζ_t [ dJ^θ(t, X_t^{π^ψ}) + r(t, X_t^{π^ψ}, a_t^{π^ψ})dt − q^ψ(t, X_t^{π^ψ}, a_t^{π^ψ})dt − βJ^θ(t, X_t^{π^ψ})dt ],

      or online by

          θ ← θ + α_θ ξ_t [ dJ^θ(t, X_t^{π^ψ}) + r(t, X_t^{π^ψ}, a_t^{π^ψ})dt − q^ψ(t, X_t^{π^ψ}, a_t^{π^ψ})dt − βJ^θ(t, X_t^{π^ψ})dt ],

          ψ ← ψ + α_ψ ζ_t [ dJ^θ(t, X_t^{π^ψ}) + r(t, X_t^{π^ψ}, a_t^{π^ψ})dt − q^ψ(t, X_t^{π^ψ}, a_t^{π^ψ})dt − βJ^θ(t, X_t^{π^ψ})dt ]

      (a Python sketch of a discretized version of this online rule is given after this list).

      Typical choices of test functions are ξ_t = ∂J^θ/∂θ(t, X_t^{π^ψ}) and ζ_t = ∂q^ψ/∂ψ(t, X_t^{π^ψ}, a_t^{π^ψ}), yielding
algorithms that would be closest to TD-style policy evaluation and Q-learning (e.g., in
Tallec et al. 2019) and at the same time belong to the more general semi-gradient meth-
ods (Sutton and Barto, 2018). We present the online and offline q-learning algorithms,
Algorithms 2 and 3 respectively, based on these test functions. We must, however,
13. Here, when implementing in a computer program, dJ θ is the timestep-wise difference in J θ when time
is discretized, or indeed it is the temporal-difference (TD) of the value function approximator J θ .


stress that testing against these two specific functions is theoretically not sufficient
to guarantee the martingale condition. Using them with a rich, large-dimensional
parametric family for J and q makes approximation to the martingale condition finer
and finer. Moreover, the corresponding stochastic approximation algorithms are not
guaranteed to converge in general, and the test functions have to be carefully selected
(Jia and Zhou, 2022a) depending on (J, q). Some of the different test functions leading
to different types of algorithms in the context of policy evaluation are discussed in Jia
and Zhou (2022a).

• Choose the same types of test functions ξt and ζt as above but now minimize the
GMM objective functions:
Z T h i⊤
P θ πψ πψ πψ ψ πψ πψ θ πψ
E ξt dJ (t, Xt ) + r(t, Xt , at )dt − q (t, Xt , at )dt − βJ (t, Xt )dt
0
Z T h i
P θ πψ πψ πψ ψ πψ πψ θ πψ
Aθ E ξt dJ (t, Xt ) + r(t, Xt , at )dt − q (t, Xt , at )dt − βJ (t, Xt )dt ,
0

and
( E^P ∫_0^T ζ_t [ dJ^θ(t, X_t^{π^ψ}) + r(t, X_t^{π^ψ}, a_t^{π^ψ}) dt − q^ψ(t, X_t^{π^ψ}, a_t^{π^ψ}) dt − β J^θ(t, X_t^{π^ψ}) dt ] )^⊤
    A_ψ ( E^P ∫_0^T ζ_t [ dJ^θ(t, X_t^{π^ψ}) + r(t, X_t^{π^ψ}, a_t^{π^ψ}) dt − q^ψ(t, X_t^{π^ψ}, a_t^{π^ψ}) dt − β J^θ(t, X_t^{π^ψ}) dt ] ),

where A_θ ∈ S^{L_θ}_{++} and A_ψ ∈ S^{L_ψ}_{++}. Typical choices of these matrices are A_θ = I_{L_θ} and
A_ψ = I_{L_ψ}, or A_θ = (E^P[∫_0^T ξ_t ξ_t^⊤ dt])^{−1} and A_ψ = (E^P[∫_0^T ζ_t ζ_t^⊤ dt])^{−1}. Again, we refer
the reader to Jia and Zhou (2022a) for discussions on these choices and the connection
with the classical GTD algorithms and the GMM method.
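To make the semi-gradient updates discussed above concrete, the following is a minimal sketch, not the authors' implementation, of one online update step with the test functions ξ_t = ∂J^θ/∂θ and ζ_t = ∂q^ψ/∂ψ, discretized with step size ∆t. The callables J, q and their parameter gradients are hypothetical placeholders supplied by the user.

```python
def online_q_learning_step(theta, psi, t, x, a, r, x_next, dt, beta,
                           J, dJ_dtheta, q, dq_dpsi, alpha_theta, alpha_psi):
    """One online semi-gradient update of (theta, psi).

    J(theta, t, x): parameterized value function approximator.
    q(psi, t, x, a): parameterized q-function approximator.
    dJ_dtheta, dq_dpsi: user-supplied gradients with respect to the parameters.
    """
    # Discretized version of the bracketed term in the online updating rule:
    # dJ^theta + r dt - q^psi dt - beta J^theta dt.
    delta = (J(theta, t + dt, x_next) - J(theta, t, x)
             + r * dt - q(psi, t, x, a) * dt - beta * J(theta, t, x) * dt)

    # Test functions: xi_t = dJ/dtheta, zeta_t = dq/dpsi.
    xi = dJ_dtheta(theta, t, x)
    zeta = dq_dpsi(psi, t, x, a)

    theta = theta + alpha_theta * xi * delta
    psi = psi + alpha_psi * zeta * delta
    return theta, psi
```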

4.2 Connections with SARSA


With a fixed time-step size ∆t, we can define our Q-function, Q_{∆t}(t, x, a; π), by (15),
parameterize it by Q^φ_{∆t}(t, x, a; π), and then apply existing (big) Q-learning algorithms such
as SARSA (cf., Sutton and Barto 2018, Chapter 10) to learn the parameter φ. Now, how
do we compare this approach of ∆t-based Q-learning with that of q-learning?
Equation (15) suggests that Q∆t (t, x, a; π) can be approximated by

Q∆t (t, x, a; π) ≈ J(t, x; π) + q(t, x, a; π)∆t.

Our q-learning method is to learn separately the zeroth-order and first-order terms of
Q∆t (t, x, a; π) in ∆t, and the two terms are in themselves independent of ∆t. Our ap-
proach therefore has a true “continuous-time” nature without having to rely on ∆t or any
time discretization, which theoretically facilitates the analysis carried out in the previous
section and algorithmically mitigates the high sensitivity with respect to time discretiza-
tion. Our approach is similar to that introduced in Baird (1994) and Tallec et al. (2019)
where the value function and the (rescaled) advantage function are approximated separately.


Algorithm 1 Offline–Episodic q-Learning ML Algorithm


Inputs: initial state x0 , horizon T , time step ∆t, number of episodes N , number of mesh
grids K, initial learning rates αθ , αψ and a learning rate schedule function l(·) (a function
of the number of episodes), functional forms of parameterized value function J θ (·, ·) and
q-function q ψ (·, ·, ·) satisfying (31), and temperature parameter γ.
Required program (on-policy): environment simulator (x′ , r) = Environment∆t (t, x, a)
that takes current time–state pair (t, x) and action a as inputs and generates state x′
at time t + ∆t and instantaneous reward r at time t as outputs. Policy π^ψ(a|t, x) = exp{(1/γ) q^ψ(t, x, a)}.
Required program (off-policy): observations {atk , rtk , xtk+1 }k=0,··· ,K−1 ∪
{xtK , h(xtK )} = Observation(∆t) including the observed actions, rewards, and state
trajectories under the given behavior policy at the sampling time grids with step size ∆t.
Learning procedure:
Initialize θ, ψ.
for episode j = 1 to N do
Initialize k = 0. Observe initial state x0 and store xtk ← x0 .
{On-policy case
while k < K do
Generate action atk ∼ π ψ (·|tk , xtk ).
Apply atk to environment simulator (x, r) = Environment∆t (tk , xtk , atk ), and ob-
serve new state x and reward r as output. Store xtk+1 ← x and rtk ← r.
Update k ← k + 1.
end while
}
{Off-policy case Obtain one observation {atk , rtk , xtk+1 }k=0,··· ,K−1 ∪ {xtK , h(xtK )} =
Observation(∆t).
}
For every k = 0, 1, · · · , K − 1, compute
G_{t_k:T} = e^{−β(T−t_k)} h(x_{t_K}) − J^θ(t_k, x_{t_k}) + Σ_{i=k}^{K−1} e^{−β(t_i−t_k)} [ r_{t_i} − q^ψ(t_i, x_{t_i}, a_{t_i}) ] ∆t.

Update θ and ψ by
θ ← θ + l(j) α_θ Σ_{k=0}^{K−1} (∂J^θ/∂θ)(t_k, x_{t_k}) G_{t_k:T} ∆t,
ψ ← ψ + l(j) α_ψ Σ_{k=0}^{K−1} [ Σ_{i=k}^{K−1} (∂q^ψ/∂ψ)(t_i, x_{t_i}, a_{t_i}) ∆t ] G_{t_k:T} ∆t.

end for
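As a companion to Algorithm 1, the sketch below shows one way the episode-wise quantities G_{t_k:T} could be computed from stored arrays. It is only an illustrative discretization under our own naming conventions (J_values, q_values, rewards, h_T are hypothetical), not the authors' code.

```python
import numpy as np

def episode_returns(J_values, q_values, rewards, h_T, beta, dt):
    """Compute G_{t_k:T} for k = 0, ..., K-1 from one stored episode.

    J_values[k] = J^theta(t_k, x_{t_k}), q_values[k] = q^psi(t_k, x_{t_k}, a_{t_k}),
    rewards[k] = r_{t_k}, h_T = h(x_{t_K}); all arrays have length K.
    """
    K = len(rewards)
    G = np.zeros(K)
    # Backward recursion for the running discounted sum of (r - q) terms.
    tail = 0.0
    for k in range(K - 1, -1, -1):
        tail = (rewards[k] - q_values[k]) * dt + np.exp(-beta * dt) * tail
        G[k] = np.exp(-beta * (K - k) * dt) * h_T - J_values[k] + tail
    return G
```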


Algorithm 2 Offline–Episodic q-Learning Algorithm


Inputs: initial state x0 , horizon T , time step ∆t, number of episodes N , number of mesh
grids K, initial learning rates αθ , αψ and a learning rate schedule function l(·) (a function
of the number of episodes), functional forms of parameterized value function J θ (·, ·) and
q-function q ψ (·, ·, ·) satisfying (31), functional forms of test functions ξ(t, x·∧t , a·∧t ) and
ζ(t, x·∧t , a·∧t ), and temperature parameter γ.
Required program (on-policy): environment simulator (x′ , r) = Environment∆t (t, x, a)
that takes current time–state pair (t, x) and action a as inputs and generates state x′
at time t + ∆t and instantaneous reward r at time t as outputs. Policy π^ψ(a|t, x) = exp{(1/γ) q^ψ(t, x, a)}.
Required program (off-policy): observations {atk , rtk , xtk+1 }k=0,··· ,K−1 ∪
{xtK , h(xtK )} = Observation(∆t) including the observed actions, rewards, and state
trajectories under the given behavior policy at the sampling time grids with step size ∆t.
Learning procedure:
Initialize θ, ψ.
for episode j = 1 to N do
Initialize k = 0. Observe initial state x0 and store xtk ← x0 .
{On-policy case
while k < K do
Generate action atk ∼ π ψ (·|tk , xtk ).
Apply atk to environment simulator (x, r) = Environment∆t (tk , xtk , atk ), and ob-
serve new state x and reward r as outputs. Store xtk+1 ← x and rtk ← r.
Update k ← k + 1.
end while
}
{Off-policy case Obtain one observation {atk , rtk , xtk+1 }k=0,··· ,K−1 ∪ {xtK , h(xtK )} =
Observation(∆t).
}
For every k = 0, 1, · · · , K − 1, compute and store test functions ξtk =
ξ(tk , xt0 , · · · , xtk , at0 , · · · , atk ), ζtk = ζ(tk , xt0 , · · · , xtk , at0 , · · · , atk ).
Compute
∆θ = Σ_{i=0}^{K−1} ξ_{t_i} [ J^θ(t_{i+1}, x_{t_{i+1}}) − J^θ(t_i, x_{t_i}) + r_{t_i} ∆t − q^ψ(t_i, x_{t_i}, a_{t_i}) ∆t − β J^θ(t_i, x_{t_i}) ∆t ],
∆ψ = Σ_{i=0}^{K−1} ζ_{t_i} [ J^θ(t_{i+1}, x_{t_{i+1}}) − J^θ(t_i, x_{t_i}) + r_{t_i} ∆t − q^ψ(t_i, x_{t_i}, a_{t_i}) ∆t − β J^θ(t_i, x_{t_i}) ∆t ].

Update θ and ψ by
θ ← θ + l(j)αθ ∆θ.
ψ ← ψ + l(j)αψ ∆ψ.
end for


Algorithm 3 Online-Incremental q-Learning Algorithm


Inputs: initial state x0 , horizon T , time step ∆t, number of mesh grids K, initial learning
rates αθ , αψ and learning rate schedule function l(·) (a function of the number of episodes),
functional forms of parameterized value function J θ (·, ·) and q-function q ψ (·, ·, ·) satisfying
(31), functional forms of test functions ξ(t, x·∧t , a·∧t ) and ζ(t, x·∧t , a·∧t ), and temperature
parameter γ.
Required program (on-policy): environment simulator (x′ , r) = Environment∆t (t, x, a)
that takes current time–state pair (t, x) and action a as inputs and generates state x′
at time t + ∆t and instantaneous reward r at time t as outputs. Policy π^ψ(a|t, x) = exp{(1/γ) q^ψ(t, x, a)}.
Required program (off-policy): observations {a, r, x′ } = Observation(t, x; ∆t) including
the observed actions, rewards, and state when the current time-state pair is (t, x) under the
given behavior policy at the sampling time grids with step size ∆t.
Learning procedure:
Initialize θ, ψ.
for episode j = 1 to ∞ do
Initialize k = 0. Observe initial state x0 and store xtk ← x0 .
while k < K do
{On-policy case
Generate action atk ∼ π ψ (·|tk , xtk ).
Apply atk to environment simulator (x, r) = Environment∆t (tk , xtk , atk ), and ob-
serve new state x and reward r as outputs. Store xtk+1 ← x and rtk ← r.
}
{Off-policy case
Obtain one observation atk , rtk , xtk+1 = Observation(tk , xtk ; ∆t).
}
Compute test functions ξtk = ξ(tk , xt0 , · · · , xtk , at0 , · · · , atk ), ζtk =
ζ(tk , xt0 , · · · , xtk , at0 , · · · , atk ).
Compute

δ = J^θ(t_{k+1}, x_{t_{k+1}}) − J^θ(t_k, x_{t_k}) + r_{t_k} ∆t − q^ψ(t_k, x_{t_k}, a_{t_k}) ∆t − β J^θ(t_k, x_{t_k}) ∆t,
∆θ = ξ_{t_k} δ,
∆ψ = ζ_{t_k} δ.

Update θ and ψ by
θ ← θ + l(j)αθ ∆θ.
ψ ← ψ + l(j)αψ ∆ψ.
Update k ← k + 1
end while
end for
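For completeness, here is a minimal self-contained sketch of the on-policy episode loop of the online-incremental scheme in Algorithm 3; the environment simulator, policy sampler, and gradient callables are hypothetical stand-ins for the objects described in the algorithm, not the authors' implementation.

```python
def run_online_q_learning(theta, psi, env_step, sample_action, T, dt, beta,
                          J, dJ_dtheta, q, dq_dpsi, lr_schedule,
                          alpha_theta, alpha_psi, x0, num_episodes):
    """On-policy episode loop of the online-incremental scheme (a sketch).

    env_step(t, x, a) -> (x_next, r) plays the role of Environment_dt;
    sample_action(psi, t, x) samples from the policy pi^psi(.|t, x);
    J, q and the gradients dJ_dtheta, dq_dpsi are user-supplied.
    """
    K = int(round(T / dt))
    for j in range(1, num_episodes + 1):
        t, x = 0.0, x0
        for _ in range(K):
            a = sample_action(psi, t, x)
            x_next, r = env_step(t, x, a)
            # TD-style error with the entropy-regularized q-function.
            delta = (J(theta, t + dt, x_next) - J(theta, t, x)
                     + r * dt - q(psi, t, x, a) * dt - beta * J(theta, t, x) * dt)
            # Semi-gradient updates with test functions dJ/dtheta and dq/dpsi.
            theta = theta + lr_schedule(j) * alpha_theta * dJ_dtheta(theta, t, x) * delta
            psi = psi + lr_schedule(j) * alpha_psi * dq_dpsi(psi, t, x, a) * delta
            t, x = t + dt, x_next
    return theta, psi
```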


However, as pointed out already, as much as the two approaches may lead to certain similar
algorithms, they are different conceptually. The ones in Baird (1994) and Tallec et al. (2019)
are still based on discrete-time MDPs and hence depend on the size of time-discretization.
By contrast, the value function and q-function in q-learning are well-defined quantities in
continuous time and independent of any time-discretization.
On the other hand, approximating the value function by J θ and the q-function by q ψ
separately yields a specific approximator of the Q-function by

Q^{θ,ψ}_{∆t}(t, x, a) ≈ J^θ(t, x) + q^ψ(t, x, a)∆t.   (32)

With this Q-function approximator Q^{θ,ψ}_{∆t}, one of our q-learning based algorithms actually
recovers (a modification of) SARSA, one of the most well-known conventional Q-learning
algorithms.
To see this, consider state X_t at time t. Then the conventional SARSA with the approximator Q^{θ,ψ}_{∆t} updates the parameters (θ, ψ) by

θ ← θ + α_θ [ Q^{θ,ψ}_{∆t}(t+∆t, X^{a_t^{π^ψ}}_{t+∆t}, a^{π^ψ}_{t+∆t}) − γ log π^ψ(a^{π^ψ}_{t+∆t} | t+∆t, X^{a_t^{π^ψ}}_{t+∆t}) ∆t
      − Q^{θ,ψ}_{∆t}(t, X_t, a^{π^ψ}_t) + r(t, X_t, a^{π^ψ}_t) ∆t − β Q^{θ,ψ}_{∆t}(t, X_t, a^{π^ψ}_t) ∆t ] (∂Q^{θ,ψ}_{∆t}/∂θ)(t, X_t, a^{π^ψ}_t),

ψ ← ψ + α_ψ [ Q^{θ,ψ}_{∆t}(t+∆t, X^{a_t^{π^ψ}}_{t+∆t}, a^{π^ψ}_{t+∆t}) − γ log π^ψ(a^{π^ψ}_{t+∆t} | t+∆t, X^{a_t^{π^ψ}}_{t+∆t}) ∆t
      − Q^{θ,ψ}_{∆t}(t, X_t, a^{π^ψ}_t) + r(t, X_t, a^{π^ψ}_t) ∆t − β Q^{θ,ψ}_{∆t}(t, X_t, a^{π^ψ}_t) ∆t ] (∂Q^{θ,ψ}_{∆t}/∂ψ)(t, X_t, a^{π^ψ}_t),
   (33)

where a^{π^ψ}_t ∼ π^ψ(·|t, X_t) and a^{π^ψ}_{t+∆t} ∼ π^ψ(·|t+∆t, X^{a_t^{π^ψ}}_{t+∆t}).

Note that by (32), we have

∂Q^{θ,ψ}_{∆t}/∂θ ≈ ∂J^θ/∂θ,   ∂Q^{θ,ψ}_{∆t}/∂ψ ≈ (∂q^ψ/∂ψ) ∆t.

This means that ∂Q^{θ,ψ}_{∆t}/∂ψ has an additional order of ∆t compared with ∂Q^{θ,ψ}_{∆t}/∂θ, resulting
in a much slower rate of update on ψ than on θ. This observation is also made by Tallec et al.
(2019), who suggest as a remedy modifying the update rule for the Q-function by replacing the
partial derivative of the Q-function with that of the advantage function. In our setting, this
means we change ∂Q^{θ,ψ}_{∆t}/∂ψ to ∂q^ψ/∂ψ in the second update rule of (33).


Now, the terms in the brackets in (33) can be further rewritten as:

Q^{θ,ψ}_{∆t}(t+∆t, X^{a_t^{π^ψ}}_{t+∆t}, a^{π^ψ}_{t+∆t}) − γ log π^ψ(a^{π^ψ}_{t+∆t} | t+∆t, X^{a_t^{π^ψ}}_{t+∆t}) ∆t
    − Q^{θ,ψ}_{∆t}(t, X_t, a^{π^ψ}_t) + r(t, X_t, a^{π^ψ}_t) ∆t − β Q^{θ,ψ}_{∆t}(t, X_t, a^{π^ψ}_t) ∆t

= J^θ(t+∆t, X^{a_t^{π^ψ}}_{t+∆t}) − J^θ(t, X_t) + [ q^ψ(t+∆t, X^{a_t^{π^ψ}}_{t+∆t}, a^{π^ψ}_{t+∆t}) − q^ψ(t, X_t, a^{π^ψ}_t) ] ∆t
    + r(t, X_t, a^{π^ψ}_t) ∆t − β J^θ(t, X_t) ∆t − β q^ψ(t, X_t, a^{π^ψ}_t) (∆t)^2 − γ log π^ψ(a^{π^ψ}_{t+∆t} | t+∆t, X^{a_t^{π^ψ}}_{t+∆t}) ∆t

= J^θ(t+∆t, X^{a_t^{π^ψ}}_{t+∆t}) − J^θ(t, X_t) − q^ψ(t, X_t, a^{π^ψ}_t) ∆t + r(t, X_t, a^{π^ψ}_t) ∆t − β J^θ(t, X_t) ∆t
    + [ q^ψ(t+∆t, X^{a_t^{π^ψ}}_{t+∆t}, a^{π^ψ}_{t+∆t}) − γ log π^ψ(a^{π^ψ}_{t+∆t} | t+∆t, X^{a_t^{π^ψ}}_{t+∆t}) ] ∆t − β q^ψ(t, X_t, a^{π^ψ}_t) (∆t)^2

≈ dJ^θ(t, X_t) − q^ψ(t, X_t, a^{π^ψ}_t) dt + r(t, X_t, a^{π^ψ}_t) dt − β J^θ(t, X_t) dt
    + [ q^ψ(t+∆t, X^{a_t^{π^ψ}}_{t+∆t}, a^{π^ψ}_{t+∆t}) − γ log π^ψ(a^{π^ψ}_{t+∆t} | t+∆t, X^{a_t^{π^ψ}}_{t+∆t}) ] dt,

where in the last step we drop the higher-order small term (∆t)^2 and replace the difference terms with differentials; the last bracketed term has mean 0 due to (29).
Comparing (33) with the updating rule in Algorithm 3 using test functions ξ_t = (∂J^θ/∂θ)(t, X_t^{π^ψ})
and ζ_t = (∂q^ψ/∂ψ)(t, X_t^{π^ψ}, a_t^{π^ψ}), we find that the modified SARSA in Tallec et al. (2019) and
Algorithm 3 differ only by a term whose mean is 0 due to the constraint (29) on the q-function. This term is solely driven by the action randomization at time t + ∆t. Hence, the
algorithm in Tallec et al. (2019) is noisier and may converge more slowly than the q-learning one. In Section 7, we will numerically compare the results of the above
Q-learning SARSA and q-learning.
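For intuition, here is a minimal sketch contrasting one ∆t-based (modified) SARSA step of the form (33) with the corresponding q-learning step. The callables J, q, their gradients, and log_pi are hypothetical placeholders, and the SARSA variant already incorporates the remedy of Tallec et al. (2019), using ∂q^ψ/∂ψ in the ψ-update; this is a sketch, not the authors' implementation.

```python
def sarsa_step(theta, psi, t, x, a, r, x_next, a_next, dt, beta, gamma,
               J, dJ_dtheta, q, dq_dpsi, log_pi, alpha_theta, alpha_psi):
    """One Delta-t-based soft SARSA step with Q_{dt} = J^theta + q^psi * dt."""
    Q = lambda tt, xx, aa: J(theta, tt, xx) + q(psi, tt, xx, aa) * dt
    target = (Q(t + dt, x_next, a_next)
              - gamma * log_pi(psi, a_next, t + dt, x_next) * dt
              - Q(t, x, a) + r * dt - beta * Q(t, x, a) * dt)
    theta = theta + alpha_theta * target * dJ_dtheta(theta, t, x)
    psi = psi + alpha_psi * target * dq_dpsi(psi, t, x, a)
    return theta, psi

def q_learning_step(theta, psi, t, x, a, r, x_next, dt, beta,
                    J, dJ_dtheta, q, dq_dpsi, alpha_theta, alpha_psi):
    """One q-learning step; note that it does not use the next action a_next."""
    delta = (J(theta, t + dt, x_next) - J(theta, t, x)
             + r * dt - q(psi, t, x, a) * dt - beta * J(theta, t, x) * dt)
    theta = theta + alpha_theta * delta * dJ_dtheta(theta, t, x)
    psi = psi + alpha_psi * delta * dq_dpsi(psi, t, x, a)
    return theta, psi
```

The two targets differ exactly by the mean-0 bracketed term discussed above, which depends on the randomized next action; this is the extra noise carried by the SARSA variant.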

5 q-Learning Algorithms When Normalizing Constant Is Unavailable


Theorem 7 assumes that the normalizing constant in the Gibbs measure is available so that
one can enforce the constraint (21). So does Theorem 9, because the second constraint in
(27) exactly verifies that the normalizing constant is 1. But it is well known that in most
high-dimensional cases computing this constant is a daunting, and often impossible, task.

5.1 A stronger policy improvement theorem


To overcome the difficulty arising from an unavailable normalizing constant in soft Q-learning, Haarnoja et al. (2018b) propose a general method of using a family of stochastic
policies whose densities can be easily computed to approximate π′. In our current setting, denote by {π^ϕ(·|t, x)}_{ϕ∈Φ} the family of density functions of some tractable distributions, e.g., the multivariate normal distribution π^ϕ(·|t, x) = N(µ^ϕ(t, x), Σ^ϕ(t, x)), where
µ^ϕ(t, x) ∈ R^m and Σ^ϕ(t, x) ∈ S^m_{++}. The learning procedure then proceeds as follows: starting from a policy π^ϕ within this family, project the desired policy exp{(1/γ) q(t, x, ·; π^ϕ)} / ∫_A exp{(1/γ) q(t, x, a; π^ϕ)} da to


the set {π^ϕ(a|t, x)}_{ϕ∈Φ} by minimizing

min_{ϕ′∈Φ} D_KL( π^{ϕ′}(·|t, x) ∥ exp{(1/γ) q(t, x, ·; π^ϕ)} / ∫_A exp{(1/γ) q(t, x, a; π^ϕ)} da )
    ≡ min_{ϕ′∈Φ} D_KL( π^{ϕ′}(·|t, x) ∥ exp{(1/γ) q(t, x, ·; π^ϕ)} ),

where D_KL(f ∥ g) := ∫_A log(f(a)/g(a)) f(a) da is the Kullback–Leibler (KL) divergence of two
positive functions f, g with the same support on A, where f ∈ P(A) is a probability density
function on A.
This procedure may not eventually lead to the desired target policy, but the newly
defined policy still provably improves the previous policy, as indicated in the following
stronger version of the policy improvement theorem that generalizes Theorem 2.

Theorem 10 Given (t, x) ∈ [0, T] × R^d, if two policies π ∈ Π and π′ ∈ Π satisfy

D_KL( π′(·|t, x) ∥ exp{ (1/γ) H(t, x, ·, (∂J/∂x)(t, x; π), (∂²J/∂x²)(t, x; π)) } )
    ≤ D_KL( π(·|t, x) ∥ exp{ (1/γ) H(t, x, ·, (∂J/∂x)(t, x; π), (∂²J/∂x²)(t, x; π)) } ),

then J(t, x; π′) ≥ J(t, x; π).

Theorem 10 in itself is a general result comparing two given policies not necessarily
within any tractable family of densities. However, an implication of the result is that, given
a current tractable policy π^ϕ, as long as we update it to a new tractable policy π^{ϕ′} with

D_KL( π^{ϕ′}(·|t, x) ∥ exp{(1/γ) q(t, x, ·; π^ϕ)} ) ≤ D_KL( π^ϕ(·|t, x) ∥ exp{(1/γ) q(t, x, ·; π^ϕ)} ),

then π^{ϕ′} improves upon π^ϕ. Thus, to update π^ϕ it suffices to solve the optimization problem

min_{ϕ′∈Φ} D_KL( π^{ϕ′}(·|t, x) ∥ exp{(1/γ) q(t, x, ·; π^ϕ)} )
    = min_{ϕ′∈Φ} ∫_A [ log π^{ϕ′}(a|t, x) − (1/γ) q(t, x, a; π^ϕ) ] π^{ϕ′}(a|t, x) da.

The gradient of the above objective function in ϕ′ is

(∂/∂ϕ′) ∫_A [ log π^{ϕ′}(a|t, x) − (1/γ) q(t, x, a; π^ϕ) ] π^{ϕ′}(a|t, x) da
= ∫_A [ log π^{ϕ′}(a|t, x) − (1/γ) q(t, x, a; π^ϕ) ] (∂π^{ϕ′}(a|t, x)/∂ϕ′) da + ∫_A [ (∂/∂ϕ′) log π^{ϕ′}(a|t, x) ] π^{ϕ′}(a|t, x) da
= ∫_A [ log π^{ϕ′}(a|t, x) − (1/γ) q(t, x, a; π^ϕ) ] [ (∂/∂ϕ′) log π^{ϕ′}(a|t, x) ] π^{ϕ′}(a|t, x) da,

where we have noted

∫_A [ (∂/∂ϕ′) log π^{ϕ′}(a|t, x) ] π^{ϕ′}(a|t, x) da = ∫_A (∂/∂ϕ′) π^{ϕ′}(a|t, x) da = (∂/∂ϕ′) ∫_A π^{ϕ′}(a|t, x) da = 0.


Therefore, we may update ϕ each step incrementally by

ϕ ← ϕ − γ α_ϕ dt [ log π^ϕ(a^{π^ϕ}_t | t, X_t) − (1/γ) q(t, X_t, a^{π^ϕ}_t; π^ϕ) ] (∂/∂ϕ) log π^ϕ(a^{π^ϕ}_t | t, X_t),   (34)
where we choose the step size to be specifically γαϕ dt, for the reason that will become
evident momentarily.
The above updating rule seems to indicate that we would need to learn the q-function
q(t, x, a; π^ϕ) associated with the policy π^ϕ. This is doable in view of Theorem 7, noting that
{π^ϕ(·|t, x)}_{ϕ∈Φ} is a tractable family of distributions so that the constraint (21) or (29) can
be computed and enforced. However, we do not need to do so, given that (34) only requires
the q-value along the trajectory {(t, X_t, a^{π^ϕ}_t); 0 ≤ t ≤ T}, instead of the full functional form
q(·, ·, ·; π^ϕ). The former is easier to compute via temporal-difference learning, as will become
evident in the next subsection.

5.2 Connections with policy gradient


Applying Itô’s lemma to J(·, ·; π) and recalling Definition 4, we have the following important
relation between the q-function and the temporal difference of the value function:

q(t, X^π_t, a^π_t; π) dt = dJ(t, X^π_t; π) + r(t, X^π_t, a^π_t) dt − β J(t, X^π_t; π) dt + {···} dW_t.   (35)

Plugging the above into (34) and ignoring the martingale difference term {···} dW_t, whose
mean is 0, the gradient-based updating rule for ϕ becomes

ϕ ← ϕ + α_ϕ [ −γ log π^ϕ(a^{π^ϕ}_t | t, X^{π^ϕ}_t) dt + dJ(t, X^{π^ϕ}_t; π^ϕ) + r(t, X^{π^ϕ}_t, a^{π^ϕ}_t) dt − β J(t, X^{π^ϕ}_t; π^ϕ) dt ]
      × (∂/∂ϕ) log π^ϕ(a^{π^ϕ}_t | t, X^{π^ϕ}_t).   (36)
In this updating rule, we only need to approximate J(·, ·; π ϕ ), which is a problem of PE.
Specifically, denote by J θ (·, ·) the function approximator of J(·, ·; π ϕ ), where θ ∈ Θ, and
we may apply any PE methods developed in Jia and Zhou (2022a) to learn J θ . Given this
value function approximation, the updating rule (36) further specializes to
ϕ ← ϕ + α_ϕ [ −γ log π^ϕ(a^{π^ϕ}_t | t, X^{π^ϕ}_t) dt + dJ^θ(t, X^{π^ϕ}_t) + r(t, X^{π^ϕ}_t, a^{π^ϕ}_t) dt − β J^θ(t, X^{π^ϕ}_t) dt ]
      × (∂/∂ϕ) log π^ϕ(a^{π^ϕ}_t | t, X^{π^ϕ}_t).   (37)
Expression (37) coincides with the updating rule in the policy gradient method established
in Jia and Zhou (2022b) when the regularizer therein is taken to be the entropy. Moreover,
since we learn the value function and the policy simultaneously, the updating rule leads to
actor–critic-type algorithms. Finally, with this method, while learning the value function
may involve martingale conditions as in Jia and Zhou (2022a), learning the policy does not.
Schulman et al. (2017) note the equivalence between soft Q-learning and PG in discrete
time. Here we present the continuous-time counterpart of this equivalence, together with
a theoretical justification, which recovers the PG-based algorithms in Jia and Zhou (2022b)
when combined with suitable PE methods.
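As an illustration of how (37) can be combined with a PE step when the normalizing constant is unavailable, below is a minimal actor–critic sketch with a Gaussian policy family. It is only a sketch under our own naming conventions (mu, log_sigma and the gradient callables are hypothetical), not the authors' implementation; only the mean part of the policy score is used.

```python
import numpy as np

def gaussian_actor_critic_step(theta, phi, t, x, a, r, x_next, dt, beta, gamma,
                               J, dJ_dtheta, mu, dmu_dphi, log_sigma,
                               alpha_theta, alpha_phi):
    """One on-policy step: critic (PE) update for J^theta and actor update (37) for phi.

    The policy is pi^phi(.|t, x) = N(mu(phi, t, x), exp(2 * log_sigma(phi))), and a
    is the action actually sampled from it.
    """
    sigma2 = np.exp(2.0 * log_sigma(phi))
    logp = -0.5 * ((a - mu(phi, t, x)) ** 2 / sigma2 + np.log(2 * np.pi * sigma2))
    score_mu = (a - mu(phi, t, x)) / sigma2 * dmu_dphi(phi, t, x)

    # Entropy-regularized temporal difference; the same quantity appears in both
    # the critic (PE) step and the actor step (37).
    delta = (J(theta, t + dt, x_next) - J(theta, t, x)
             + (r - gamma * logp) * dt - beta * J(theta, t, x) * dt)

    theta = theta + alpha_theta * dJ_dtheta(theta, t, x) * delta
    phi = phi + alpha_phi * delta * score_mu
    return theta, phi
```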


6 Extension to Ergodic Tasks


We now extend the previous study to ergodic tasks, in which the objective is to maximize
the long-run average over the infinite time horizon [0, ∞), the functions b, σ and r do not
depend on time t explicitly, and h = 0. The set of admissible (stationary) policies can be
similarly defined in a straightforward manner. The regularized ergodic objective function is
the long-run average:
lim inf_{T→∞} (1/T) E^{P^W}[ ∫_t^T ∫_A [ r(X̃^π_s, a) − γ log π(a|X̃^π_s) ] π(a|X̃^π_s) da ds | X̃^π_t = x ]
    = lim inf_{T→∞} (1/T) E^P[ ∫_t^T [ r(X^π_s, a^π_s) − γ log π(a^π_s|X^π_s) ] ds | X^π_t = x ],

where γ ≥ 0 is the temperature parameter.


We first present the ergodic version of the Feynman–Kac formula and the corresponding
policy improvement theorem.

Lemma 11 Let π = π(·|·) be a given admissible (stationary) policy.14 Suppose there is a


function J(·; π) ∈ C 2 (Rd ) with polynomial growth and a scalar V (π) ∈ R satisfying

∫_A [ H(x, a, (∂J/∂x)(x; π), (∂²J/∂x²)(x; π)) − γ log π(a|x) ] π(a|x) da − V(π) = 0,   x ∈ R^d.   (38)

Then for any t ≥ 0,

V(π) = lim inf_{T→∞} (1/T) E^{P^W}[ ∫_t^T ∫_A [ r(X̃^π_s, a) − γ log π(a|X̃^π_s) ] π(a|X̃^π_s) da ds | X̃^π_t = x ]
     = lim inf_{T→∞} (1/T) E^P[ ∫_t^T [ r(X^π_s, a^π_s) − γ log π(a^π_s|X^π_s) ] ds | X^π_t = x ].   (39)
Moreover, if there are two admissible policies π and π ′ such that for all x ∈ Rd ,

D_KL( π′(·|x) ∥ exp{ (1/γ) H(x, ·, (∂J/∂x)(x; π), (∂²J/∂x²)(x; π)) } )
    ≤ D_KL( π(·|x) ∥ exp{ (1/γ) H(x, ·, (∂J/∂x)(x; π), (∂²J/∂x²)(x; π)) } ),

then V (π ′ ) ≥ V (π).

We emphasize that the solution to (38) is a pair of (J, V ), where J is a function of


the state and V ∈ R is a scalar. The long-run average payoff depends on neither the initial
state x nor the initial time t, due to ergodicity, and hence is a constant, as (39)
implies. The function J, on the other hand, represents the first-order approximation of the
long-run average and is not unique. Indeed, for any constant c, (J + c, V ) is also a solution
to (38). We refer to V as the value of the underlying problem and still refer to J as the
14. For such ergodic tasks, an admissible policy typically also requires that the process X^π be ergodic (cf. Meyn
and Tweedie 1993).


value function. Lastly, since the value does not depend on the initial time, we will fix the
latter as 0 in the following discussions and applications of ergodic learning tasks.
As with the episodic case, for any admissible policy π, we can define the ∆t-parameterized
Q-function as

Q_∆t(x, a; π)
= E^{P^W}[ ∫_t^{t+∆t} [ r(X^a_s, a) − V(π) ] ds + lim_{T→∞} E^P[ ∫_{t+∆t}^T [ r(X^π_s, a^π_s) − γ log π(a^π_s|X^π_s) − V(π) ] ds | X^a_{t+∆t} ] | X^π_t = x ]
= E^{P^W}[ ∫_t^{t+∆t} [ r(X^a_s, a) − V(π) ] ds + J(X^a_{t+∆t}; π) | X^π_t = x ] − E^{P^W}[ lim_{T→∞} E^P[ J(X^π_T; π) | X^a_{t+∆t} ] | X^π_t = x ]
= E^{P^W}[ ∫_t^{t+∆t} [ r(X^a_s, a) − V(π) ] ds + J(X^a_{t+∆t}; π) − J(x; π) | X^π_t = x ] + J(x; π) − E^{P^W}[ lim_{T→∞} E^P[ J(X^π_T; π) | X^a_{t+∆t} ] | X^π_t = x ]
= E^{P^W}[ ∫_t^{t+∆t} [ H(X^a_s, a, (∂J/∂x)(X^a_s; π), (∂²J/∂x²)(X^a_s; π)) − V(π) ] ds | X^π_t = x ] + J(x; π) − E^{P^W}[ lim_{T→∞} E^P[ J(X^π_T; π) | X^a_{t+∆t} ] | X^π_t = x ]
= J(x; π) + [ H(x, a, (∂J/∂x)(x; π), (∂²J/∂x²)(x; π)) − V(π) ] ∆t − J̄ + O((∆t)²),

where we assumed that lim_{T→∞} E^P[ J(X^π_T; π) | X^π_t = x ] = J̄ is a constant for any (t, x) ∈ [0, +∞) × R^d.
The corresponding q-function is then defined as

q(x, a; π) = H(x, a, (∂J/∂x)(x; π), (∂²J/∂x²)(x; π)) − V(π).

As a counterpart to Theorem 7, we have the following theorem to characterize the value


function, the value, and the q-function for both on-policy and off-policy ergodic learning
problems. A counterpart to Theorem 9 can be similarly established, and we leave details
to the reader.

Theorem 12 Let an admissible policy π, a number V̂, a function Ĵ ∈ C²(R^d) with polynomial growth, and a continuous function q̂ : R^d × A → R be given satisfying

lim_{T→∞} (1/T) E[Ĵ(X^π_T)] = 0,   ∫_A [ q̂(x, a) − γ log π(a|x) ] π(a|x) da = 0,  ∀x ∈ R^d.   (40)

Then


(i) V̂, Ĵ and q̂ are respectively the value, the value function and the q-function associated
with π if and only if for all x ∈ R^d, the following process

    Ĵ(X^π_t) + ∫_0^t [ r(X^π_s, a^π_s) − q̂(X^π_s, a^π_s) − V̂ ] ds   (41)

is an ({F_t}_{t≥0}, P)-martingale, where {X^π_t, 0 ≤ t < ∞} is the solution to (6) with X^π_0 = x.

(ii) If V̂, Ĵ and q̂ are respectively the value, the value function and the q-function associated
with π, then for any admissible π′ and any x ∈ R^d, the following process

    Ĵ(X^{π′}_t) + ∫_0^t [ r(X^{π′}_u, a^{π′}_u) − q̂(X^{π′}_u, a^{π′}_u) − V̂ ] du   (42)

is an ({F_t}_{t≥0}, P)-martingale, where {X^{π′}_t, 0 ≤ t < ∞} is the solution to (6) under π′
with initial condition X^{π′}_0 = x.

(iii) If there exists an admissible π′ such that for all x ∈ R^d, (42) is an ({F_t}_{t≥0}, P)-martingale where X^{π′}_0 = x, then V̂, Ĵ and q̂ are respectively the value, the value function
and the q-function associated with π.

Moreover, in any of the three cases above, if it holds further that π(a|x) = exp{(1/γ) q̂(x, a)} / ∫_A exp{(1/γ) q̂(x, a)} da,
then π is the optimal policy and V̂ is the optimal value.

Based on Theorem 12, we can design corresponding q-learning algorithms for ergodic
problems, in which we learn V, J θ and q ψ at the same time. These algorithms have natural
connections with SARSA as well as the PG-based algorithms developed in Jia and Zhou
(2022b). The discussions are similar to those in Subsections 4.2 and 5.2 and hence are omit-
ted here. An online algorithm for ergodic tasks is presented as an example; see Algorithm
4.
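Before turning to the applications, the following minimal sketch indicates how the martingale characterization in Theorem 12 translates into one online update that learns (θ, ψ, V) jointly, mirroring the structure of Algorithm 4; the interface (J, q, their gradients, and the learning-rate arguments) is our own illustrative choice rather than the authors' code.

```python
def ergodic_q_learning_step(theta, psi, V, x, a, r, x_next, dt,
                            J, dJ_dtheta, q, dq_dpsi,
                            alpha_theta, alpha_psi, alpha_V):
    """One online update for an ergodic task (a sketch of Algorithm 4's core step).

    delta discretizes d J(X_t) + [r - q - V] dt, whose martingality characterizes
    (V, J, q) in Theorem 12.
    """
    delta = J(theta, x_next) - J(theta, x) + (r - q(psi, x, a) - V) * dt
    theta = theta + alpha_theta * dJ_dtheta(theta, x) * delta
    V = V + alpha_V * delta
    psi = psi + alpha_psi * dq_dpsi(psi, x, a) * delta
    return theta, psi, V
```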

7 Applications
7.1 Mean–variance portfolio selection
We first review the formulation of the exploratory mean–variance portfolio selection prob-
lem, originally proposed by Wang and Zhou (2020) and later revisited by Jia and Zhou
(2022b).15 The investment universe consists of one risky asset (e.g., a stock index) and one
risk-free asset (e.g., a saving account) whose risk-free interest rate is r. The price of the
risky asset {St , 0 ≤ t ≤ T } is governed by a geometric Brownian motion with drift µ and
volatility σ > 0, defined on a filtered probability space (Ω, F, PW ; {FtW }0≤t≤T ). An agent
has a fixed investment horizon 0 < T < ∞ and an initial endowment x0 . A self-financing
portfolio is represented by the real-valued adapted process a = {at , 0 ≤ t ≤ T }, where at
15. There is a vast literature on classical model-based (non-exploratory) continuous-time mean–variance
models formulated as stochastic control problems; see e.g. Zhou and Li (2000); Lim and Zhou (2002);
Zhou and Yin (2003) and the references therein.


Algorithm 4 q-Learning Algorithm for Ergodic Tasks


Inputs: initial state x0 , time step ∆t, initial learning rates αθ , αψ , αV and learning rate
schedule function l(·) (a function of time), functional forms of the parameterized value
function J θ (·) and q-function q ψ (·, ·) satisfying (40), functional forms of test functions
ξ(t, x·∧t , a·∧t ), ζ(t, x·∧t , a·∧t ), and temperature parameter γ.
Required program (on-policy): an environment simulator (x′ , r) = Environment∆t (x, a)
that takes initial state x and action a as inputs and generates a new state x′ at ∆t and an
instantaneous reward r as outputs. Policy π^ψ(a|x) = exp{(1/γ) q^ψ(x, a)} / ∫_A exp{(1/γ) q^ψ(x, a)} da.
Required program (off-policy): observations {a, r, x′ } = Observation(x; ∆t) including
the observed actions, rewards, and state when the current state is x under the given behavior
policy at the sampling time grids with step size ∆t.
Learning procedure:
Initialize θ, ψ, V . Initialize k = 0. Observe the initial state x0 and store xtk ← x0 .
loop
{On-policy case
Generate action a ∼ π ψ (·|x).
Apply a to environment simulator (x′ , r) = Environment∆t (x, a), and observe new
state x′ and reward r as outputs. Store xtk+1 ← x′ .
}
{Off-policy case
Obtain one observation atk , rtk , xtk+1 = Observation(xtk ; ∆t).
}
Compute test functions ξtk = ξ(tk , xt0 , · · · , xtk , at0 , · · · , atk ), ζtk =
ζ(tk , xt0 , · · · , xtk , at0 , · · · , atk ).
Compute
δ = J θ (x′ ) − J θ (x) + r∆t − q ψ (x, a)∆t − V ∆t,
∆θ = ξtk δ,
∆V = δ,
∆ψ = ζtk δ.
Update θ, V and ψ by
θ ← θ + l(k∆t)αθ ∆θ,
V ← V + l(k∆t)αV ∆V,
ψ ← ψ + l(k∆t)αψ ∆ψ.
Update x ← x′ and k ← k + 1.
end loop

is the discounted dollar value invested in the risky asset at time t. Then her discounted
wealth process follows

dX_t = a_t [ (µ − r) dt + σ dW_t ] = a_t d(e^{−rt} S_t) / (e^{−rt} S_t),   X_0 = x_0.


The agent aims to minimize the variance of the discounted value of the portfolio at time
T while maintaining the expected return to be a certain level; that is,

min_a Var(X^a_T),   subject to E[X^a_T] = z,   (43)

where z is the target expected return, and the variance and expectation throughout this
subsection are with respect to the probability measure PW .
The exploratory formulation of this problem with entropy regularizer is equivalent to
J(t, x; w) = − max_π E[ −(X̃^π_T − w)² − γ ∫_t^T log π(a^π_s | s, X̃^π_s) ds | X̃^π_t = x ] + (w − z)²,   (44)

subject to

dX̃^π_s = (µ − r) ∫_R a π(a|s, X̃^π_s) da ds + σ √( ∫_R a² π(a|s, X̃^π_s) da ) dW_s;   X̃^π_t = x,

where w is the Lagrange multiplier introduced to relax the expected return constraint; see
Wang and Zhou (2020) for a derivation of this formulation. Note here we artificially add
the minus sign to transform the variance minimization to a maximization problem to be
consistent with our general formulation.
We now follow Wang and Zhou (2020) and Jia and Zhou (2022b) to parameterize the
value function by

J^θ(t, x; w) = (x − w)² e^{−θ₃(T−t)} + θ₂(t² − T²) + θ₁(t − T) − (w − z)².

Moreover, we parameterize the q-function by

q^ψ(t, x, a; w) = − (e^{−ψ₁−ψ₃(T−t)} / 2) ( a + ψ₂(x − w) )² − (γ/2) [ log 2πγ + ψ₁ + ψ₃(T − t) ],
which is derived to satisfy the constraint (29), as explained in Example 1. The policy
associated with this parametric q-function is π^ψ(·|x; w) = N(−ψ₂(x − w), γ e^{ψ₁+ψ₃(T−t)}). In
addition to θ = (θ₁, θ₂, θ₃)^⊤ and ψ = (ψ₁, ψ₂, ψ₃)^⊤, we also learn the Lagrange multiplier
w by stochastic approximation in the same way as in Wang and Zhou (2020) and Jia and
Zhou (2022b). The full algorithm is summarized in Algorithm 5.16
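To illustrate how these parametric forms are used inside Algorithm 5, here is a minimal sketch of the value-function, q-function and policy-sampling routines for the mean–variance problem; the function names and the array layout of theta and psi are our own illustrative conventions, not the authors' code.

```python
import numpy as np

def J_theta(theta, t, x, w, T, z):
    """Parameterized value function for the mean-variance problem."""
    return ((x - w) ** 2 * np.exp(-theta[2] * (T - t))
            + theta[1] * (t ** 2 - T ** 2) + theta[0] * (t - T) - (w - z) ** 2)

def q_psi(psi, t, x, a, w, T, gamma):
    """Parameterized q-function, normalized to satisfy the constraint (29)."""
    prec = np.exp(-psi[0] - psi[2] * (T - t))
    return (-0.5 * prec * (a + psi[1] * (x - w)) ** 2
            - 0.5 * gamma * (np.log(2 * np.pi * gamma) + psi[0] + psi[2] * (T - t)))

def sample_action(psi, t, x, w, T, gamma, rng):
    """Sample from the Gaussian policy N(-psi2(x-w), gamma*exp(psi1+psi3(T-t)))."""
    mean = -psi[1] * (x - w)
    var = gamma * np.exp(psi[0] + psi[2] * (T - t))
    return mean + np.sqrt(var) * rng.standard_normal()
```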
We then compare by simulations the performances of three learning algorithms: the
q-learning based Algorithm 5 in this paper, the PG-based Algorithm 4 in Jia and Zhou
(2022b), and the ∆t-parameterized Q-learning algorithm presented in Appendix B. We
conduct simulations with the following configurations: µ ∈ {0, ±0.1, ±0.3, ±0.5}, σ ∈
{0.1, 0.2, 0.3, 0.4}, T = 1, x0 = 1, z = 1.4. Other tuning parameters in all the algorithms are
chosen as γ = 0.1, m = 10, α_θ = α_ψ = 0.001, α_w = 0.005, and l(j) = 1/j^{0.51}. To have more
realistic scenarios, we generate 20 years of training data and compare the three algorithms
with the same dataset for N = 20, 000 episodes with a batch size 32. More precisely, 32
trajectories with length T are drawn from the training set to update the parameters to
16. Here we present an offline algorithm as example. Online algorithms can also be devised following the
general study in the previous sections.


Algorithm 5 Offline–Episodic q-Learning Mean–Variance Algorithm


Inputs: initial state x0 , horizon T , time step ∆t, number of episodes N , number of time
grids K, initial learning rates αθ , αψ , αw and learning rate schedule function l(·) (a function
of the number of episodes), and temperature parameter γ.
Required program: a market simulator x′ = Market∆t (t, x, a) that takes current time-
state pair (t, x) and action a as inputs and generates state x′ at time t + ∆t.
Learning procedure:
Initialize θ, ψ, w.
for episode j = 1 to N do
Initialize k = 0. Observe the initial state x and store xtk ← x.
while k < K do
Generate action atk ∼ π ψ (·|tk , xtk ; w).
Compute and store the test functions ξ_{t_k} = (∂J^θ/∂θ)(t_k, x_{t_k}; w), ζ_{t_k} = (∂q^ψ/∂ψ)(t_k, x_{t_k}, a_{t_k}; w).
Apply a_{t_k} to the market simulator x = Market_∆t(t_k, x_{t_k}, a_{t_k}), and observe the output
new state x. Store xtk+1 .
Update k ← k + 1.
end while
Store the terminal wealth X_T^{(j)} ← x_{t_K}.
Compute
∆θ = Σ_{i=0}^{K−1} ξ_{t_i} [ J^θ(t_{i+1}, x_{t_{i+1}}; w) − J^θ(t_i, x_{t_i}; w) − q^ψ(t_i, x_{t_i}, a_{t_i}; w) ∆t ],
∆ψ = Σ_{i=0}^{K−1} ζ_{t_i} [ J^θ(t_{i+1}, x_{t_{i+1}}; w) − J^θ(t_i, x_{t_i}; w) − q^ψ(t_i, x_{t_i}, a_{t_i}; w) ∆t ].

Update θ and ψ by
θ ← θ + l(j)αθ ∆θ.
ψ ← ψ + l(j)αψ ∆ψ.
Update w (Lagrange multiplier) every m episodes:
if j ≡ 0 mod m then
w ← w − α_w (1/m) Σ_{i=j−m+1}^{j} ( X_T^{(i)} − z ).

end if
end for


be learned for N times. After training completes, for each market configuration we apply
the learned policy out-of-sample repeatedly for 100 times and compute the average mean,
variance and Sharpe ratio for each algorithm. We are particularly interested in the im-
pact of time discretization; so we experiment with three different time discretization steps:
∆t ∈ {1/25, 1/250, 1/2500}. Note that all the algorithms rely on time discretization at the imple-
mentation stage, where ∆t is the sampling frequency required for execution or computation
of numerical integral for PG and q-learning. For the Q-learning, ∆t plays a dual role: it is
both the parameter in its definition and the time discretization size in implementation.
The numerical results are reported in Tables 1 – 3, each table corresponding to a different
value of ∆t. We observe that for any market configuration, the Q-learning is almost always
the worst performer in terms of all the metrics, and it even diverges in certain high-volatility,
low-return environments (when keeping the same learning rate as in the other cases).
The other two algorithms have very close overall performances, although the q-learning
tends to outperform in highly volatile market environments. On the other hand, notably, the
results change only slightly with different time discretizations, for all the three algorithms.
This observation does not align with the claim in Tallec et al. (2019) that the classical Q-
learning is very sensitive to time-discretization, at least for this specific application problem.
In addition, we notice that q-learning and PG-based algorithm produce comparable results
under most market scenarios, while the terminal variance produced by q-learning tends to
be significantly higher than that by PG when |µ| is small or σ is large. In such cases, the
ground truth solution requires higher leverage in risky assets in order to reach the high
target expected return (40% annual return). However, it appears that overall q-learning
strives to learn to keep up with the high target return and as a result ends up with high
terminal variance in those cases. By contrast, the PG tends to maintain lower variance at
the cost of significantly underachieving the target return.

7.2 Ergodic linear–quadratic control


Consider the ergodic LQ control problem where state responds to actions in a linear way

dXt = (AXt + Bat )dt + (CXt + Dat )dWt , X0 = x0 ,

and the goal is to maximize the long term average quadratic payoff
lim inf_{T→∞} (1/T) E[ ∫_0^T r(X_t, a_t) dt | X_0 = x_0 ],

with r(x, a) = −( (M/2) x² + R x a + (N/2) a² + P x + Q a ).
The exploratory formulation of this problem with entropy regularizer is equivalent to
lim inf_{T→∞} (1/T) E[ ∫_0^T r(X^π_t, a^π_t) dt − γ log π(a^π_t|X^π_t) dt | X^π_0 = x_0 ].

This problem falls into the general formulation of an ergodic task studied in Section
6; so we directly implement Algorithm 4 in our simulation. In particular, our function
approximators for J and q are quadratic functions:
J^θ(x) = θ₁ x² + θ₂ x,   q^ψ(x, a) = − (e^{−ψ₃}/2) (a − ψ₁ x − ψ₂)² − (γ/2) (log 2πγ + ψ₃),


Table 1: Out-of-sample performance comparison in terms of mean, variance and


Sharpe ratio when data are generated by geometric Brownian motion with
∆t = 1/25. We compare three algorithms: “Policy Gradient” described in Algorithm 4 in
Jia and Zhou (2022b), “Q-Learning” described in Appendix B, and “q-Learning” described
in Algorithm 5. The first two columns specify market configurations (parameters of the
geometric Brownian motion simulated). “Mean”, “Variance”, and “SR” refer respectively to
the average mean, variance, and Sharpe ratio of the terminal wealth under the corresponding
learned policy over 100 independent runs. We highlight the largest average Sharpe ratio
in bold. “NA” refers to the case where the algorithm diverges. The discretization size is
∆t = 1/25.

Policy Gradient Q-Learning q-Learning


µ σ
Mean Variance SR Mean Variance SR Mean Variance SR
-0.5 0.1 1.409 0.003 7.134 1.409 0.004 6.663 1.408 0.004 6.805
-0.3 0.1 1.421 0.012 3.880 1.409 0.012 3.707 1.412 0.012 3.762
-0.1 0.1 1.382 0.094 1.285 1.327 0.070 1.263 1.351 0.081 1.272
0.0 0.1 1.090 0.216 0.193 1.098 0.302 0.201 1.108 0.321 0.202
0.1 0.1 1.312 0.147 0.835 1.265 0.108 0.827 1.292 0.133 0.831
0.3 0.1 1.420 0.016 3.312 1.402 0.016 3.181 1.407 0.016 3.224
0.5 0.1 1.405 0.004 6.422 1.403 0.005 6.026 1.403 0.004 6.147
-0.5 0.2 1.417 0.014 3.526 1.416 0.016 3.308 1.416 0.016 3.377
-0.3 0.2 1.445 0.060 1.915 1.428 0.059 1.841 1.434 0.059 1.867
-0.1 0.2 1.335 0.341 0.608 1.402 0.801 0.628 1.405 0.589 0.631
0.0 0.2 1.053 0.539 0.070 1.168 7.995 0.100 1.136 3.422 0.100
0.1 0.2 1.248 0.437 0.380 1.349 2.046 0.411 1.341 1.005 0.412
0.3 0.2 1.445 0.083 1.634 1.422 0.078 1.580 1.431 0.081 1.600
0.5 0.2 1.413 0.018 3.173 1.412 0.020 2.992 1.411 0.019 3.050
-0.5 0.3 1.433 0.039 2.305 1.432 0.043 2.180 1.432 0.041 2.223
-0.3 0.3 1.458 0.162 1.247 1.468 0.196 1.214 1.484 0.217 1.229
-0.1 0.3 1.226 0.569 0.335 1.624 8.959 0.415 1.556 5.558 0.415
0.0 0.3 1.030 0.777 0.036 NA NA NA 1.187 23.269 0.066
0.1 0.3 1.151 0.670 0.202 1.539 13.741 0.271 1.482 13.321 0.271
0.3 0.3 1.448 0.231 1.046 1.475 0.322 1.042 1.485 0.313 1.053
0.5 0.3 1.431 0.048 2.073 1.428 0.052 1.972 1.429 0.050 2.008
-0.5 0.4 1.457 0.093 1.679 1.460 0.102 1.608 1.462 0.102 1.638
-0.3 0.4 1.417 0.346 0.877 1.670 2.363 0.898 1.602 1.176 0.905
-0.1 0.4 1.147 0.805 0.205 NA NA NA 1.711 20.852 0.306
0.0 0.4 1.017 0.893 0.016 NA NA NA 1.211 76.040 0.048
0.1 0.4 1.084 0.825 0.109 NA NA NA 1.539 25.371 0.196
0.3 0.4 1.408 0.379 0.747 NA NA NA 1.609 1.992 0.776
0.5 0.4 1.446 0.102 1.510 1.456 0.125 1.455 1.461 0.128 1.479


Table 2: Out-of-sample performance comparison in terms of mean, variance and


Sharpe ratio when data are generated by geometric Brownian motion with
∆t = 1/250. We compare three algorithms: “Policy Gradient” described in Algorithm 4 in
Jia and Zhou (2022b), “Q-Learning” described in Appendix B, and “q-Learning” described
in Algorithm 5. The first two columns specify market configurations (parameters of the
geometric Brownian motion simulated). “Mean”, “Variance”, and “SR” refer respectively to
the average mean, variance, and Sharpe ratio of the terminal wealth under the corresponding
learned policy over 100 independent runs. We highlight the largest average Sharpe ratio
in bold. “NA” refers to the case where the algorithm diverges. The discretization size is
∆t = 1/250.

Policy Gradient Q-Learning q-Learning


µ σ
Mean Variance SR Mean Variance SR Mean Variance SR
-0.5 0.1 1.409 0.003 7.130 1.407 0.004 6.671 1.407 0.004 6.805
-0.3 0.1 1.419 0.012 3.878 1.407 0.012 3.710 1.410 0.012 3.762
-0.1 0.1 1.380 0.092 1.285 1.325 0.069 1.264 1.347 0.079 1.272
0 0.1 1.092 0.218 0.201 1.096 0.253 0.201 1.106 0.306 0.202
0.1 0.1 1.314 0.148 0.835 1.267 0.109 0.827 1.293 0.133 0.831
0.3 0.1 1.424 0.017 3.311 1.406 0.017 3.183 1.411 0.017 3.224
0.5 0.1 1.411 0.004 6.421 1.409 0.005 6.032 1.409 0.004 6.147
-0.5 0.2 1.415 0.014 3.524 1.414 0.016 3.312 1.413 0.015 3.377
-0.3 0.2 1.439 0.057 1.915 1.423 0.056 1.843 1.428 0.056 1.867
-0.1 0.2 1.343 0.329 0.620 1.369 0.460 0.628 1.390 0.493 0.631
0 0.2 1.054 0.546 0.070 1.173 9.036 0.100 1.128 2.519 0.100
0.1 0.2 1.241 0.413 0.372 1.350 1.356 0.411 1.340 0.925 0.412
0.3 0.2 1.450 0.084 1.634 1.426 0.079 1.581 1.435 0.081 1.600
0.5 0.2 1.420 0.018 3.173 1.418 0.020 2.995 1.418 0.019 3.050
-0.5 0.3 1.428 0.037 2.304 1.426 0.041 2.183 1.426 0.039 2.223
-0.3 0.3 1.467 0.171 1.248 1.454 0.169 1.215 1.466 0.178 1.229
-0.1 0.3 1.230 0.552 0.326 1.661 10.321 0.415 1.495 2.852 0.415
0 0.3 1.031 0.845 0.036 1.236 38.868 0.066 1.184 21.710 0.066
0.1 0.3 1.142 0.716 0.181 1.541 16.335 0.266 1.456 6.433 0.271
0.3 0.3 1.466 0.230 1.064 1.470 0.260 1.043 1.486 0.281 1.053
0.5 0.3 1.438 0.049 2.073 1.434 0.053 1.974 1.435 0.051 2.008
-0.5 0.4 1.451 0.085 1.679 1.448 0.090 1.611 1.449 0.087 1.638
-0.3 0.4 1.424 0.281 0.890 1.535 0.638 0.898 1.544 0.612 0.905
-0.1 0.4 1.153 0.777 0.192 NA NA NA 1.747 22.299 0.306
0 0.4 1.019 1.000 0.021 NA NA NA 1.233 77.875 0.048
0.1 0.4 1.091 1.006 0.110 NA NA NA 1.696 48.274 0.200
0.3 0.4 1.400 0.384 0.704 1.685 2.737 0.771 1.595 1.189 0.776
0.5 0.4 1.463 0.114 1.510 1.463 0.122 1.457 1.466 0.121 1.479


Table 3: Out-of-sample performance comparison in terms of mean, variance and


Sharpe ratio when data are generated by geometric Brownian motion with
∆t = 1/2500. We compare three algorithms: “Policy Gradient” described in Algorithm 4 in
Jia and Zhou (2022b), “Q-Learning” described in Appendix B, and “q-Learning” described
in Algorithm 5. The first two columns specify market configurations (parameters of the
geometric Brownian motion simulated). “Mean”, “Variance”, and “SR” refer respectively to
the average mean, variance, and Sharpe ratio of the terminal wealth under the corresponding
learned policy over 100 independent runs. We highlight the largest average Sharpe ratio
in bold. “NA” refers to the case where the algorithm diverges. The discretization size is
∆t = 1/2500.

Policy Gradient Q-Learning q-Learning


µ σ
Mean Variance SR Mean Variance SR Mean Variance SR
-0.5 0.1 1.408 0.003 7.130 1.406 0.004 6.677 1.406 0.004 6.805
-0.3 0.1 1.418 0.012 3.878 1.406 0.012 3.713 1.409 0.012 3.762
-0.1 0.1 1.376 0.090 1.285 1.324 0.069 1.264 1.345 0.077 1.272
0.0 0.1 1.092 0.218 0.201 1.096 0.259 0.201 1.105 0.300 0.202
0.1 0.1 1.316 0.149 0.835 1.269 0.111 0.827 1.294 0.133 0.831
0.3 0.1 1.425 0.017 3.311 1.407 0.017 3.185 1.412 0.017 3.224
0.5 0.1 1.412 0.004 6.420 1.410 0.005 6.037 1.410 0.004 6.147
-0.5 0.2 1.413 0.014 3.524 1.411 0.016 3.315 1.411 0.015 3.377
-0.3 0.2 1.435 0.056 1.915 1.419 0.055 1.844 1.424 0.056 1.867
-0.1 0.2 1.328 0.295 0.632 1.366 0.455 0.629 1.384 0.486 0.631
0.0 0.2 1.059 0.610 0.074 1.174 9.894 0.100 1.125 2.521 0.100
0.1 0.2 1.254 0.468 0.380 1.360 1.867 0.411 1.340 0.952 0.412
0.3 0.2 1.450 0.084 1.634 1.428 0.080 1.582 1.436 0.081 1.600
0.5 0.2 1.422 0.018 3.173 1.419 0.020 2.998 1.419 0.019 3.050
-0.5 0.3 1.425 0.036 2.304 1.422 0.040 2.185 1.422 0.039 2.223
-0.3 0.3 1.459 0.169 1.248 1.449 0.167 1.216 1.459 0.177 1.229
-0.1 0.3 1.225 0.597 0.327 1.690 11.223 0.415 1.487 2.943 0.415
0.0 0.3 1.039 0.857 0.044 1.206 36.278 0.066 1.183 23.798 0.066
0.1 0.3 1.169 0.745 0.202 1.457 14.390 0.266 1.462 8.590 0.271
0.3 0.3 1.455 0.215 1.064 1.488 0.356 1.044 1.486 0.296 1.053
0.5 0.3 1.438 0.049 2.073 1.435 0.053 1.975 1.436 0.051 2.008
-0.5 0.4 1.446 0.084 1.679 1.443 0.089 1.612 1.444 0.086 1.638
-0.3 0.4 1.395 0.231 0.873 1.529 0.619 0.899 1.537 0.622 0.905
-0.1 0.4 1.175 0.846 0.229 NA NA NA 1.754 24.745 0.306
0.0 0.4 1.026 0.967 0.025 NA NA NA 1.170 1.66E+09 0.048
0.1 0.4 1.110 0.973 0.129 NA NA NA 1.661 59.872 0.200
0.3 0.4 1.402 0.416 0.720 1.646 3.402 0.771 1.607 1.743 0.776
0.5 0.4 1.456 0.108 1.510 1.469 0.135 1.458 1.467 0.128 1.479


Figure 1: Running average rewards of three RL algorithms. A single state trajec-


tory is generated with length T = 10^6 and discretized at ∆t = 0.1, to which three online
algorithms apply: “Policy Gradient” described in Algorithm 3 in Jia and Zhou (2022b),
“Q-Learning” described in Appendix B, and “q-Learning” described in Algorithm 4. We
repeat the experiments for 100 times for each method and plot the average reward over
time with the shaded area indicating standard deviation. Two dashed horizontal lines are
respectively the omniscient optimal average reward without exploration when the model
parameters are known and the omniscient optimal average reward less the exploration cost.

where the form of the q-function is derived to satisfy the constraint (29), as discussed in
Example 1. The policy associated with this parameterized q-function is π^ψ(·|x) = N(ψ₁ x + ψ₂, γ e^{ψ₃}). In addition, we have another parameter V that stands for the long-term average.
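For concreteness, the following is a minimal sketch of how the ergodic LQ environment and the parametric forms above could be simulated and fed into an online update of the type in Algorithm 4. The constants follow the experiment configuration reported below, while the function names and the Euler–Maruyama discretization are our own illustrative choices.

```python
import numpy as np

# Environment parameters used in the experiments (A=-1, B=C=0, D=1) and the
# reward coefficients M=N=Q=2, R=P=1, with temperature gamma=0.1.
A, B, C, D = -1.0, 0.0, 0.0, 1.0
M, N, R, P, Q = 2.0, 2.0, 1.0, 1.0, 2.0
GAMMA = 0.1

def env_step(x, a, dt, rng):
    """One Euler-Maruyama step of dX = (AX + Ba)dt + (CX + Da)dW, plus the reward."""
    dw = np.sqrt(dt) * rng.standard_normal()
    x_next = x + (A * x + B * a) * dt + (C * x + D * a) * dw
    reward = -(0.5 * M * x ** 2 + R * x * a + 0.5 * N * a ** 2 + P * x + Q * a)
    return x_next, reward

def J_theta(theta, x):
    return theta[0] * x ** 2 + theta[1] * x

def q_psi(psi, x, a):
    return (-0.5 * np.exp(-psi[2]) * (a - psi[0] * x - psi[1]) ** 2
            - 0.5 * GAMMA * (np.log(2 * np.pi * GAMMA) + psi[2]))

def sample_action(psi, x, rng):
    """Gaussian policy N(psi1*x + psi2, gamma*exp(psi3)) induced by q_psi."""
    return (psi[0] * x + psi[1]
            + np.sqrt(GAMMA * np.exp(psi[2])) * rng.standard_normal())
```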
We then compare our online q-learning algorithm against two theoretical benchmarks
and two other online algorithms, in terms of the running average reward during the learning
process. The first benchmark is the omniscient optimal level, namely, the maximum long
term average reward that can be achieved with perfect knowledge about the environment
and reward (and hence exploration is unnecessary and only deterministic policies are con-
sidered). The second benchmark is the omniscient optimal level less the exploration cost,
which is the maximum long term average reward that can be achieved by the agent who
is however forced to explore under entropy regularization. The other two algorithms are
the PG-based ergodic algorithm proposed in (Jia and Zhou, 2022b, Algorithm 3) and the
∆t-based Q-learning algorithm presented in Appendix B.
In our simulation, to ensure the stationarity of the controlled state process, we set
A = −1, B = C = 0 and D = 1. Moreover, we set x0 = 0, M = N = Q = 2, R = P = 1,
and γ = 0.1. Learning rates for all algorithms are initialized as αψ = 0.001, and decay


(a) The path of learned ψ1 . (b) The path of learned ψ2 . (c) The path of learned eψ3 .

Figure 2: Paths of learned parameters of three RL algorithms. A single state


trajectory is generated with length T = 10^6 and discretized at ∆t = 0.1, to which three
online algorithms apply: “Policy Gradient” described in Algorithm 3 in Jia and Zhou
(2022b), “Q-Learning” described in Appendix B, and “q-Learning” described in Algorithm
4. All the policies are restricted to be in the parametric form of π^ψ(·|x) = N(ψ₁ x + ψ₂, e^{ψ₃}).
The omniscient optimal policy is ψ₁* ≈ −0.354, ψ₂* ≈ −0.708, e^{ψ₃} ≈ 0.035, shown in the
dashed line. We repeat the experiments for 100 times for each method and plot as the
shaded area the standard deviation of the learned parameters. The width of each shaded
area is twice the corresponding standard deviation.

(a) ∆t = 1 (b) ∆t = 0.1 (c) ∆t = 0.01

Figure 3: Running average rewards of three RL algorithms with different time


discretization sizes. A single state trajectory is generated with length T = 10^5 and
discretized at different step sizes: ∆t = 1 in (a), ∆t = 0.1 in (b), and ∆t = 0.01 in
(c). For each step size, we apply three online algorithms: “Policy Gradient” described in
Algorithm 3 in Jia and Zhou (2022b), “Q-Learning” described in Appendix B, and “q-
Learning” described in Algorithm 4. We repeat the experiments for 100 times for each
method and plot the average reward over time with the shaded area indicating standard
deviation.

according to l(t) = 1 / max{1, √(log t)}. We also repeat the experiment 100 times under different
seeds to generate samples.
First, we fix ∆t = 0.1 and run the three learning algorithms for a sufficiently long time
T = 10^6. All the parameters to be learned are initialized as 0, except for the initial variance
of the policies which is set to be 1. Figure 1 plots the running average reward trajectories,
along with their standard deviations (which are all too small to be visible in the figure),


as the learning process proceeds. Among the three methods, both the PG and q-learning
algorithms perform significantly better and converge to the optimal level much faster than
the Q-learning. The q-learning algorithm is indeed slightly better than the PG one. Figure
2 shows the iterated values of the learned parameters for the three algorithms. Evidently,
the Q-learning has the slowest convergence. The other two algorithms, while eventually converging at
similar speeds, exhibit quite different behaviors on their way to convergence in
learning ψ₁ and ψ₂. The PG seems to learn these parameters faster than the q-learning
at the initial stage, but then quickly overshoots the optimal level and takes a long time to
correct it, while the q-learning appears much smoother and does not have such an issue.
Lastly, we vary the ∆t value to examine its impact on the performances of the three
algorithms. We vary ∆t ∈ {1, 0.1, 0.01} and set T = 10^5 for each experiment. The learning
rates and parameter initializations are fixed and the same for all the three methods. We
repeat the experiment for 100 times and present the results in terms of the average run-
ning reward in Figure 3. We use the shaded area to denote the standard deviation. It is
clear that the Q-learning algorithm is very sensitive to ∆t. In particular, its performance
worsens significantly as ∆t becomes smaller. When ∆t = 0.01, the algorithm almost has
no improvement at all over time. This drawback of sensitivity in ∆t is consistent with the
observations made in Tallec et al. (2019). On the other hand, the other two algorithms
show remarkable robustness across different discretization sizes.

7.3 Off-policy ergodic linear–quadratic control


The experiments reported in Subsection 7.2 are for on-policy learning. In this subsection
we revisit the problem, but now assume that we have to work off-policy. Specifically, we
have a sufficiently long observation of the trajectories of state, action and reward generated
under a behavior policy that is not optimal. In our experiment, we take the behavior policy
to be a_t ∼ N(0, 1).
We still examine the three algorithms: “Policy Gradient” described in Algorithm 3
in Jia and Zhou (2022b), “Q-Learning” described in Appendix B, and “q-Learning” de-
scribed in Algorithm 4, with different time discretization sizes. Because off-policy learning
is often necessitated in cases where online interactions are costly (Uehara et al., 2022),
we restrict ourselves to an offline dataset generated under the behavior policy. We vary
∆t ∈ {1, 0.1, 0.01} and set T = 10^6 for each experiment. We repeat the experiments 100
times and present the results in Figure 4. Because it is generally hard or impossible
to implement the learned policy and observe the resulting state and reward data in an off-
policy setting, we focus on the learned parameters of the policy to see if they converge to the
desired optimal value. The horizontal axis in Figure 4 stands for the number of iterations
that are used to update the policy parameters.
In general, policy gradient methods are not applicable in the off-policy setting. This
is confirmed by our experiments which show that the corresponding algorithm diverges
quickly in all the cases. Q-learning still suffers from the sensitivity with respect to time
discretization. The q-learning algorithm is the most stable one and converges in all scenarios.
It is worth noting that in Figure 4-(a), the Q-learning and q-learning algorithms converge
to the same limits, which are however different from the true optimal parameter values of
the continuous-time problem. This is because the theoretical integrals involved in the value


functions are not well approximated by finite sums with coarse time discretization (∆t = 1);
hence the corresponding martingale conditions under the learned parameters are not close
to the theoretical continuous-time martingale conditions.

(a) ∆t = 1 (b) ∆t = 0.1 (c) ∆t = 0.01

Figure 4: Paths of learned parameters of three RL algorithms with different


time discretization sizes. A single state trajectory is generated with length T = 10^6
and discretized at different step sizes: ∆t = 1 in (a), ∆t = 0.1 in (b), and ∆t = 0.01 in
(c), under the behavior policy a_t ∼ N(0, 1). From top to bottom are the paths of learned
ψ₁, ψ₂, e^{ψ₃}, respectively. For each step size, we apply three algorithms: “Policy Gradient”
described in Algorithm 3 in Jia and Zhou (2022b), “Q-Learning” described in Appendix
B, and “q-Learning” described in Algorithm 4. We repeat the experiments 100 times
for each method and plot the averages of the learned parameters over iterations, with the shaded area indicating
standard deviation.

8 Conclusion
The previous trilogy on continuous-time RL under the entropy-regularized exploratory dif-
fusion process framework (Wang et al., 2020; Jia and Zhou, 2022a,b) has respectively
studied three key elements: optimal samplers for exploration, PE and PG. PG is a special


instance of policy improvement, and in the discrete-time MDP literature, it can be done
through Q-function. However, there was no satisfying Q-learning theory in continuous time,
so a different route was taken in Jia and Zhou (2022b) – to turn the PG task into a PE
problem.
This paper fills this gap and adds yet another essential building block to the theoretical
foundation of continuous-time RL, by developing a q-learning theory commensurate with
the continuous-time setting, attacking the general policy improvement problem, and cov-
ering both on- and off-policy learning. Although the conventional Q-function provides no
information about actions in continuous time, its first-order approximation, which we call
the q-function, does. Moreover, it turns out that the essential component of the q-function
relevant to policy updating is the Hamiltonian associated with the problem, the latter hav-
ing already played a vital role in the classical model-based stochastic control theory. This
fact, together with the expression of the optimal stochastic policies explicitly derived in
Wang et al. (2020), in turn, explains and justifies the widely used Boltzmann exploration
in classical RL for MDPs.
We characterize the q-function as the compensator to maintain the martingality of a
process consisting of the value function and cumulative reward, both in the on-policy and off-
policy settings. This characterization enables us to use the martingale-based PE techniques
developed in Jia and Zhou (2022a) to simultaneously learn the q-function and the value
function. Interestingly, the martingality is on the same process as in PE (Jia and Zhou,
2022a) but with respect to an enlarged filtration containing the policy randomization. The
TD version of the resulting algorithm links to the well-known SARSA algorithm in classical
Q-learning.
A significant and outstanding research question is an analysis of the convergence rates
of q-learning algorithms. The existing convergence rate results for discrete-time Q-learning
cannot be extended to continuous-time q-learning due to the continuous state space and
general nonlinear function approximations, along with the fact that the behaviors of the Q-
function and the q-function can be fundamentally different. A possible remedy is to carry
out the convergence analysis within the general framework of stochastic approximation,
which is poised to be the subject of a substantial future study.
We make a final observation to conclude this paper. The classical model-based approach
typically separates “estimation” and “optimization”, à la Wonham’s separation theorem
(Wonham, 1968) or the “plug-in” method. This approach first learns a model (including
formulating and estimating its coefficients/parameters) and then optimizes over the learned
model. Model-free (up to some basic dynamic structure such as diffusion processes or
Markov chains) RL takes a different route: it skips estimating a model and learns optimizing
policies directly via PG or Q/q-learning. The q-learning theory in this paper suggests
that, to learn policies, one needs to learn the q-function or the Hamiltonian. In other
words, it is the Hamiltonian which is a specific aggregation of the model coefficients, rather
than each and every individual model coefficient, that needs to be learned/estimated for
optimization. Clearly, from a pure computational standpoint, estimating a single function
– the Hamiltonian – is much more efficient and robust than estimating multiple functions
(b, σ, r, h in the present paper) in terms of avoiding or reducing over-parameterization,
sensitivity to errors and accumulation of errors. More importantly, however, (35) hints
that the Hamiltonian/q-function can be learned through temporal differences of the value


function, so the task of learning and optimizing can be accomplished in a data-driven way.
This would not be the case if we chose to learn individual model coefficients separately. This
observation, we believe, highlights the fundamental differences between the model-based and
model-free approaches.

Acknowledgments and Disclosure of Funding

Zhou is supported by a start-up grant and the Nie Center for Intelligent Asset Manage-
ment at Columbia University. His work is also part of a Columbia-CityU/HK collaborative
project that is supported by the InnoHK Initiative, The Government of the HKSAR, and
the AIFT Lab. We are grateful for comments from seminar and conference participants at
Chinese University of Hong Kong, University of Iowa, Columbia University, Seoul National
University, POSTECH, Ritsumeikan University, Shanghai University of Finance and Eco-
nomics, Fudan University, The 2022 INFORMS Annual Meeting in Indianapolis, The 11th
Annual Meeting of FE Branch in OR Society of China in Shijiazhuang, Conference in Mem-
ory of Tomas Björk in Stockholm, and Post-Bachelier Congress Workshop in Hong Kong.
In particular, we benefit from discussions with Jiro Akahori, Xuefeng Gao, Bong-Gyu Jang,
Hyeng Keun Koo, Lingfei Li, Hideo Nagai, Jun Sekine, Wenpin Tang, and David Yao. We
are especially indebted to the three anonymous referees for their constructive and detailed
comments that have led to an improved version of the paper.


Appendix A. Soft Q-learning in Discrete Time


We review the soft Q-learning for discrete-time Markov decision processes (MDPs) here and
present the analogy to some of the major results developed in the main text. It is interesting
that, to the best of our knowledge, the martingale perspective for MDPs has not been explicitly presented in the literature.
This in turn suggests that a continuous-time study may provide new aspects and insights
even for discrete-time RL.
For simplicity, we consider a time-homogeneous MDP X = {Xt , t = 0, 1, 2, · · · } with a
state space X , an action space A, and a transition matrix P(X1 = x′ |X0 = x, a0 = a) =:
p(x′|x, a). Both X and A are finite sets. The expected reward at (x, a) is r(x, a), with a
discount factor β ∈ (0, 1). The agent's total expected reward is E[ Σ_{t=0}^∞ β^t r(X^a_t, a_t) ]. A
(stochastic) policy is denoted by π(·|x) ∈ P(A), which is a probability mass function on A.

Appendix A1. Q-function associated with an arbitrary policy


Define the value function associated with a given policy π by

J(x; π) = E[ Σ_{t=0}^∞ β^t [ r(X^π_t, a^π_t) − γ log π(a^π_t|X^π_t) ] | X^π_0 = x ]
        = E[ r(x, a^π_0) − γ log π(a^π_0|x) ] + β E[ J(X^π_1; π) | X^π_0 = x ],   (45)

and the Q-function associated with π by

Q(x, a; π) = r(x, a) + E[ Σ_{t=1}^∞ β^t [ r(X^π_t, a^π_t) − γ log π(a^π_t|X^π_t) ] | X^π_0 = x, a^π_0 = a ]
           = r(x, a) + β E[ J(X^a_1; π) | X^π_0 = x, a^π_0 = a ].   (46)

Adding the entropy term −γ log π(a|x) to both sides of (46), integrating over a and noting
(45), we obtain a relation between the Q-function and the value function:

E[ Q(x, a^π; π) − J(x; π) − γ log π(a^π|x) ] = 0.   (47)

Moreover, substituting (47) for J(X^a_1; π) on the right-hand side of (46), we further obtain

Q(x, a; π) = r(x, a) + β E[ Q(X^a_1, a^π_1; π) − γ log π(a^π_1|X^a_1) | X^π_0 = x, a^π_0 = a ].   (48)

Applying suitable algorithms (e.g., stochastic approximation) to solve the equation (48)
leads to the classical SARSA algorithm.
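For instance, a stochastic-approximation solver of (48) for a tabular MDP could look like the following minimal sketch; it is an illustrative implementation of the soft SARSA idea reviewed here, with the tabular setting and all names being our own assumptions.

```python
import numpy as np

def soft_sarsa(env_step, n_states, n_actions, beta, gamma_temp, alpha,
               n_steps, x0, rng):
    """Tabular soft SARSA solving (48) by stochastic approximation.

    env_step(x, a) -> (x_next, reward); beta is the discount factor and
    gamma_temp the entropy temperature.
    """
    Q = np.zeros((n_states, n_actions))

    def policy(x):
        # Softmax (Gibbs) policy induced by the current Q estimate.
        logits = Q[x] / gamma_temp
        p = np.exp(logits - logits.max())
        return p / p.sum()

    x = x0
    a = rng.choice(n_actions, p=policy(x))
    for _ in range(n_steps):
        x_next, r = env_step(x, a)
        a_next = rng.choice(n_actions, p=policy(x_next))
        # Soft SARSA target from (48), with the entropy correction term.
        target = r + beta * (Q[x_next, a_next]
                             - gamma_temp * np.log(policy(x_next)[a_next]))
        Q[x, a] += alpha * (target - Q[x, a])
        x, a = x_next, a_next
    return Q
```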
On the other hand, rewrite (46) as

J(x; π) = r(x, a) − [ Q(x, a; π) − J(x; π) ] + β E[ J(X^a_1; π) | X^π_0 = x, a^π_0 = a ].   (49)

Recall that A(x, a; π) := Q(x, a; π) − J(x; π) is called the advantage function for MDPs.
The q-function in the main text (see (17)) is hence an advantage rate function in the
continuous-time setting. In particular, (47) is analogous to the second constraint on the
q-function in (29).


Further, (49) implies that
$$
M_s^{\pi'} := \beta^s J(X_s^{\pi'}; \pi) + \sum_{t=0}^{s-1}\beta^t\big[r(X_t^{\pi'}, a_t^{\pi'}) - A(X_t^{\pi'}, a_t^{\pi'}; \pi)\big]
$$
is an ({F_s}_{s≥0}, P)-martingale under any policy π′, where F_s is the σ-algebra generated by
X_0^{π′}, a_0^{π′}, …, X_s^{π′}, a_s^{π′}. To see this, we compute
$$
\begin{aligned}
&\mathbb{E}\Big[\beta^s J(X_s^{\pi'}; \pi) + \sum_{t=0}^{s-1}\beta^t\big[r(X_t^{\pi'}, a_t^{\pi'}) - A(X_t^{\pi'}, a_t^{\pi'}; \pi)\big] \,\Big|\, \mathcal{F}_{s-1}\Big]\\
&= \beta^s\mathbb{E}\big[J(X_s^{\pi'}; \pi) \,\big|\, X_{s-1}^{\pi'}, a_{s-1}^{\pi'}\big] + \sum_{t=0}^{s-1}\beta^t\big[r(X_t^{\pi'}, a_t^{\pi'}) - A(X_t^{\pi'}, a_t^{\pi'}; \pi)\big]\\
&= \beta^{s-1}\big[J(X_{s-1}^{\pi'}; \pi) - r(X_{s-1}^{\pi'}, a_{s-1}^{\pi'}) + A(X_{s-1}^{\pi'}, a_{s-1}^{\pi'}; \pi)\big] + \sum_{t=0}^{s-1}\beta^t\big[r(X_t^{\pi'}, a_t^{\pi'}) - A(X_t^{\pi'}, a_t^{\pi'}; \pi)\big]\\
&= \beta^{s-1} J(X_{s-1}^{\pi'}; \pi) + \sum_{t=0}^{s-2}\beta^t\big[r(X_t^{\pi'}, a_t^{\pi'}) - A(X_t^{\pi'}, a_t^{\pi'}; \pi)\big].
\end{aligned}
\tag{50}
$$
Note that here F_s is larger than the usual "historical information set" up to time s,
denoted by H_s, which is the σ-algebra generated by X_0^{π′}, a_0^{π′}, …, X_s^{π′} without observing
the last a_s^{π′} (i.e., H_s is F_s excluding a_s^{π′}). Because M_s^{π′} is measurable with respect to the smaller
σ-algebra H_s, it is automatically an ({H_s}_{s≥0}, P)-martingale.

The above analysis shows that M_s^{π′} is a martingale for any π′ – whether the target policy
π or a behavior one – so long as the advantage function A is taken as the compensator in
defining M_s^{π′}. This is the essential reason why Q-learning works for both on-policy and off-policy
learning. However, the conclusion is not necessarily true for other types of learning
methods. For example, let us take a different compensator and investigate the process
β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t [r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′}|X_t^{π′})]. The continuous-time counterpart
of this process is the martingale required in the policy gradient method; see (Jia and Zhou,
2022b, Theorem 4).
Now, conditioned on H_{s−1} and when π′ = π, we have
$$
\begin{aligned}
&\mathbb{E}\Big[\beta^s J(X_s^{\pi}; \pi) + \sum_{t=0}^{s-1}\beta^t\big[r(X_t^{\pi}, a_t^{\pi}) - \gamma\log\pi(a_t^{\pi}|X_t^{\pi})\big] \,\Big|\, \mathcal{H}_{s-1}\Big]\\
&= \beta^s\mathbb{E}\Big[\mathbb{E}\big[J(X_s^{\pi}; \pi) \,\big|\, X_{s-1}^{\pi}, a_{s-1}^{\pi}\big] \,\Big|\, X_{s-1}^{\pi}\Big] + \sum_{t=0}^{s-1}\beta^t\big[r(X_t^{\pi}, a_t^{\pi}) - \gamma\log\pi(a_t^{\pi}|X_t^{\pi})\big]\\
&= \beta^{s-1}\mathbb{E}\Big[J(X_{s-1}^{\pi}; \pi) - r(X_{s-1}^{\pi}, a_{s-1}^{\pi}) + A(X_{s-1}^{\pi}, a_{s-1}^{\pi}; \pi) \,\Big|\, X_{s-1}^{\pi}\Big] + \sum_{t=0}^{s-1}\beta^t\big[r(X_t^{\pi}, a_t^{\pi}) - \gamma\log\pi(a_t^{\pi}|X_t^{\pi})\big]\\
&= \beta^{s-1} J(X_{s-1}^{\pi}; \pi) + \sum_{t=0}^{s-2}\beta^t\big[r(X_t^{\pi}, a_t^{\pi}) - \gamma\log\pi(a_t^{\pi}|X_t^{\pi})\big],
\end{aligned}
$$


where the last equality is due to (47), which can be applied because a^π_{s−1} is generated under
the policy π. This shows that the process is an ({H_s}_{s≥0}, P)-martingale when π′ = π,
which in turn underpins on-policy learning.
However, when conditioned on F_{s−1}, we have
$$
\begin{aligned}
&\mathbb{E}\Big[\beta^s J(X_s^{\pi'}; \pi) + \sum_{t=0}^{s-1}\beta^t\big[r(X_t^{\pi'}, a_t^{\pi'}) - \gamma\log\pi(a_t^{\pi'}|X_t^{\pi'})\big] \,\Big|\, \mathcal{F}_{s-1}\Big]\\
&= \beta^s\mathbb{E}\big[J(X_s^{\pi'}; \pi) \,\big|\, X_{s-1}^{\pi'}, a_{s-1}^{\pi'}\big] + \sum_{t=0}^{s-1}\beta^t\big[r(X_t^{\pi'}, a_t^{\pi'}) - \gamma\log\pi(a_t^{\pi'}|X_t^{\pi'})\big]\\
&= \beta^{s-1}\big[J(X_{s-1}^{\pi'}; \pi) - r(X_{s-1}^{\pi'}, a_{s-1}^{\pi'}) + A(X_{s-1}^{\pi'}, a_{s-1}^{\pi'}; \pi)\big] + \sum_{t=0}^{s-1}\beta^t\big[r(X_t^{\pi'}, a_t^{\pi'}) - \gamma\log\pi(a_t^{\pi'}|X_t^{\pi'})\big]\\
&= \beta^{s-1} J(X_{s-1}^{\pi'}; \pi) + \sum_{t=0}^{s-2}\beta^t\big[r(X_t^{\pi'}, a_t^{\pi'}) - \gamma\log\pi(a_t^{\pi'}|X_t^{\pi'})\big]\\
&\qquad + \beta^{s-1}\big[A(X_{s-1}^{\pi'}, a_{s-1}^{\pi'}; \pi) - \gamma\log\pi(a_{s-1}^{\pi'}|X_{s-1}^{\pi'})\big].
\end{aligned}
$$
So the same process is not an ({F_s}_{s≥0}, P)-martingale in general, unless A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) =
γ log π(a_{s−1}^{π′}|X_{s−1}^{π′}) for any X_{s−1}^{π′}, a_{s−1}^{π′}. Two conclusions can be drawn from this example: on
one hand, policy gradient works on-policy but not off-policy; on the other hand, it
is important to choose the correct filtration to ensure martingality, and hence the correct
test functions, in designing learning algorithms.
The analysis in (50) yields a joint martingale characterization of (J, A), analogous to
Theorem 7. It is curious that, to the best of our knowledge, such a martingale condition, while
not hard to derive in the MDP setting, has not been explicitly introduced or utilized to
study Q-learning in the existing RL literature.
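To illustrate how this characterization could be used algorithmically, the following is a hypothetical tabular sketch (not an algorithm from the paper) that enforces, by stochastic approximation, the sampled increment condition implied by (49), namely E[r + βJ(X_1) − A(X_0, a_0) − J(X_0) | X_0, a_0] = 0, together with the constraint (47); the array interface, step sizes, and re-centering step are our own choices.

```python
import numpy as np

def joint_J_A_step(J, A, pi, x, a, r, x_next, beta, gamma, lr_J, lr_A):
    """One stochastic-approximation step on the joint (J, A) martingale condition.

    J  : array of shape (num_states,), value-function estimates
    A  : array of shape (num_states, num_actions), advantage estimates
    pi : array of the same shape as A, with the target policy probabilities
    """
    delta = r + beta * J[x_next] - J[x] - A[x, a]   # sampled martingale increment
    J[x] += lr_J * delta
    A[x, a] += lr_A * delta
    # re-center A(x, .) so that constraint (47), E_pi[A - gamma*log pi] = 0, holds
    A[x] -= np.dot(pi[x], A[x] - gamma * np.log(pi[x]))
    return J, A
```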

Appendix A2. Q-function associated with the optimal policy


Now consider the value function and Q-function that are associated with the optimal policy,
denoted by J* and Q* respectively. Recall that the Bellman equation implies
$$
J^*(x) = \sup_{\pi}\ \mathbb{E}_{a\sim\pi}\Big[r(x, a) - \gamma\log\pi(a|x) + \beta\mathbb{E}\big[J^*(X_1^{a}) \,\big|\, X_0 = x,\ a_0 = a\big]\Big]. \tag{51}
$$

It follows from (46) that
$$
Q^*(x, a) = r(x, a) + \beta\mathbb{E}\big[J^*(X_1^{a}) \,\big|\, X_0 = x,\ a_0 = a\big]. \tag{52}
$$

Hence (51) becomes


$$
J^*(x) = \sup_{\pi}\ \mathbb{E}_{a\sim\pi}\big[Q^*(x, a) - \gamma\log\pi(a|x)\big].
$$
Solving the optimization on the right-hand side of the above, we get the optimal policy π*(a|x) ∝ exp{(1/γ)Q*(x, a)}, or
$$
\pi^*(a|x) = \frac{\exp\{\frac{1}{\gamma}Q^*(x, a)\}}{\sum_{a\in\mathcal{A}}\exp\{\frac{1}{\gamma}Q^*(x, a)\}}.
$$
Denote
$$
\operatorname{soft}_{\gamma}\max_{a} Q^*(x, a) := \mathbb{E}_{a\sim\pi^*}\big[Q^*(x, a) - \gamma\log\pi^*(a|x)\big] \equiv \gamma\log\sum_{a\in\mathcal{A}}\exp\Big\{\frac{1}{\gamma}Q^*(x, a)\Big\}.
$$


Then the Bellman equation becomes J*(x) = soft_γ max_a Q*(x, a) and (52) reduces to
$$
Q^*(x, a) = r(x, a) + \beta\mathbb{E}\Big[\operatorname{soft}_{\gamma}\max_{a'} Q^*(X_1^{a}, a') \,\Big|\, X_0 = x,\ a_0 = a\Big], \tag{53}
$$

which is the foundation for (off-policy) Q-learning algorithms; see e.g. (Sutton and Barto,
2018, p. 131).
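As an illustration, a minimal tabular sketch of the stochastic-approximation step suggested by (53) could look as follows; the interface and the use of scipy's logsumexp for numerical stability are our own choices.

```python
import numpy as np
from scipy.special import logsumexp

def soft_q_learning_step(Q, x, a, r, x_next, beta, gamma, lr):
    """One off-policy soft Q-learning update based on equation (53)."""
    # soft_gamma max_{a'} Q(x', a') = gamma * log sum_{a'} exp(Q(x', a') / gamma)
    soft_max_next = gamma * logsumexp(Q[x_next] / gamma)
    target = r + beta * soft_max_next
    Q[x, a] += lr * (target - Q[x, a])
    return Q
```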
On the other hand, because J*(x) = soft_γ max_a Q*(x, a) = γ log Σ_{a∈A} exp{(1/γ)Q*(x, a)}, we have
$$
\sum_{a\in\mathcal{A}}\exp\Big\{\frac{1}{\gamma}\big[Q^*(x, a) - J^*(x)\big]\Big\} = 1. \tag{54}
$$

Recall the advantage function A∗ (x, a) = Q∗ (x, a) − J ∗ (x). Thus (54) is analogous to (25).
Moreover, rearranging (52) yields
$$
J^*(x) = r(x, a) - \big[Q^*(x, a) - J^*(x)\big] + \beta\mathbb{E}\big[J^*(X_1^{a}) \,\big|\, X_0 = x,\ a_0 = a\big].
$$

Based on a similar derivation to that in Appendix A1, we obtain that
$$
\beta^s J^*(X_s^{\pi}) + \sum_{t=0}^{s-1}\beta^t\big[r(X_t^{\pi}, a_t^{\pi}) - A^*(X_t^{\pi}, a_t^{\pi})\big]
$$
is an ({F_s}_{s≥0}, P)-martingale for any policy π; hence it is analogous to the martingale
characterization of the optimal q-function in Theorem 9.

Appendix B. ∆t-Parameterized Q-Learning


Instead of the (little) q-learning approach developed in this paper, one can also apply the
conventional (big) Q-learning method to the Q-function Q_∆t(t, x, a; π) defined in (15). Note
that ∆t > 0 becomes a parameter in the latter approach. In this Appendix, we review this
∆t-parameterized Q-learning and the associated Q-learning algorithms, which are used in
the simulation experiments of the paper for comparison purposes.
Given a policy π and a time step ∆t > 0, Q-learning focuses on learning Q_∆t(t, x, a; π).
As with q-learning, there are two distinct scenarios: when the normalizing constant is
easily available and when it is not. In the following, we denote by Q^ψ_∆t the function approximator
to the Q-function, where ψ ∈ Ψ ⊂ ℝ^{L_ψ} is the finite-dimensional parameter vector to
be learned, and by π^ψ the associated policy, where π^ψ(a|t, x) ∝ exp{ (1/(γ∆t)) Q^ψ_∆t(t, x, a) }.

Appendix B1. When normalizing constant is available

When the normalizing constant ∫_A exp{ (1/(γ∆t)) Q^ψ_∆t(t, x, a) } da can be explicitly calculated,
we can obtain the exact expression of π^ψ as
$$
\pi^{\psi}(a|t, x) = \frac{\exp\{\frac{1}{\gamma\Delta t} Q^{\psi}_{\Delta t}(t, x, a)\}}{\int_{\mathcal{A}}\exp\{\frac{1}{\gamma\Delta t} Q^{\psi}_{\Delta t}(t, x, a)\}\,da}.
$$


Then, the conventional Q-learning method (e.g., SARSA; Sutton and Barto 2018, Chapter 10)
leads to the following updating rule for the Q-function parameters ψ:
$$
\psi \leftarrow \psi + \alpha_{\psi}\Big[Q^{\psi}_{\Delta t}(t+\Delta t, X^{a_t}_{t+\Delta t}, a_{t+\Delta t}) - \gamma\log\pi^{\psi}(a_{t+\Delta t}|t+\Delta t, X^{a_t}_{t+\Delta t})\Delta t
- Q^{\psi}_{\Delta t}(t, X_t, a_t) + r(t, X_t, a_t)\Delta t - \beta Q^{\psi}_{\Delta t}(t, X_t, a_t)\Delta t\Big]\frac{\partial Q^{\psi}_{\Delta t}}{\partial\psi}(t, X_t, a_t),
\tag{55}
$$
where a_t ∼ π^ψ(·|t, X_t) and a_{t+∆t} ∼ π^ψ(·|t + ∆t, X_{t+∆t}).
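A minimal sketch of how (55) might be implemented for a generic differentiable approximator is given below; the callables Q, grad_Q, and log_pi and their signatures are placeholders of our own rather than objects defined in the paper.

```python
import numpy as np

def q_learning_step_55(psi, Q, grad_Q, log_pi, t, x, a, r, x_next, a_next,
                       dt, beta, gamma, alpha_psi):
    """One parameter update following (55).

    Q(psi, t, x, a)       -> scalar Q^psi_dt(t, x, a)
    grad_Q(psi, t, x, a)  -> array dQ^psi_dt/dpsi(t, x, a)
    log_pi(psi, a, t, x)  -> scalar log pi^psi(a | t, x)
    """
    td_error = (Q(psi, t + dt, x_next, a_next)
                - gamma * log_pi(psi, a_next, t + dt, x_next) * dt
                - Q(psi, t, x, a)
                + r * dt
                - beta * Q(psi, t, x, a) * dt)
    return psi + alpha_psi * td_error * grad_Q(psi, t, x, a)
```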

Appendix B2. When normalizing constant is unavailable


When the normalizing constant is not available, we cannot compute the exact expression of
π ψ . In this case we follow an analysis analogous to that in Subsection 5.1 by using a family
of policies {π ϕ (a|t, x)}ϕ∈Φ whose densities can be easily calculated to approximate π ψ .
Theorem 10 implies that if we start from a policy π^ϕ, ϕ ∈ Φ, and evaluate its corresponding
Q-function, (1/(γ∆t)) Q^ψ_∆t(t, x, a) ≈ (1/γ) H(t, x, ·, ∂J/∂x(t, x; π^ϕ), ∂²J/∂x²(t, x; π^ϕ)) + (1/(γ∆t)) J(t, x; π^ϕ), then
a new policy π^{ϕ′}, ϕ′ ∈ Φ, improves π^ϕ so long as
$$
D_{\mathrm{KL}}\Big(\pi^{\phi'}(\cdot|t, x)\,\Big\|\,\exp\Big\{\frac{1}{\gamma\Delta t} Q^{\psi}_{\Delta t}(t, x, \cdot)\Big\}\Big) \le D_{\mathrm{KL}}\Big(\pi^{\phi}(\cdot|t, x)\,\Big\|\,\exp\Big\{\frac{1}{\gamma\Delta t} Q^{\psi}_{\Delta t}(t, x, \cdot)\Big\}\Big).
$$
Thus, following the same lines of argument as in Subsection 5.1, we derive the updating
rule of ϕ at each step as
$$
\phi \leftarrow \phi - \alpha_{\phi}\Big(\log\pi^{\phi}(a_t|t, X_t) - \frac{1}{\gamma\Delta t} Q^{\psi}_{\Delta t}(t, X_t, a_t)\Big)\frac{\partial}{\partial\phi}\log\pi^{\phi}(a_t|t, X_t),
$$
where a_t ∼ π^ϕ(·|t, X_t).
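The corresponding policy update could be coded along the following lines; the callables log_pi_phi, grad_log_pi_phi, and Q_psi are again our own placeholder interfaces.

```python
def policy_update_step(phi, log_pi_phi, grad_log_pi_phi, Q_psi, t, x, a,
                       dt, gamma, alpha_phi):
    """One update of the policy parameters phi per the rule displayed above."""
    err = log_pi_phi(phi, a, t, x) - Q_psi(t, x, a) / (gamma * dt)
    return phi - alpha_phi * err * grad_log_pi_phi(phi, a, t, x)
```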

Appendix B3. Ergodic tasks


For the general regularized ergodic problem formulated in Section 6, we define the Q-function:
$$
Q_{\Delta t}(x, a; \pi) = J(x; \pi) + \Big[H\Big(x, a, \frac{\partial J}{\partial x}(x; \pi), \frac{\partial^2 J}{\partial x^2}(x; \pi)\Big) - V(\pi)\Big]\Delta t - \bar{J}. \tag{56}
$$

Then we have all the parallel results regarding the relations among the Q-function, the value
function, and the intertemporal increment of the Q-function, and we can devise algorithms
accordingly.
For illustration, we only present the results for the case where
$$
\pi^{\psi}(a|x) = \frac{\exp\{\frac{1}{\gamma\Delta t} Q^{\psi}_{\Delta t}(x, a)\}}{\int_{\mathcal{A}}\exp\{\frac{1}{\gamma\Delta t} Q^{\psi}_{\Delta t}(x, a)\}\,da}
$$
is explicitly available.


The updating rules of the parameters ψ and V are
$$
\begin{aligned}
\psi &\leftarrow \psi + \alpha_{\psi}\Big[Q^{\psi}_{\Delta t}(X^{a_t}_{t+\Delta t}, a_{t+\Delta t}) - \gamma\log\pi^{\psi}(a_{t+\Delta t}|X^{a_t}_{t+\Delta t})\Delta t - Q^{\psi}_{\Delta t}(X_t, a_t) + r(X_t, a_t)\Delta t - V\Delta t\Big]\frac{\partial Q^{\psi}_{\Delta t}}{\partial\psi}(X_t, a_t),\\
V &\leftarrow V + \alpha_{V}\Big[Q^{\psi}_{\Delta t}(X^{a_t}_{t+\Delta t}, a_{t+\Delta t}) - \gamma\log\pi^{\psi}(a_{t+\Delta t}|X^{a_t}_{t+\Delta t})\Delta t - Q^{\psi}_{\Delta t}(X_t, a_t) + r(X_t, a_t)\Delta t - V\Delta t\Big],
\end{aligned}
\tag{57}
$$
where a_t ∼ π^ψ(·|X_t) and a_{t+∆t} ∼ π^ψ(·|X_{t+∆t}).
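For concreteness, a sketch of a joint (ψ, V) update implementing (57) might read as follows; the callables and their signatures are our own placeholders.

```python
def ergodic_q_learning_step(psi, V, Q, grad_Q, log_pi, x, a, r, x_next, a_next,
                            dt, gamma, alpha_psi, alpha_V):
    """One joint update of (psi, V) following (57)."""
    delta = (Q(psi, x_next, a_next)
             - gamma * log_pi(psi, a_next, x_next) * dt
             - Q(psi, x, a)
             + r * dt
             - V * dt)
    psi_new = psi + alpha_psi * delta * grad_Q(psi, x, a)
    V_new = V + alpha_V * delta
    return psi_new, V_new
```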

Appendix B4. Q-learning for mean–variance portfolio selection


Motivated by the form of the solution in Wang and Zhou (2020), we parameterize the
Q-function by
$$
Q^{\psi}_{\Delta t}(t, x, a; w) = -e^{-\psi_3(T-t)}\Big[(x-w)^2 + \psi_2 a(x-w) + \frac{1}{2}e^{-\psi_1}a^2\Big] + \psi_4(t^2 - T^2) + \psi_5(t - T) + (w - z)^2.
$$
Here we use −e^{−ψ_1} < 0 to guarantee the concavity of the Q-function in a so that the
function can be maximized.
This Q-function in turn gives rise to an explicit form of policy, which is Gaussian:
$$
\pi^{\psi}(\cdot|t, x; w) = \mathcal{N}\Big(-\psi_2 e^{\psi_1}(x - w),\ \gamma\Delta t\, e^{\psi_3(T-t)+\psi_1}\Big) \propto \exp\Big\{\frac{1}{\gamma\Delta t} Q^{\psi}_{\Delta t}(t, x, \cdot)\Big\}.
$$

We need the following derivative in the corresponding Q-learning algorithms:
$$
\frac{\partial Q^{\psi}_{\Delta t}}{\partial\psi}(t, x, a; w) =
\begin{pmatrix}
\frac{1}{2}e^{-\psi_3(T-t)}e^{-\psi_1}a^2\\
-e^{-\psi_3(T-t)}a(x-w)\\
(T-t)e^{-\psi_3(T-t)}\big[(x-w)^2 + \psi_2 a(x-w) + \frac{1}{2}e^{-\psi_1}a^2\big]\\
t^2 - T^2\\
t - T
\end{pmatrix}.
$$

The parameters in the Q-function can then be updated based on (55) and the Lagrange
multiplier can be updated similarly as in Algorithm 5.
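A short sketch of this parameterization in code may help clarify how the Q-function and its induced Gaussian policy are evaluated; the function names, argument order, and use of numpy are our own choices.

```python
import numpy as np

def Q_mv(psi, t, x, a, w, T, z):
    """Parameterized Q-function for the mean-variance problem, as given above."""
    psi1, psi2, psi3, psi4, psi5 = psi
    quad = (x - w) ** 2 + psi2 * a * (x - w) + 0.5 * np.exp(-psi1) * a ** 2
    return (-np.exp(-psi3 * (T - t)) * quad
            + psi4 * (t ** 2 - T ** 2) + psi5 * (t - T) + (w - z) ** 2)

def sample_action_mv(psi, t, x, w, T, gamma, dt, rng):
    """Sample from the Gaussian policy pi^psi(.|t, x; w) induced by Q_mv."""
    psi1, psi2, psi3 = psi[0], psi[1], psi[2]
    mean = -psi2 * np.exp(psi1) * (x - w)
    var = gamma * dt * np.exp(psi3 * (T - t) + psi1)
    return rng.normal(mean, np.sqrt(var))
```

These two pieces, together with the derivative displayed above, plug into the update (55).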

Appendix B5. Q-learning for ergodic LQ control


For the ergodic LQ control problem formulated in Subsection 7.2, we can prove that the
value function and the Hamiltonian are both quadratic in (x, a); see, e.g., Wang et al.
(2020). Therefore we parameterize the Q-function by a general quadratic function:
$$
Q^{\psi}_{\Delta t}(x, a) = -\frac{e^{-\psi_3}}{2}(a - \psi_1 x - \psi_2)^2 + \psi_4 x^2 + \psi_5 x.
$$


Here we omit the constant term since the value function is unique only up to a constant.
We also use −e−ψ3 < 0 to ensure that the Q-function is concave in a. The optimal value V
is an extra parameter.
The Q-function leads to
$$
\pi^{\psi}(\cdot|x) = \mathcal{N}\big(\psi_1 x + \psi_2,\ \gamma\Delta t\, e^{\psi_3}\big) \propto \exp\Big\{\frac{1}{\gamma\Delta t} Q^{\psi}_{\Delta t}(x, \cdot)\Big\}.
$$
Finally, we compute the following derivative for use in Q-learning algorithms:
$$
\frac{\partial Q^{\psi}_{\Delta t}}{\partial\psi}(x, a) = \Big(e^{-\psi_3}(a - \psi_1 x - \psi_2)x,\ e^{-\psi_3}(a - \psi_1 x - \psi_2),\ \frac{e^{-\psi_3}}{2}(a - \psi_1 x - \psi_2)^2,\ x^2,\ x\Big)^{\top}.
$$
Then the parameters can be updated based on (57).
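Analogously to the sketch for the mean–variance case, this parameterization could be coded as follows (the interface names are ours); the returned gradient matches the vector just displayed and plugs into (57).

```python
import numpy as np

def Q_lq(psi, x, a):
    """Quadratic Q-function parameterization for the ergodic LQ problem."""
    psi1, psi2, psi3, psi4, psi5 = psi
    resid = a - psi1 * x - psi2
    return -0.5 * np.exp(-psi3) * resid ** 2 + psi4 * x ** 2 + psi5 * x

def grad_Q_lq(psi, x, a):
    """Gradient of Q_lq with respect to psi, matching the displayed vector."""
    psi1, psi2, psi3 = psi[0], psi[1], psi[2]
    resid = a - psi1 * x - psi2
    return np.array([np.exp(-psi3) * resid * x,
                     np.exp(-psi3) * resid,
                     0.5 * np.exp(-psi3) * resid ** 2,
                     x ** 2,
                     x])

def sample_action_lq(psi, x, gamma, dt, rng):
    """Sample from the Gaussian policy N(psi1*x + psi2, gamma*dt*exp(psi3))."""
    psi1, psi2, psi3 = psi[0], psi[1], psi[2]
    return rng.normal(psi1 * x + psi2, np.sqrt(gamma * dt * np.exp(psi3)))
```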

Appendix C. Proofs of Statements


Proof of Theorem 2
We first prove a preliminary result about the entropy maximizing density.
Lemma 13 Let γ > 0 and a measurable function q : A → ℝ with ∫_A exp{(1/γ)q(a)}da < ∞ be given. Then
$$
\pi^*(a) = \frac{\exp\{\frac{1}{\gamma}q(a)\}}{\int_{\mathcal{A}}\exp\{\frac{1}{\gamma}q(a)\}\,da} \in \mathcal{P}(\mathcal{A})
$$
is the unique maximizer of the following problem:
$$
\max_{\pi(\cdot)\in\mathcal{P}(\mathcal{A})}\int_{\mathcal{A}}\big[q(a) - \gamma\log\pi(a)\big]\pi(a)\,da. \tag{58}
$$

Proof Set ν = 1 − log ∫_A exp{(1/γ)q(a)}da and consider a new problem:
$$
\max_{\pi(\cdot)\in\mathcal{P}(\mathcal{A})}\int_{\mathcal{A}}\big\{\big[q(a) - \gamma\log\pi(a)\big]\pi(a) + \gamma\nu\pi(a)\big\}\,da - \gamma\nu. \tag{59}
$$
Noting ∫_A π(a)da = 1 due to π(·) ∈ P(A), problems (58) and (59) are equivalent; hence
we focus on problem (59). Relaxing the constraint of being a density function, we deduce
$$
\begin{aligned}
\max_{\pi(\cdot)\in\mathcal{P}(\mathcal{A})}\int_{\mathcal{A}}\big\{\big[q(a) - \gamma\log\pi(a)\big]\pi(a) + \gamma\nu\pi(a)\big\}\,da - \gamma\nu
&\le \max_{\pi(\cdot)>0}\int_{\mathcal{A}}\big\{\big[q(a) - \gamma\log\pi(a)\big]\pi(a) + \gamma\nu\pi(a)\big\}\,da - \gamma\nu\\
&\le \int_{\mathcal{A}}\max_{\pi>0}\big\{\big[q(a) - \gamma\log\pi\big]\pi + \gamma\nu\pi\big\}\,da - \gamma\nu.
\end{aligned}
\tag{60}
$$
The unique maximizer of the inner optimization on the right-hand side of the above is
$$
\exp\Big\{\frac{1}{\gamma}q(a) + \nu - 1\Big\} = \frac{\exp\{\frac{1}{\gamma}q(a)\}}{\int_{\mathcal{A}}\exp\{\frac{1}{\gamma}q(a)\}\,da} =: \pi^*(a).
$$

As π ∗ ∈ P(A), it is the optimal solution to (59) or, equivalently, (58). The uniqueness
follows from that of the inner optimization in (60).
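As a quick numerical sanity check of Lemma 13 (purely illustrative; the discretized action grid and the test function q below are our own choices), one can verify that the Gibbs density attains a larger value of the objective in (58) than randomly perturbed densities:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5
actions = np.linspace(-2.0, 2.0, 201)       # discretized action set A
da = actions[1] - actions[0]
q = np.sin(actions) - actions ** 2          # an arbitrary bounded test function q(.)

def objective(pi):
    """Discretization of int_A [q(a) - gamma*log(pi(a))] pi(a) da from (58)."""
    return np.sum((q - gamma * np.log(pi)) * pi) * da

# Gibbs density from Lemma 13
gibbs = np.exp(q / gamma)
gibbs /= np.sum(gibbs) * da

best_perturbed = -np.inf
for _ in range(1000):
    pert = gibbs * rng.uniform(0.5, 1.5, size=actions.size)
    pert /= np.sum(pert) * da               # renormalize to a density
    best_perturbed = max(best_perturbed, objective(pert))

print("Gibbs objective      :", objective(gibbs))
print("best perturbed value :", best_perturbed)   # expected to be strictly smaller
```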


Now we are ready to prove Theorem 2. To ease notation, we use the q-function q even
though it had not been introduced prior to Theorem 2 in the main text. We also employ
the results of Theorem 6, whose proof relies on neither Theorem 2 nor its proof here.
Recall from (16) that q(t, x, a; π) is defined as
$$
q(t, x, a; \pi) = \frac{\partial J}{\partial t}(t, x; \pi) + H\Big(t, x, a, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\Big) - \beta J(t, x; \pi),
$$
with the Hamiltonian H defined in (3).

Proof [Proof of Theorem 2]

We work on the equivalent formulation (8)–(9). For two given admissible policies π and
π′, and any 0 ≤ t ≤ T, apply Itô's lemma to the process e^{−βs} J(s, X̃^{π′}_s; π) for s ∈ [t, T],
which is the value function under π but evaluated along the state process under π′:
$$
\begin{aligned}
&e^{-\beta T} J(T, \tilde{X}^{\pi'}_T; \pi) - e^{-\beta t} J(t, \tilde{X}^{\pi'}_t; \pi) + \int_t^T\!\!\int_{\mathcal{A}} e^{-\beta s}\big[r(s, \tilde{X}^{\pi'}_s, a) - \gamma\log\pi'(a|s, \tilde{X}^{\pi'}_s)\big]\pi'(a|s, \tilde{X}^{\pi'}_s)\,da\,ds\\
&= \int_t^T\!\!\int_{\mathcal{A}} e^{-\beta s}\Big[\frac{\partial J}{\partial t}(s, \tilde{X}^{\pi'}_s; \pi) + H\Big(s, \tilde{X}^{\pi'}_s, a, \frac{\partial J}{\partial x}(s, \tilde{X}^{\pi'}_s; \pi), \frac{\partial^2 J}{\partial x^2}(s, \tilde{X}^{\pi'}_s; \pi)\Big) - \beta J(s, \tilde{X}^{\pi'}_s; \pi)\\
&\qquad\qquad - \gamma\log\pi'(a|s, \tilde{X}^{\pi'}_s)\Big]\pi'(a|s, \tilde{X}^{\pi'}_s)\,da\,ds + \int_t^T e^{-\beta s}\frac{\partial}{\partial x}J(s, \tilde{X}^{\pi'}_s; \pi)\circ\tilde{\sigma}\big(s, \tilde{X}^{\pi'}_s, \pi'(\cdot|s, \tilde{X}^{\pi'}_s)\big)\,dW_s\\
&= \int_t^T\!\!\int_{\mathcal{A}} e^{-\beta s}\big[q(s, \tilde{X}^{\pi'}_s, a; \pi) - \gamma\log\pi'(a|s, \tilde{X}^{\pi'}_s)\big]\pi'(a|s, \tilde{X}^{\pi'}_s)\,da\,ds\\
&\qquad + \int_t^T e^{-\beta s}\frac{\partial}{\partial x}J(s, \tilde{X}^{\pi'}_s; \pi)\circ\tilde{\sigma}\big(s, \tilde{X}^{\pi'}_s, \pi'(\cdot|s, \tilde{X}^{\pi'}_s)\big)\,dW_s.
\end{aligned}
$$

Because π′ = Iπ, it follows from Lemma 13 that for any (s, y) ∈ [0, T] × ℝ^d, we have
$$
\int_{\mathcal{A}}\big[q(s, y, a; \pi) - \gamma\log\pi'(a|s, y)\big]\pi'(a|s, y)\,da \ge \int_{\mathcal{A}}\big[q(s, y, a; \pi) - \gamma\log\pi(a|s, y)\big]\pi(a|s, y)\,da = 0,
$$

where the equality is due to (20) in Theorem 6. Thus,
$$
e^{-\beta T} J(T, \tilde{X}^{\pi'}_T; \pi) - e^{-\beta t} J(t, \tilde{X}^{\pi'}_t; \pi) + \int_t^T\!\!\int_{\mathcal{A}} e^{-\beta s}\big[r(s, \tilde{X}^{\pi'}_s, a) - \gamma\log\pi'(a|s, \tilde{X}^{\pi'}_s)\big]\pi'(a|s, \tilde{X}^{\pi'}_s)\,da\,ds
\ge \int_t^T e^{-\beta s}\frac{\partial}{\partial x}J(s, \tilde{X}^{\pi'}_s; \pi)\circ\tilde{\sigma}\big(s, \tilde{X}^{\pi'}_s, \pi'(\cdot|s, \tilde{X}^{\pi'}_s)\big)\,dW_s.
$$
The above argument and the resulting inequalities are also valid when T is replaced by
T ∧ u_n, where u_n = inf{s ≥ t : |X̃^{π′}_s| ≥ n} is a sequence of stopping times. Therefore,
$$
\begin{aligned}
J(t, x; \pi) \le \mathbb{E}^{P^W}\Big[&e^{-\beta(T\wedge u_n - t)} h(\tilde{X}^{\pi'}_{T\wedge u_n})\\
&+ \int_t^{T\wedge u_n}\!\!\int_{\mathcal{A}} e^{-\beta(s-t)}\big[r(s, \tilde{X}^{\pi'}_s, a) - \gamma\log\pi'(a|s, \tilde{X}^{\pi'}_s)\big]\pi'(a|s, \tilde{X}^{\pi'}_s)\,da\,ds \,\Big|\, \tilde{X}^{\pi'}_t = x\Big].
\end{aligned}
\tag{61}
$$


It follows from Assumption 1-(iv), Definition 1-(iii) and the moment estimate in Jia and
Zhou (2022b, Lemma 2) that there exist constants µ, C > 0 such that
$$
\begin{aligned}
&\mathbb{E}^{P^W}\Big[\int_t^{T\wedge u_n}\!\!\int_{\mathcal{A}} e^{-\beta(s-t)}\big[r(s, \tilde{X}^{\pi'}_s, a) - \gamma\log\pi'(a|s, \tilde{X}^{\pi'}_s)\big]\pi'(a|s, \tilde{X}^{\pi'}_s)\,da\,ds \,\Big|\, \tilde{X}^{\pi'}_t = x\Big]\\
&\le \mathbb{E}^{P^W}\Big[\int_t^{T}\!\!\int_{\mathcal{A}} \big|r(s, \tilde{X}^{\pi'}_s, a) - \gamma\log\pi'(a|s, \tilde{X}^{\pi'}_s)\big|\pi'(a|s, \tilde{X}^{\pi'}_s)\,da\,ds \,\Big|\, \tilde{X}^{\pi'}_t = x\Big]\\
&\le C\int_t^{T}\mathbb{E}^{P^W}\big[(1 + |\tilde{X}^{\pi'}_s|^{\mu}) \,\big|\, \tilde{X}^{\pi'}_t = x\big]\,ds \le C(1 + |x|^{\mu}).
\end{aligned}
$$

By the dominated convergence theorem, we have, as n → ∞,
$$
\begin{aligned}
&\mathbb{E}^{P^W}\Big[\int_t^{T\wedge u_n}\!\!\int_{\mathcal{A}} e^{-\beta(s-t)}\big[r(s, \tilde{X}^{\pi'}_s, a) - \gamma\log\pi'(a|s, \tilde{X}^{\pi'}_s)\big]\pi'(a|s, \tilde{X}^{\pi'}_s)\,da\,ds \,\Big|\, \tilde{X}^{\pi'}_t = x\Big]\\
&\to \mathbb{E}^{P^W}\Big[\int_t^{T}\!\!\int_{\mathcal{A}} e^{-\beta(s-t)}\big[r(s, \tilde{X}^{\pi'}_s, a) - \gamma\log\pi'(a|s, \tilde{X}^{\pi'}_s)\big]\pi'(a|s, \tilde{X}^{\pi'}_s)\,da\,ds \,\Big|\, \tilde{X}^{\pi'}_t = x\Big].
\end{aligned}
$$

In addition,
$$
\begin{aligned}
&\mathbb{E}^{P^W}\big[e^{-\beta(T\wedge u_n - t)} h(\tilde{X}^{\pi'}_{T\wedge u_n}) \,\big|\, \tilde{X}^{\pi'}_t = x\big]\\
&\le \mathbb{E}^{P^W}\big[h(\tilde{X}^{\pi'}_{T})\mathbf{1}_{\{\max_{t\le s\le T}|\tilde{X}^{\pi'}_{s}| < n\}} \,\big|\, \tilde{X}^{\pi'}_t = x\big] + \mathbb{E}^{P^W}\big[h(\tilde{X}^{\pi'}_{u_n})\mathbf{1}_{\{\max_{t\le s\le T}|\tilde{X}^{\pi'}_{s}| \ge n\}} \,\big|\, \tilde{X}^{\pi'}_t = x\big]\\
&\le C\,\mathbb{E}^{P^W}\big[(1 + |\tilde{X}^{\pi'}_{T}|^{\mu}) \,\big|\, \tilde{X}^{\pi'}_t = x\big] + C(1 + n^{\mu})\,\mathbb{E}^{P^W}\big[\mathbf{1}_{\{\max_{t\le s\le T}|\tilde{X}^{\pi'}_{s}| \ge n\}} \,\big|\, \tilde{X}^{\pi'}_t = x\big]\\
&\le C(1 + |x|^{\mu}) + C(1 + n^{\mu})\,\frac{\mathbb{E}^{P^W}\big[\max_{t\le s\le T}|\tilde{X}^{\pi'}_{s}|^{\mu+1} \,\big|\, \tilde{X}^{\pi'}_t = x\big]}{n^{\mu+1}}\\
&\le C(1 + |x|^{\mu}) + \frac{C(1 + n^{\mu})(1 + |x|^{\mu+1})}{n^{\mu+1}} < \infty.
\end{aligned}
$$
Again, by the dominated convergence theorem, we have, as n → ∞,
$$
\mathbb{E}^{P^W}\big[e^{-\beta(T\wedge u_n - t)} h(\tilde{X}^{\pi'}_{T\wedge u_n}) \,\big|\, \tilde{X}^{\pi'}_t = x\big] \to \mathbb{E}^{P^W}\big[e^{-\beta(T - t)} h(\tilde{X}^{\pi'}_{T}) \,\big|\, \tilde{X}^{\pi'}_t = x\big].
$$

Hence, sending n → ∞, we conclude from (61) that J(t, x; π) ≤ J(t, x; π ′ ). This proves
that π ′ = Iπ improves upon π. Moreover, if Iπ ≡ π ′ = π, then J(t, x; π) ≡ J(t, x; π ′ ) =:
J*(t, x), which satisfies the PDE (11). However, Lemma 13 shows that
$$
\begin{aligned}
&\int_{\mathcal{A}}\Big[H\Big(t, x, a, \frac{\partial J^*}{\partial x}(t, x), \frac{\partial^2 J^*}{\partial x^2}(t, x)\Big) - \gamma\log\pi'(a|t, x)\Big]\pi'(a|t, x)\,da\\
&= \sup_{\pi\in\mathcal{P}(\mathcal{A})}\int_{\mathcal{A}}\Big[H\Big(t, x, a, \frac{\partial J^*}{\partial x}(t, x), \frac{\partial^2 J^*}{\partial x^2}(t, x)\Big) - \gamma\log\pi(a)\Big]\pi(a)\,da.
\end{aligned}
$$

This means that J ∗ also satisfies the HJB equation (12), implying that J ∗ is the optimal
value function and hence π is the optimal policy.


Proof of Proposition 3

Proof Breaking the full time period [t, T] into the subperiods [t, t + ∆t) and [t + ∆t, T], while
conditioning on the state at t + ∆t for the second subperiod, we obtain:
$$
\begin{aligned}
&Q_{\Delta t}(t, x, a; \pi)\\
&= \mathbb{E}^{P}\Big[\int_t^{t+\Delta t} e^{-\beta(s-t)} r(s, X^a_s, a)\,ds\\
&\qquad + \mathbb{E}^{P}\Big[\int_{t+\Delta t}^{T} e^{-\beta(s-t)}\big[r(s, X^{\pi}_s, a^{\pi}_s) - \gamma\log\pi(a^{\pi}_s|s, X^{\pi}_s)\big]ds + e^{-\beta(T-t)} h(X^{\pi}_T)\,\Big|\, X^a_{t+\Delta t}\Big] \,\Big|\, X^{\tilde{\pi}}_t = x\Big]\\
&= \mathbb{E}^{P^W}\Big[\int_t^{t+\Delta t} e^{-\beta(s-t)} r(s, X^a_s, a)\,ds + e^{-\beta\Delta t} J(t+\Delta t, X^a_{t+\Delta t}; \pi) \,\Big|\, X^{\tilde{\pi}}_t = x\Big]\\
&= \mathbb{E}^{P^W}\Big[\int_t^{t+\Delta t} e^{-\beta(s-t)} r(s, X^a_s, a)\,ds + e^{-\beta\Delta t} J(t+\Delta t, X^a_{t+\Delta t}; \pi) - J(t, x; \pi) \,\Big|\, X^{\tilde{\pi}}_t = x\Big] + J(t, x; \pi)\\
&= \mathbb{E}^{P^W}\Big[\int_t^{t+\Delta t} e^{-\beta(s-t)}\Big(\frac{\partial J}{\partial t}(s, X^a_s; \pi) + H\Big(s, X^a_s, a, \frac{\partial J}{\partial x}(s, X^a_s; \pi), \frac{\partial^2 J}{\partial x^2}(s, X^a_s; \pi)\Big)\\
&\qquad\qquad\qquad - \beta J(s, X^a_s; \pi)\Big)ds \,\Big|\, X^{\tilde{\pi}}_t = x\Big] + J(t, x; \pi)\\
&= J(t, x; \pi) + \Big[\frac{\partial J}{\partial t}(t, x; \pi) + H\Big(t, x, a, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\Big) - \beta J(t, x; \pi)\Big]\Delta t + o(\Delta t),
\end{aligned}
\tag{62}
$$
where the second-to-last equality follows from Itô's lemma, and the last equality and the
error order are due to the approximation of the integral involved.

Proof of Corollary 5

Proof This directly follows from the results of Proposition 3.

Proof of Theorem 6

Proof First, (20) follows readily from its definition in Definition 4, the Feynman–Kac
formula (11), and the fact that ∂J/∂t(t, x; π) and βJ(t, x; π) both do not depend on the action a.


(i) We now focus on (18). Applying Itô's lemma to the process e^{−βs} J(s, X^π_s; π), we obtain,
for 0 ≤ t < s ≤ T:
$$
\begin{aligned}
&e^{-\beta s} J(s, X^{\pi}_s; \pi) - e^{-\beta t} J(t, x; \pi) + \int_t^s e^{-\beta u}\big[r(u, X^{\pi}_u, a^{\pi}_u) - \hat{q}(u, X^{\pi}_u, a^{\pi}_u)\big]du\\
&= \int_t^s e^{-\beta u}\Big[\frac{\partial J}{\partial t}(u, X^{\pi}_u; \pi) + H\Big(u, X^{\pi}_u, a^{\pi}_u, \frac{\partial J}{\partial x}(u, X^{\pi}_u; \pi), \frac{\partial^2 J}{\partial x^2}(u, X^{\pi}_u; \pi)\Big) - \beta J(u, X^{\pi}_u; \pi)\\
&\qquad\qquad - \hat{q}(u, X^{\pi}_u, a^{\pi}_u)\Big]du + \int_t^s e^{-\beta u}\frac{\partial}{\partial x}J(u, X^{\pi}_u; \pi)\circ\sigma(u, X^{\pi}_u, a^{\pi}_u)\,dW_u\\
&= \int_t^s e^{-\beta u}\big[q(u, X^{\pi}_u, a^{\pi}_u; \pi) - \hat{q}(u, X^{\pi}_u, a^{\pi}_u)\big]du + \int_t^s e^{-\beta u}\frac{\partial}{\partial x}J(u, X^{\pi}_u; \pi)\circ\sigma(u, X^{\pi}_u, a^{\pi}_u)\,dW_u.
\end{aligned}
$$

Recall that {a^π_s, t ≤ s ≤ T} is {F_s}_{s≥0}-progressively measurable. So, if q̂ ≡ q(·, ·, ·; π),
then the above process, and hence (18), is an ({F_s}_{s≥0}, P)-martingale on [t, T].
Conversely, if the right-hand side of the above is a martingale, then, because its second
term is a local martingale, we have that ∫_t^s e^{−βu}[q(u, X^π_u, a^π_u; π) − q̂(u, X^π_u, a^π_u)]du is
a continuous local martingale with finite variation and hence zero quadratic variation.
Therefore, P-almost surely, ∫_t^s e^{−βu}[q(u, X^π_u, a^π_u; π) − q̂(u, X^π_u, a^π_u)]du = 0 for all s ∈
[t, T]; see, e.g., (Karatzas and Shreve, 2014, Chapter 1, Exercise 5.21).
Denote f(t, x, a) = q(t, x, a; π) − q̂(t, x, a). Then f is a continuous function that maps
[0, T] × ℝ^d × A to ℝ. Suppose the desired conclusion is not true; then there exist a triple
(t*, x*, a*) and ϵ > 0 such that f(t*, x*, a*) > ϵ. Because f is continuous, there exists
δ > 0 such that f(u, x′, a′) > ϵ/2 for all (u, x′, a′) with |u − t*| ∨ |x′ − x*| ∨ |a′ − a*| < δ.
Here "∨" means taking the larger one, i.e., u ∨ v = max{u, v}.
Now consider the state process, still denoted by X^π, starting from (t*, x*, a*); namely,
{X^π_s, t* ≤ s ≤ T} follows (6) with X^π_{t*} = x* and a^π_{t*} = a*. Define
$$
\tau = \inf\{u \ge t^* : |u - t^*| > \delta \ \text{or}\ |X^{\pi}_u - x^*| > \delta\} = \inf\{u \ge t^* : |X^{\pi}_u - x^*| > \delta\} \wedge (t^* + \delta).
$$
The continuity of X^π implies that τ > t*, P-almost surely. Here "∧" means taking
the smaller one, i.e., u ∧ v = min{u, v}.

We have already proved that there exists Ω_0 ∈ F with P(Ω_0) = 0 such that for all
ω ∈ Ω \ Ω_0, ∫_{t*}^{s} e^{−βu} f(u, X^π_u(ω), a^π_u(ω))du = 0 for all s ∈ [t*, T]. It follows from
Lebesgue's differentiation theorem that for any ω ∈ Ω \ Ω_0,
$$
f(s, X^{\pi}_s(\omega), a^{\pi}_s(\omega)) = 0, \quad \text{a.e. } s \in [t^*, \tau(\omega)].
$$
Consider the set Z(ω) = {s ∈ [t*, τ(ω)] : a^π_s(ω) ∈ B_δ(a*)} ⊂ [t*, τ(ω)], where B_δ(a*) =
{a′ ∈ A : |a′ − a*| < δ} is the neighborhood of a*. Because f(s, X^π_s(ω), a^π_s(ω)) > ϵ/2
when s ∈ Z(ω), we conclude that Z(ω) has Lebesgue measure zero for any ω ∈ Ω \ Ω_0.
That is,
$$
\int_{[t^*, T]} \mathbf{1}_{\{s\le\tau(\omega)\}}\mathbf{1}_{\{a^{\pi}_s(\omega)\in B_{\delta}(a^*)\}}\,ds = 0.
$$


Integrating ω with respect to P and applying Fubini's theorem, we obtain
$$
\begin{aligned}
0 &= \int_{\Omega}\int_{[t^*, T]} \mathbf{1}_{\{s\le\tau(\omega)\}}\mathbf{1}_{\{a^{\pi}_s(\omega)\in B_{\delta}(a^*)\}}\,ds\,P(d\omega) = \int_{[t^*, T]} \mathbb{E}\big[\mathbf{1}_{\{s\le\tau\}}\mathbf{1}_{\{a^{\pi}_s\in B_{\delta}(a^*)\}}\big]\,ds\\
&= \int_{t^*}^{T} \mathbb{E}\big[\mathbf{1}_{\{s\le\tau\}}\, P\big(a^{\pi}_s\in B_{\delta}(a^*)\,\big|\,\mathcal{F}_s\big)\big]\,ds = \int_{t^*}^{T} \mathbb{E}\Big[\mathbf{1}_{\{s\le\tau\}}\int_{B_{\delta}(a^*)}\pi(a|s, X^{\pi}_s)\,da\Big]\,ds\\
&\ge \min_{|x'-x^*|<\delta,\,|u-t^*|<\delta}\Big\{\int_{B_{\delta}(a^*)}\pi(a|u, x')\,da\Big\}\,\mathbb{E}\Big[\int_{t^*}^{T}\mathbf{1}_{\{s\le\tau\}}\,ds\Big]\\
&= \min_{|x'-x^*|<\delta,\,|u-t^*|<\delta}\Big\{\int_{B_{\delta}(a^*)}\pi(a|u, x')\,da\Big\}\,\mathbb{E}\big[(\tau\wedge T) - t^*\big] \ge 0.
\end{aligned}
$$
Since τ > t* P-almost surely, the above implies min_{|x′−x*|<δ, |u−t*|<δ} {∫_{B_δ(a*)} π(a|u, x′)da} = 0.
However, this contradicts Definition 1 about an admissible policy. Indeed, Definition 1-(i)
stipulates supp π(·|t, x) = A for any (t, x); hence ∫_{B_δ(a*)} π(a|t, x)da > 0.
Then the continuity in Definition 1-(ii) yields min_{|x′−x*|<δ, |u−t*|<δ} {∫_{B_δ(a*)} π(a|u, x′)da} >
0, a contradiction. Hence we conclude q(t, x, a; π) = q̂(t, x, a) for every (t, x, a).

(ii) Applying Itô's lemma to e^{−βs} J(s, X^{π′}_s; π), we get
$$
\begin{aligned}
&e^{-\beta s} J(s, X^{\pi'}_s; \pi) - e^{-\beta t} J(t, x; \pi) + \int_t^s e^{-\beta u}\big[r(u, X^{\pi'}_u, a^{\pi'}_u) - \hat{q}(u, X^{\pi'}_u, a^{\pi'}_u)\big]du\\
&= \int_t^s e^{-\beta u}\big[q(u, X^{\pi'}_u, a^{\pi'}_u; \pi) - \hat{q}(u, X^{\pi'}_u, a^{\pi'}_u)\big]du + \int_t^s e^{-\beta u}\frac{\partial}{\partial x}J(u, X^{\pi'}_u; \pi)\circ\sigma(u, X^{\pi'}_u, a^{\pi'}_u)\,dW_u.
\end{aligned}
$$
So, when q̂ ≡ q(·, ·, ·; π), the above process is an ({F_s}_{s≥0}, P)-martingale on [t, T].
(iii) Let π′ ∈ Π be given satisfying the assumption in this part. It then follows from (ii)
that ∫_t^s e^{−βu}[q̂(u, X^{π′}_u, a^{π′}_u) − q(u, X^{π′}_u, a^{π′}_u; π)]du is an ({F_s}_{s≥0}, P)-martingale. If the
desired conclusion is not true, then the same argument as in (i) still applies to conclude
that min_{|x′−x*|<δ, |u−t*|<δ} {∫_{B_δ(a*)} π′(a|u, x′)da} = 0, which is a contradiction because
π′ is admissible.

Proof of Theorem 7
Proof
(i) We only prove the "only if" part because the "if" part is straightforward following the
same argument as in the proof of Theorem 6.
So, we assume that e^{−βt} Ĵ(t, X^π_t) + ∫_0^t e^{−βs}[r(s, X^π_s, a^π_s) − q̂(s, X^π_s, a^π_s)]ds is an ({F_s}_{s≥0}, P)-
martingale. Hence for any initial state (t, x), we have
$$
\mathbb{E}^{P}\Big[e^{-\beta T}\hat{J}(T, X^{\pi}_T) + \int_t^T e^{-\beta s}\big[r(s, X^{\pi}_s, a^{\pi}_s) - \hat{q}(s, X^{\pi}_s, a^{\pi}_s)\big]ds \,\Big|\, \mathcal{F}_t\Big] = e^{-\beta t}\hat{J}(t, x).
$$


We integrate over the action randomization with respect to the policy π, and then obtain
$$
\mathbb{E}^{P^W}\Big[e^{-\beta T}\hat{J}(T, \tilde{X}^{\pi}_T) + \int_t^T\!\!\int_{\mathcal{A}} e^{-\beta s}\big[r(s, \tilde{X}^{\pi}_s, a) - \hat{q}(s, \tilde{X}^{\pi}_s, a)\big]\pi(a|s, \tilde{X}^{\pi}_s)\,da\,ds \,\Big|\, \mathcal{F}^W_t\Big] = e^{-\beta t}\hat{J}(t, x).
$$
This, together with the terminal condition Ĵ(T, x) = h(x) and constraint (21), yields that
$$
\hat{J}(t, x) = \mathbb{E}^{P^W}\Big[e^{-\beta(T-t)} h(\tilde{X}^{\pi}_T) + \int_t^T\!\!\int_{\mathcal{A}} e^{-\beta(s-t)}\big[r(s, \tilde{X}^{\pi}_s, a) - \gamma\log\pi(a|s, \tilde{X}^{\pi}_s)\big]\pi(a|s, \tilde{X}^{\pi}_s)\,da\,ds \,\Big|\, \mathcal{F}^W_t\Big].
$$
Hence Ĵ(t, x) = J(t, x; π) for all (t, x) ∈ [0, T] × ℝ^d. Furthermore, based on Theorem 6,
the martingale condition implies that q̂(t, x, a) = q(t, x, a; π) for all (t, x, a) ∈ [0, T] ×
ℝ^d × A.
(ii) This follows immediately from Theorem 6-(ii).
(iii) Let π′ ∈ Π be given satisfying the assumption in this part. Define
$$
\hat{r}(t, x, a) := \frac{\partial\hat{J}}{\partial t}(t, x) + b(t, x, a)\circ\frac{\partial\hat{J}}{\partial x}(t, x) + \frac{1}{2}\sigma\sigma^{\top}(t, x, a)\circ\frac{\partial^2\hat{J}}{\partial x^2}(t, x) - \beta\hat{J}(t, x).
$$
Then
$$
e^{-\beta s}\hat{J}(s, X^{\pi'}_s) - \int_t^s e^{-\beta u}\hat{r}(u, X^{\pi'}_u, a^{\pi'}_u)\,du
$$
is an ({F_s}_{s≥0}, P)-local martingale, which follows from applying Itô's lemma to the
above process and arguing similarly to the proof of Theorem 6-(ii). As a result,
∫_t^s e^{−βu}[r(u, X^{π′}_u, a^{π′}_u) − q̂(u, X^{π′}_u, a^{π′}_u) + r̂(u, X^{π′}_u, a^{π′}_u)]du is an ({F_s}_{s≥0}, P)-martingale.
The same argument as in the proof of Theorem 6 applies and we conclude
$$
\begin{aligned}
\hat{q}(t, x, a) &= r(t, x, a) + \hat{r}(t, x, a)\\
&= \frac{\partial\hat{J}}{\partial t}(t, x) + b(t, x, a)\circ\frac{\partial\hat{J}}{\partial x}(t, x) + \frac{1}{2}\sigma\sigma^{\top}(t, x, a)\circ\frac{\partial^2\hat{J}}{\partial x^2}(t, x) - \beta\hat{J}(t, x) + r(t, x, a)\\
&= \frac{\partial\hat{J}}{\partial t}(t, x) + H\Big(t, x, a, \frac{\partial\hat{J}}{\partial x}(t, x), \frac{\partial^2\hat{J}}{\partial x^2}(t, x)\Big) - \beta\hat{J}(t, x)
\end{aligned}
$$
for every (t, x, a). Now the constraint (21) reads
$$
\int_{\mathcal{A}}\Big[\frac{\partial\hat{J}}{\partial t}(t, x) + H\Big(t, x, a, \frac{\partial\hat{J}}{\partial x}(t, x), \frac{\partial^2\hat{J}}{\partial x^2}(t, x)\Big) - \beta\hat{J}(t, x) - \gamma\log\pi(a|t, x)\Big]\pi(a|t, x)\,da = 0
$$
for all (t, x), which, together with the terminal condition Ĵ(T, x) = h(x), is the
Feynman–Kac PDE (11) for Ĵ. Therefore, it follows from the uniqueness of the
solution to (11) that Ĵ ≡ J(·, ·; π). Moreover, it follows from Theorem 6-(iii) that
q̂ ≡ q(·, ·, ·; π).
Finally, if π(a|t, x) = exp{(1/γ)q̂(t, x, a)} / ∫_A exp{(1/γ)q̂(t, x, a)}da, then π = Iπ, where I is the map defined in
Theorem 2. This in turn implies that π is the optimal policy and, hence, Ĵ is the optimal value
function.


Proof of Proposition 8

Proof Divide both sides of (24) by γ, take exponentials, and then integrate over A. Comparing
the resulting equation with the exploratory HJB equation (14), we get
$$
\gamma\log\int_{\mathcal{A}}\exp\Big\{\frac{1}{\gamma}q^*(t, x, a)\Big\}\,da = 0.
$$
This yields (25). The expression (26) then follows from (13).

Proof of Theorem 9

Proof

(i) Let Ĵ* = J* and q̂* = q* be the optimal value function and the optimal q-function,
respectively. For any π ∈ Π, applying Itô's lemma to the process e^{−βs} J*(s, X^π_s), we
obtain, for 0 ≤ t < s ≤ T:
$$
\begin{aligned}
&e^{-\beta s} J^*(s, X^{\pi}_s) - e^{-\beta t} J^*(t, x) + \int_t^s e^{-\beta u}\big[r(u, X^{\pi}_u, a^{\pi}_u) - q^*(u, X^{\pi}_u, a^{\pi}_u)\big]du\\
&= \int_t^s e^{-\beta u}\Big[\frac{\partial J^*}{\partial t}(u, X^{\pi}_u) + H\Big(u, X^{\pi}_u, a^{\pi}_u, \frac{\partial J^*}{\partial x}(u, X^{\pi}_u), \frac{\partial^2 J^*}{\partial x^2}(u, X^{\pi}_u)\Big) - \beta J^*(u, X^{\pi}_u)\\
&\qquad\qquad - q^*(u, X^{\pi}_u, a^{\pi}_u)\Big]du + \int_t^s e^{-\beta u}\frac{\partial}{\partial x}J^*(u, X^{\pi}_u)\circ\sigma(u, X^{\pi}_u, a^{\pi}_u)\,dW_u\\
&= \int_t^s e^{-\beta u}\frac{\partial}{\partial x}J^*(u, X^{\pi}_u)\circ\sigma(u, X^{\pi}_u, a^{\pi}_u)\,dW_u,
\end{aligned}
$$
where the ∫⋯du term vanishes due to the definition of q* in (24). Hence (28) is a
martingale.

(ii) The second constraint in (27) implies that π̂*(a|t, x) := exp{(1/γ)q̂*(t, x, a)} is a probability
density function, and q̂*(t, x, a) = γ log π̂*(a|t, x). So q̂*(t, x, a) satisfies the
second constraint in (21) with respect to the policy π̂*. When (28) is an ({F_s}_{s≥0}, P)-martingale
under the given admissible policy π, it follows from Theorem 7-(iii) that
Ĵ* and q̂* are respectively the value function and the q-function associated with π̂*.
Then the improved policy is
$$
\mathcal{I}\hat{\pi}^*(a|t, x) := \frac{\exp\{\frac{1}{\gamma}\hat{q}^*(t, x, a)\}}{\int_{\mathcal{A}}\exp\{\frac{1}{\gamma}\hat{q}^*(t, x, a)\}\,da} = \exp\Big\{\frac{1}{\gamma}\hat{q}^*(t, x, a)\Big\} = \hat{\pi}^*(a|t, x).
$$
Hence, Theorem 2 yields that π̂* is optimal, completing the proof.


Proof of Theorem 10
Proof Let us compute the KL-divergence:
$$
\begin{aligned}
&\int_{\mathcal{A}}\Big[\log\pi'(a|t, x) - \frac{1}{\gamma}H\Big(t, x, a, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\Big)\Big]\pi'(a|t, x)\,da\\
&= D_{\mathrm{KL}}\Big(\pi'(\cdot|t, x)\,\Big\|\,\exp\Big\{\frac{1}{\gamma}H\Big(t, x, \cdot, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\Big)\Big\}\Big)\\
&\le D_{\mathrm{KL}}\Big(\pi(\cdot|t, x)\,\Big\|\,\exp\Big\{\frac{1}{\gamma}H\Big(t, x, \cdot, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\Big)\Big\}\Big)\\
&= \int_{\mathcal{A}}\Big[\log\pi(a|t, x) - \frac{1}{\gamma}H\Big(t, x, a, \frac{\partial J}{\partial x}(t, x; \pi), \frac{\partial^2 J}{\partial x^2}(t, x; \pi)\Big)\Big]\pi(a|t, x)\,da.
\end{aligned}
$$
Hence ∫_A [q(t, x, a; π) − γ log π′(a|t, x)]π′(a|t, x)da ≥ ∫_A [q(t, x, a; π) − γ log π(a|t, x)]π(a|t, x)da.
Following the same argument as in the proof of Theorem 2, we conclude J(t, x; π′) ≥
J(t, x; π).

Proof of Lemma 11
Proof The first statement has been proved in Jia and Zhou (2022b, Lemma 7). We focus
on the proof of the second statement, which is similar to that of Theorem 2.
For the two admissible policies π and π′, we apply Itô's lemma to the value function
under π along the process X̃^{π′} to obtain
$$
\begin{aligned}
&J(\tilde{X}^{\pi'}_u; \pi) - J(\tilde{X}^{\pi'}_t; \pi) + \int_t^u\!\!\int_{\mathcal{A}}\big[r(\tilde{X}^{\pi'}_s, a) - \gamma\log\pi'(a|\tilde{X}^{\pi'}_s)\big]\pi'(a|\tilde{X}^{\pi'}_s)\,da\,ds\\
&= \int_t^u\!\!\int_{\mathcal{A}}\Big[H\Big(\tilde{X}^{\pi'}_s, a, \frac{\partial J}{\partial x}(\tilde{X}^{\pi'}_s; \pi), \frac{\partial^2 J}{\partial x^2}(\tilde{X}^{\pi'}_s; \pi)\Big) - \gamma\log\pi'(a|\tilde{X}^{\pi'}_s)\Big]\pi'(a|\tilde{X}^{\pi'}_s)\,da\,ds\\
&\qquad + \int_t^u\frac{\partial}{\partial x}J(\tilde{X}^{\pi'}_s; \pi)\circ\tilde{\sigma}\big(\tilde{X}^{\pi'}_s, \pi'(\cdot|\tilde{X}^{\pi'}_s)\big)\,dW_s\\
&\ge \int_t^u\!\!\int_{\mathcal{A}}\Big[H\Big(\tilde{X}^{\pi'}_s, a, \frac{\partial J}{\partial x}(\tilde{X}^{\pi'}_s; \pi), \frac{\partial^2 J}{\partial x^2}(\tilde{X}^{\pi'}_s; \pi)\Big) - \gamma\log\pi(a|\tilde{X}^{\pi'}_s)\Big]\pi(a|\tilde{X}^{\pi'}_s)\,da\,ds\\
&\qquad + \int_t^u\frac{\partial}{\partial x}J(\tilde{X}^{\pi'}_s; \pi)\circ\tilde{\sigma}\big(\tilde{X}^{\pi'}_s, \pi'(\cdot|\tilde{X}^{\pi'}_s)\big)\,dW_s\\
&= V(\pi)(u - t) + \int_t^u\frac{\partial}{\partial x}J(\tilde{X}^{\pi'}_s; \pi)\circ\tilde{\sigma}\big(\tilde{X}^{\pi'}_s, \pi'(\cdot|\tilde{X}^{\pi'}_s)\big)\,dW_s,
\end{aligned}
$$
where the inequality is due to the same argument as in the proof of Theorem 10, and the
last equality is due to the Feynman–Kac formula (38).
Therefore, after a localization argument, we have
$$
\frac{1}{T}\mathbb{E}\Big[J(\tilde{X}^{\pi'}_T; \pi) - J(x; \pi) + \int_0^T\!\!\int_{\mathcal{A}}\big[r(\tilde{X}^{\pi'}_s, a) - \gamma\log\pi'(a|\tilde{X}^{\pi'}_s)\big]\pi'(a|\tilde{X}^{\pi'}_s)\,da\,ds \,\Big|\, \tilde{X}^{\pi'}_0 = x\Big] \ge V(\pi).
$$
Taking the limit T → ∞, we obtain V(π′) ≥ V(π).


Proof of Theorem 12
Proof The proof is parallel to those of Theorems 6 and 7 and is omitted.

References
L. C. Baird. Advantage updating. Technical report, Wright Laboratory, Wright-Patterson Air Force
Base, OH 45433-7301, USA, 1993.

L. C. Baird. Reinforcement learning in continuous time: Advantage updating. In Proceedings


of 1994 IEEE International Conference on Neural Networks (ICNN’94), volume 4, pages
2448–2453. IEEE, 1994.

C. Beck, M. Hutzenthaler, and A. Jentzen. On nonlinear Feynman–Kac formulas for vis-


cosity solutions of semilinear parabolic partial differential equations. Stochastics and
Dynamics, page 2150048, 2021.

G. Brunick and S. Shreve. Mimicking an Itô process by a solution of a stochastic differential


equation. The Annals of Applied Probability, 23(4):1584–1628, 2013.

K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12


(1):219–245, 2000.

Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep re-


inforcement learning for continuous control. In International Conference on Machine
Learning, pages 1329–1338. PMLR, 2016.

J. Fan, Z. Wang, Y. Xie, and Z. Yang. A theoretical analysis of deep Q-learning. In Learning
for Dynamics and Control, pages 486–489. PMLR, 2020.

W. H. Fleming and H. M. Soner. Controlled Markov Processes and Viscosity Solutions,


volume 25. Springer Science & Business Media, 2006.

X. Gao, Z. Q. Xu, and X. Y. Zhou. State-dependent temperature control for Langevin


diffusions. SIAM Journal on Control and Optimization, pages 1–26, 2020.

S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-
based acceleration. In International Conference on Machine Learning, pages 2829–2838.
PMLR, 2016.

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum


entropy deep reinforcement learning with a stochastic actor. In International Conference
on Machine Learning, pages 1861–1870. PMLR, 2018a.

T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu,


A. Gupta, P. Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint
arXiv:1812.05905, 2018b.


X. Han, R. Wang, and X. Y. Zhou. Choquet regularization for reinforcement learning. arXiv
preprint arXiv:2208.08497, 2022.
Y. Jia and X. Y. Zhou. Policy evaluation and temporal-difference learning in continuous
time and space: A martingale approach. Journal of Machine Learning Research, 23(198):
1–55, 2022a.
Y. Jia and X. Y. Zhou. Policy gradient and actor-critic learning in continuous time and
space: Theory and algorithms. Journal of Machine Learning Research, 23(275):1–50,
2022b.
I. Karatzas and S. Shreve. Brownian Motion and Stochastic Calculus, volume 113. Springer,
2014.
J. Kim, J. Shin, and I. Yang. Hamilton-Jacobi deep Q-learning for deterministic continuous-
time systems with Lipschitz continuous controls. Journal of Machine Learning Research,
22(206), 2021.
A. E. Lim and X. Y. Zhou. Mean-variance portfolio selection with random parameters in a
complete market. Mathematics of Operations Research, 27(1):101–120, 2002.
X. Mao. Stochastic Differential Equations and Applications. Woodhead, 2 edition, 2007.
ISBN 978-1-904275-34-3.
F. S. Melo, S. P. Meyn, and M. I. Ribeiro. An analysis of reinforcement learning with
function approximation. In Proceedings of the 25th International Conference on Machine
Learning, pages 664–671, 2008.
S. P. Meyn and R. L. Tweedie. Stability of Markovian processes III: Foster–Lyapunov
criteria for continuous-time processes. Advances in Applied Probability, 25(3):518–548,
1993.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep
reinforcement learning. Nature, 518(7540):529–533, 2015.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and
K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Interna-
tional Conference on Machine Learning, pages 1928–1937. PMLR, 2016.
B. O’Donoghue. Variational Bayesian reinforcement learning with regret bounds. Advances
in Neural Information Processing Systems, 34:28208–28221, 2021.
I. Osband, B. Van Roy, D. J. Russo, and Z. Wen. Deep exploration via randomized value
functions. Journal of Machine Learning Research, 20(124):1–62, 2019.
J. Schulman, X. Chen, and P. Abbeel. Equivalence between policy gradients and soft Q-
learning. arXiv preprint arXiv:1704.06440, 2017.
D. Shah and Q. Xie. Q-learning with nearest neighbors. Advances in Neural Information
Processing Systems, 31, 2018.


Y. Sun. The exact law of large numbers via Fubini extension and characterization of
insurable risks. Journal of Economic Theory, 126(1):31–69, 2006.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA:


MIT Press, 2018.

C. Tallec, L. Blier, and Y. Ollivier. Making deep Q-learning methods robust to time dis-
cretization. In International Conference on Machine Learning, pages 6096–6104. PMLR,
2019.

W. Tang, Y. P. Zhang, and X. Y. Zhou. Exploratory HJB equations and their convergence.
SIAM Journal on Control and Optimization, 60(6):3191–3216, 2022.

M. Uehara, C. Shi, and N. Kallus. A review of off-policy evaluation in reinforcement


learning. arXiv preprint arXiv:2212.06355, 2022.

H. Wang and X. Y. Zhou. Continuous-time mean–variance portfolio selection: A reinforce-


ment learning framework. Mathematical Finance, 30(4):1273–1308, 2020.

H. Wang, T. Zariphopoulou, and X. Y. Zhou. Reinforcement learning in continuous time


and space: A stochastic control approach. Journal of Machine Learning Research, 21
(198):1–34, 2020.

C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.

C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, Cambridge University,


1989.

W. M. Wonham. On the separation theorem of stochastic control. SIAM Journal on Control,


6(2):312–326, 1968.

J. Yong and X. Y. Zhou. Stochastic Controls: Hamiltonian Systems and HJB Equations.
New York, NY: Springer, 1999.

X. Y. Zhou. Curse of optimality, and how do we break it. SSRN preprint SSRN 3845462,
2021.

X. Y. Zhou and D. Li. Continuous-time mean-variance portfolio selection: A stochastic LQ


framework. Applied Mathematics and Optimization, 42(1):19–33, 2000.

X. Y. Zhou and G. Yin. Markowitz’s mean-variance portfolio selection with regime switch-
ing: A continuous-time model. SIAM Journal on Control and Optimization, 42(4):1466–
1482, 2003.

S. Zou, T. Xu, and Y. Liang. Finite-sample analysis for SARSA with linear function
approximation. Advances in Neural Information Processing Systems, 32, 2019.

