2205.14398 Deep Neural Networks Overcome The Curse
1 Institute of Mathematics, University of Kassel, cioica-licht@mathematik.uni-kassel.de
2 Faculty of Mathematics, University of Duisburg-Essen, martin.hutzenthaler@uni-due.de
3 Institute of Mathematics, University of Kassel, twerner@mathematik.uni-kassel.de
Abstract
We prove that deep neural networks are capable of approximating solutions of semilinear Kolmogorov
PDEs in the case of gradient-independent, Lipschitz-continuous nonlinearities, while the required
number of parameters in the networks grows at most polynomially in both the PDE dimension d ∈ N and
the reciprocal of the prescribed accuracy ε. Previously, this had only been proven in the case of semilinear
heat equations.
Contents
1 Introduction
1.1 An Introduction to MLP-approximations
1 Introduction
Deep-learning-based algorithms are employed in a wide range of real-world applications and are nowadays
the standard approach for most machine-learning problems. They are used extensively in face and
speech recognition, fraud detection, function approximation, and the numerical solution of partial differential equations
(PDEs). The latter are utilized to model numerous phenomena in nature, medicine, economics, and
physics. Often, these PDEs are nonlinear and high-dimensional. For instance, in the famous Black-Scholes
model the PDE dimension d ∈ N corresponds to the number of stocks considered in the model. Relaxing
the rather unrealistic assumptions made in the model results in a loss of linearity in the corresponding
Black-Scholes PDE, cf., e.g., [1, 7]. These PDEs, however, cannot in general be solved explicitly and
therefore need to be approximated numerically. Classical grid-based approaches such as finite difference
or finite element methods suffer from the curse of dimensionality in the sense that the computational
effort grows exponentially in the PDE dimension d ∈ N and are therefore not suitable. Among
the most prominent and promising approaches for approximating high-dimensional PDEs are deep-learning
algorithms, which seem to handle high-dimensional problems well, cf., e.g., [2, 3, 5, 11, 19]. However,
compared to the vast range of applications these techniques are successfully applied to nowadays, there
exist only a few theoretical results proving that deep neural networks do not suffer from the curse of
dimensionality in the numerical approximation of partial differential equations, see [8, 15-17, 23-25, 27].
The key contribution of this article is a rigorous proof that deep neural networks do overcome the
curse of dimensionality in the numerical approximation of semilinear partial differential equations, where
the nonlinearity, the coefficients determining the linear part, and the initial condition of the PDEs are
Lipschitz continuous. In particular, Theorem 1.1 below proves that the number of parameters in the
approximating deep neural network grows at most polynomially in both the reciprocal of the prescribed
approximation accuracy ε ∈ (0, 1] and the PDE dimension d ∈ N:
Theorem 1.1. Let ‖·‖ : (∪_{d∈N} R^d) → [0, ∞) and A_d : R^d → R^d, d ∈ N, satisfy for all d ∈ N, x = (x_1, ..., x_d) ∈ R^d that ‖x‖ = (∑_{i=1}^d (x_i)²)^{1/2} and A_d(x) = (max{x_1, 0}, ..., max{x_d, 0}), let ‖·‖_F : (∪_{d∈N} R^{d×d}) → [0, ∞) satisfy for all d ∈ N, A = (a_{i,j})_{i,j∈{1,...,d}} ∈ R^{d×d} that ‖A‖_F = (∑_{i,j=1}^d (a_{i,j})²)^{1/2}, let ⟨·,·⟩ : (∪_{d∈N} R^d × R^d) → R satisfy for all d ∈ N, x = (x_1, ..., x_d), y = (y_1, ..., y_d) ∈ R^d that ⟨x, y⟩ = ∑_{i=1}^d x_i y_i, let N = ∪_{H∈N} ∪_{(k_0,...,k_{H+1})∈N^{H+2}} [∏_{n=1}^{H+1} (R^{k_n×k_{n−1}} × R^{k_n})], let R : N → (∪_{k,l∈N} C(R^k, R^l)), D : N → ∪_{H∈N} N^{H+2}, and P : N → N satisfy for all H ∈ N, k_0, k_1, ..., k_H, k_{H+1} ∈ N, Φ = ((W_1, B_1), ..., (W_{H+1}, B_{H+1})) ∈ ∏_{n=1}^{H+1} (R^{k_n×k_{n−1}} × R^{k_n}), x_0 ∈ R^{k_0}, ..., x_H ∈ R^{k_H} with ∀ n ∈ N ∩ [1, H] : x_n = A_{k_n}(W_n x_{n−1} + B_n) that R(Φ) ∈ C(R^{k_0}, R^{k_{H+1}}), (R(Φ))(x_0) = W_{H+1} x_H + B_{H+1}, D(Φ) = (k_0, k_1, ..., k_{H+1}), and P(Φ) = ∑_{n=1}^{H+1} k_n(k_{n−1} + 1), let c, T ∈ (0, ∞), f ∈ C(R, R), for all d ∈ N let g_d ∈ C(R^d, R), u_d ∈ C^{1,2}([0, T] × R^d, R), μ_d = (μ_{d,i})_{i∈{1,...,d}} ∈ C(R^d, R^d), σ_d = (σ_{d,i,j})_{i,j∈{1,...,d}} ∈ C(R^d, R^{d×d}) satisfy for all t ∈ [0, T], x = (x_1, ..., x_d), y = (y_1, ..., y_d) ∈ R^d that u_d(0, x) = g_d(x),
\[ |f(x_1) - f(y_1)| \le c|x_1 - y_1|, \qquad |u_d(t,x)|^2 \le c\big[d^c + \|x\|^2\big], \tag{3} \]
\[ \tfrac{\partial}{\partial t} u_d(t,x) = \big\langle (\nabla_x u_d)(t,x), \mu_d(x)\big\rangle + \tfrac{1}{2}\operatorname{Tr}\big(\sigma_d(x)[\sigma_d(x)]^{*} (\operatorname{Hess}_x u_d)(t,x)\big) + f(u_d(t,x)), \tag{4} \]
for all d ∈ N, v ∈ R^d, ε ∈ (0, 1] let σ̂_{d,ε} ∈ C(R^d, R^{d×d}), g_{d,ε}, μ̃_{d,ε}, σ̃_{d,ε,v} ∈ N satisfy for all d ∈ N, x, y, v ∈ R^d, ε ∈ (0, 1] that R(g_{d,ε}) ∈ C(R^d, R), R(μ̃_{d,ε}) ∈ C(R^d, R^d), R(σ̃_{d,ε,v}) ∈ C(R^d, R^d), (R(σ̃_{d,ε,v}))(x) = σ̂_{d,ε}(x)v, D(σ̃_{d,ε,v}) = D(σ̃_{d,ε,0}),
\[ |(\mathrm{R}(g_{d,\varepsilon}))(x)|^2 + \max_{i,j\in\{1,\dots,d\}}\big(|[(\mathrm{R}(\tilde{\mu}_{d,\varepsilon}))(0)]_i| + |(\hat{\sigma}_{d,\varepsilon}(0))_{i,j}|\big)^2 \le c\big[d^c + \|x\|^2\big], \tag{5} \]
\[ \max\big\{|g_d(x) - (\mathrm{R}(g_{d,\varepsilon}))(x)|,\ \|\mu_d(x) - (\mathrm{R}(\tilde{\mu}_{d,\varepsilon}))(x)\|,\ \|\sigma_d(x) - \hat{\sigma}_{d,\varepsilon}(x)\|_F\big\} \le \varepsilon c d^c (1 + \|x\|^c), \tag{6} \]
\[ \max\big\{|(\mathrm{R}(g_{d,\varepsilon}))(x) - (\mathrm{R}(g_{d,\varepsilon}))(y)|,\ \|(\mathrm{R}(\tilde{\mu}_{d,\varepsilon}))(x) - (\mathrm{R}(\tilde{\mu}_{d,\varepsilon}))(y)\|,\ \|\hat{\sigma}_{d,\varepsilon}(x) - \hat{\sigma}_{d,\varepsilon}(y)\|_F\big\} \le c\|x - y\|, \tag{7} \]
\[ \max\big\{\mathrm{P}(g_{d,\varepsilon}),\ \mathrm{P}(\tilde{\mu}_{d,\varepsilon}),\ \mathrm{P}(\tilde{\sigma}_{d,\varepsilon,v})\big\} \le c d^c \varepsilon^{-c}. \tag{8} \]
Then there exists η ∈ (0, ∞) such that for all d ∈ N, ε ∈ (0, 1] there exists a Ψ_{d,ε} ∈ N satisfying R(Ψ_{d,ε}) ∈ C(R^d, R), P(Ψ_{d,ε}) ≤ η d^η ε^{−η}, and
\[ \bigg[\int_{[0,1]^d} |u_d(T,x) - (\mathrm{R}(\Psi_{d,\varepsilon}))(x)|^2\,dx\bigg]^{1/2} \le \varepsilon. \tag{9} \]
Theorem 1.1 follows from Corollary 4.2 (applied with u_d(t, x) ← u_d(T − t, x) for t ∈ [0, T], x ∈ R^d in the
notation of Corollary 4.2), which in turn is a special case of Theorem 4.1, the main result of this article.
Let us comment on the mathematical objects appearing in Theorem 1.1. By N in Theorem 1.1 above
we denote the set of all artificial feed-forward deep neural networks (DNNs), by R in Theorem 1.1 above
we denote the operator that maps each DNN to its corresponding function, by P in Theorem 1.1 above
we denote the function that maps a DNN to its number of parameters. For all d ∈ N, the function
A_d : R^d → R^d in Theorem 1.1 above refers to the componentwise applied rectified linear unit (ReLU)
activation function. The functions μ_d : R^d → R^d and σ_d : R^d → R^{d×d} in Theorem 1.1 above specify the
linear part of the PDEs in (4). The functions gd : Rd → R, d ∈ N, describe the initial condition, while the
function f : R → R specifies the nonlinearity of the PDEs in (4) above. The real number T ∈ (0, ∞) de-
scribes the time horizon of the PDEs in (4) in Theorem 1.1 above. The real number c ∈ (0, ∞) in Theorem
1.1 above is used to formulate a growth condition and a Lipschitz continuity condition of the functions
ud : Rd → R, R(µ̃d,ε ) : Rd → Rd , σ̂d,ε : Rd → Rd×d , R(gd,ε ) : Rd → R, d ∈ N, ε ∈ (0, 1], and f : R → R.
Furthermore, the real number c ∈ (0, ∞) in Theorem 1.1 above formulates an approximation condition
between the functions μ_d, σ_d, g_d and R(μ̃_{d,ε}), σ̂_{d,ε}, R(g_{d,ε}), d ∈ N, ε ∈ (0, 1], as well as an upper
bound on the number of parameters of the DNNs g_{d,ε}, μ̃_{d,ε}, σ̃_{d,ε,v}, d ∈ N, v ∈ R^d, ε ∈ (0, 1].
These assumptions guarantee the existence of unique viscosity solutions of the PDEs in (4) in Theorem 1.1
above (see Theorem 4.1 and cf., e.g., [10] for the notion of viscosity solution) and ensure that the functions
μ_d, σ_d, g_d, d ∈ N, can be approximated by DNNs without the curse of dimensionality. The
latter is the key assumption in Theorem 1.1 above; very roughly speaking, it reduces Theorem 1.1 to the
statement that if DNNs are capable of approximating the initial condition and the linear part of the
PDEs in (4) without the curse of dimensionality, then they are also capable of approximating their
solutions without the curse of dimensionality.
Note that the statement of Theorem 1.1 is purely deterministic. The proof of Theorem 4.1, however,
heavily relies on probabilistic tools. It is influenced by the method employed in [23] and is built on the
theory of full history recursive multilevel Picard approximations (MLP), which are numerical approxi-
mation schemes that have been proven to overcome the curse of dimensionality in numerous settings, cf.,
e.g., [6, 12, 13, 21, 22]. In particular, in [22] an MLP approximation scheme is introduced that overcomes
the curse of dimensionality in the numerical approximation of the semilinear PDEs in (4). The central
idea of the proof is, roughly speaking, that these MLP approximations can be well represented by DNNs,
if the coefficients determining the linear part, the initial condition, and the nonlinearity are themselves
represented by DNNs (cf. Lemma 3.10). This explains why we need to assume that the coefficients of the
PDEs in (4) can be approximated by means of DNNs without the curse of dimensionality. In Lemma 2.4
below we establish strong approximation error bounds between solutions of the PDEs in (4) and solutions
of the PDEs whose coefficients are specified by corresponding DNN approximations. In Lemma 2.5 below
we construct an artificial probability space and employ a result from [22] to estimate strong error bounds
for the numerical approximation of the approximating PDEs by means of MLP approximations on that
space. In our main result, Theorem 4.1 below, we combine these results and prove the existence of a DNN
realization on the artificial probability space that achieves the prescribed approximation accuracy ε ∈ (0, 1]
between the solution of the PDEs in (4) and the constructed DNN, which eventually completes the proof
of Theorem 1.1 above.
The remainder of this article is organized as follows: In Section 2 we establish a full error bound for
MLP approximations in which every involved function is approximated by means of DNNs. Section
3 provides a mathematical framework for deep neural networks and establishes the connection between
DNNs and MLP approximations, see Lemma 3.10. In Section 4 we combine the findings of Sections 2 and
3 to prove our main result, Theorem 4.1, as well as Corollary 4.2.
and
\[ u(t,x) = \mathbb{E}\big[g(X_{t,T}^{x})\big] + \int_t^T \mathbb{E}\big[f(u(r, X_{t,r}^{x}))\big]\,dr. \tag{12} \]
The reverse direction does not hold in general, because the function in (12) need not be differentiable;
this would require sufficient smoothness of μ, σ, f, and g. Fortunately, one can prove
under more relaxed conditions that (12) yields a unique, so-called viscosity solution of the PDE in (10),
which is a generalized notion of classical solutions. For an introduction to and a review of the theory of
viscosity solutions we refer to [10]. In particular, every classical solution of (10) is also a viscosity solution
of (10); hence we get a one-to-one correspondence between both representations. Having established this
connection, we may want to employ a numerical approximation of (12) by means of Monte Carlo methods.
The second term on the right-hand side of (12), however, is problematic since we need to know the very
thing that we wish to approximate, u. A similar problem arises in the field of ordinary differential
1 Note that we say that a filtered probability space (Ω, F, P, (F_t)_{t∈[0,T]}) satisfies the usual conditions if and only if it
holds for all t ∈ [0, T) that {A ∈ F : P(A) = 0} ⊆ F_t = (∩_{s∈(t,T]} F_s).
equations: let y : [0, T] → R satisfy y(0) = y_0 and for all t ∈ (0, T) that y′(t) = f(y(t)), where f ∈ C(R, R).
This can be reformulated as
\[ y(t) = y_0 + \int_0^t f(y(s))\,ds. \tag{13} \]
If we assume the function f to satisfy a (local) Lipschitz condition, it is ensured that for sufficiently small
t ∈ (0, T], the Picard iterations
\[ y^{[0]}(t) = y_0, \qquad y^{[n+1]}(t) = y_0 + \int_0^t f(y^{[n]}(s))\,ds, \quad n \in \mathbb{N}_0, \tag{14} \]
converge to a unique solution of (13), cf., e.g., [28, §6]. Following this idea, we reformulate (12) as a
stochastic fixed point equation:
\[ \Phi_0(t,x) = \mathbb{E}\big[g(X_{t,T}^{x})\big], \qquad \Phi_{n+1}(t,x) = \mathbb{E}\big[g(X_{t,T}^{x})\big] + \int_t^T \mathbb{E}\big[f(\Phi_n(s, X_{t,s}^{x}))\big]\,ds, \quad n \in \mathbb{N}_0. \tag{15} \]
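To make (13) and (14) concrete, the Picard iteration can be sketched numerically; the left-endpoint quadrature rule and the test problem f(y) = y with y_0 = 1 (exact solution y(t) = e^t) are our own illustrative choices, not taken from the article:

```python
import math

def picard_iterates(f, y0, t, n_iter, n_grid=1000):
    """Approximate the Picard iterates y^[n] of y(t) = y0 + int_0^t f(y(s)) ds
    on a uniform grid, using the left-endpoint Riemann rule for the integral."""
    h = t / n_grid
    y = [y0] * (n_grid + 1)           # y^[0] is identically y0
    for _ in range(n_iter):
        integral, y_new = 0.0, [y0]
        for k in range(n_grid):
            integral += f(y[k]) * h   # running value of int_0^{t_{k+1}} f(y^[n](s)) ds
            y_new.append(y0 + integral)
        y = y_new
    return y[-1]                      # the current iterate evaluated at time t

# f(y) = y, y0 = 1: the unique fixed point is y(t) = e^t.
approx = picard_iterates(lambda y: y, 1.0, 1.0, n_iter=25)
# approx converges toward e = 2.71828... as n_iter and n_grid grow
```

For Lipschitz f the iterates contract at factorial speed, which is exactly the property exploited by the stochastic analogue (15).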
Indeed, the proof of [22, Proposition 2.2] provides a Banach space (V, ‖·‖_V), V ⊂ R^{[0,T]×R^d}, in which the
mapping Φ : V → V satisfying for all t ∈ [0, T], x ∈ R^d, v ∈ V that
\[ (\Phi(v))(t,x) = \mathbb{E}\big[g(X_{t,T}^{x})\big] + \int_t^T \mathbb{E}\big[f(v(s, X_{t,s}^{x}))\big]\,ds \tag{16} \]
is, under suitable assumptions, a contraction. Banach's fixed point theorem thus provides the existence
of a unique fixed point u ∈ V with Φ(u) = u and proves convergence of (15). In order to derive an
approximation scheme for the right-hand side of (15), we employ a telescoping sum argument and define
Φ_{−1} ≡ 0, which gives us for all n ∈ N, t ∈ [0, T], x ∈ R^d
\[ \begin{aligned} \Phi_{n+1}(t,x) &= \mathbb{E}\big[g(X_{t,T}^{x})\big] + \int_t^T \mathbb{E}\big[f(\Phi_n(s, X_{t,s}^{x}))\big]\,ds \\ &= \mathbb{E}\big[g(X_{t,T}^{x})\big] + \int_t^T \mathbb{E}\bigg[f(\Phi_0(s, X_{t,s}^{x})) + \sum_{l=1}^{n} \big(f(\Phi_l(s, X_{t,s}^{x})) - f(\Phi_{l-1}(s, X_{t,s}^{x}))\big)\bigg]\,ds \\ &= \mathbb{E}\big[g(X_{t,T}^{x})\big] + \sum_{l=0}^{n} \int_t^T \mathbb{E}\big[f(\Phi_l(s, X_{t,s}^{x})) - \mathbb{1}_{\mathbb{N}}(l)\, f(\Phi_{l-1}(s, X_{t,s}^{x}))\big]\,ds. \end{aligned} \tag{17} \]
With a standard Monte Carlo approach we can approximate E[g(X_{t,T}^x)] efficiently. For the sum on the
right-hand side of the above equation, note that
a) one can, for all t ∈ [0, T], h ∈ L¹([t, T], B([t, T]), λ|_{[t,T]}; R), approximate ∫_t^T h(s) dλ(s) via Monte Carlo
averages by considering the probability space ([t, T], B([t, T]), (1/(T−t)) λ|_{[t,T]}) and sampling independent,
continuous uniformly distributed τ_1, ..., τ_M : Ω → [t, T], M ∈ N. This yields
\[ \int_t^T h(s)\,d\lambda(s) = (T-t) \int_t^T h(s)\,d\big(\tfrac{1}{T-t}\lambda|_{[t,T]}\big)(s) \approx \frac{T-t}{M} \sum_{k=1}^{M} h(\tau_k). \tag{18} \]
b) due to the recursive structure of (17), summands of large l are costly to compute but, thanks to
convergence, tend to have small variance. In order to achieve a prescribed approximation accuracy,
these summands will only need relatively few samples for a suitable Monte Carlo average. On the
other hand, summands of small l are cheap to compute but have large variance, implying the need for
relatively many samples for a Monte Carlo average that satisfies a prescribed approximation accuracy.
This is the key insight, which motivates the multilevel approach.
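Item a) can be sketched in a few lines; the integrand s ↦ s², whose integral over [0, 1] is 1/3, is a hypothetical example of our own choosing:

```python
import random

def mc_time_integral(h, t, T, M, rng=random.Random(0)):
    """Monte Carlo approximation of int_t^T h(s) ds as in (18): average h over
    M independent uniform samples on [t, T] and rescale by the length T - t."""
    return (T - t) / M * sum(h(rng.uniform(t, T)) for _ in range(M))

est = mc_time_integral(lambda s: s * s, 0.0, 1.0, M=200_000)
# est is close to 1/3, with a statistical error of order M^(-1/2)
```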
To present the multilevel approach, we need a larger index set. Let Θ = ∪_{n∈N} Z^n. Let Y_t^{θ,x} : [t, T] × Ω →
R^d, θ ∈ Θ, t ∈ [0, T], x ∈ R^d, be independent Euler-Maruyama approximations of X_t^x (see (57) below
for a concise definition). Let T_t^θ : Ω → [t, T], θ ∈ Θ, t ∈ [0, T], be independent, continuous uniformly
distributed random variables. In particular, let (Y_t^{θ,x})_{θ∈Θ} and (T_t^θ)_{θ∈Θ} be independent. Combining this
with (17), a), and b) gives us the desired multilevel Picard approximations (see (58) below for a concise
definition):
\[ \begin{aligned} U_{n,M}^{\theta}(t,x) = {} & \frac{\mathbb{1}_{\mathbb{N}}(n)}{M^{n}} \sum_{i=1}^{M^{n}} g\big(Y_{t,T}^{(\theta,0,-i),x}\big) \\ & + \sum_{l=0}^{n-1} \frac{T-t}{M^{n-l}} \sum_{i=1}^{M^{n-l}} \Big[f \circ U_{l,M}^{(\theta,l,i)} - \mathbb{1}_{\mathbb{N}}(l)\, f \circ U_{l-1,M}^{(\theta,-l,i)}\Big]\Big(T_t^{(\theta,l,i)}, Y_{t,T_t^{(\theta,l,i)}}^{(\theta,l,i),x}\Big). \end{aligned} \tag{19} \]
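The recursion (19) can be sketched in code under strong simplifications (scalar state, a fixed number of Euler-Maruyama steps per path, and caller-supplied coefficient functions); this is an illustrative sketch of the scheme, not the construction analyzed in the article:

```python
import math
import random

rng = random.Random(0)

def euler_maruyama(mu, sigma, t, s, x, n_steps=20):
    """One independent Euler-Maruyama path approximating X_{t,s}^x for
    the SDE dX = mu(X) dr + sigma(X) dW."""
    h = (s - t) / n_steps
    y = x
    for _ in range(n_steps):
        y += mu(y) * h + sigma(y) * math.sqrt(h) * rng.gauss(0.0, 1.0)
    return y

def mlp(n, M, t, x, T, mu, sigma, f, g):
    """One realization of the MLP approximation U_{n,M}(t, x) from (19)."""
    if n == 0:
        return 0.0
    # Monte Carlo average of g along M^n independent approximate paths
    acc = sum(g(euler_maruyama(mu, sigma, t, T, x)) for _ in range(M**n)) / M**n
    # multilevel correction sums with uniformly sampled time points
    for l in range(n):
        for _ in range(M ** (n - l)):
            s = rng.uniform(t, T)
            y = euler_maruyama(mu, sigma, t, s, x)
            term = f(mlp(l, M, s, y, T, mu, sigma, f, g))
            if l > 0:
                term -= f(mlp(l - 1, M, s, y, T, mu, sigma, f, g))
            acc += (T - t) / M ** (n - l) * term
    return acc

# Sanity check with f = 0, mu = 0, sigma = 1, g(x) = x: then
# u(t, x) = E[x + W_{T-t}] = x, so U_{3,4}(0, 1) should be close to 1.
u_hat = mlp(3, 4, 0.0, 1.0, 1.0,
            lambda x: 0.0, lambda x: 1.0, lambda u: 0.0, lambda x: x)
```

Note how the inner levels l are called recursively with far fewer samples than the cheap outer levels, which is item b) in action.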
In Section 3 we will show that deep neural networks are capable of representing MLP approximations if
μ, σ, f, and g are DNN functions. For this reason, we assume in Section 2 that for a given PDE such as
(10) with coefficient functions μ_1, σ_1, f_1, and g_1, we can approximate its coefficients by DNN functions
μ_2, σ_2, f_2, and g_2, respectively. Thus we have to take two sources of error into account:
(i) the perturbation error of the PDE. We will utilize stochastic tools to obtain an estimate of |u_1(t, x) −
u_2(t, x)|, where for each i ∈ {1, 2} we have
\[ u_i(t,x) = \mathbb{E}\big[g_i(X_{t,T}^{x,i})\big] + \int_t^T \mathbb{E}\big[f_i(u_i(s, X_{t,s}^{x,i}))\big]\,ds, \tag{20} \]
with X_t^{x,i} solving the stochastic differential equation (SDE) with drift μ_i and diffusion σ_i with respect
to W, starting in (t, x) ∈ [0, T] × R^d (cf. (11), (12), and Lemma 2.4 below).
(ii) the MLP error (E[|U_{n,M}^0(t, x) − u_2(t, x)|²])^{1/2}, where U_{n,M}^θ, θ ∈ Θ, are the MLP approximations
built from the coefficient functions μ_2, σ_2, f_2, and g_2.
With that, the triangle inequality enables us to bound the full error:
\[ \Big(\mathbb{E}\big[|U_{n,M}^{0}(t,x) - u_1(t,x)|^2\big]\Big)^{1/2} \le \Big(\mathbb{E}\big[|U_{n,M}^{0}(t,x) - u_2(t,x)|^2\big]\Big)^{1/2} + |u_2(t,x) - u_1(t,x)|. \tag{21} \]
Lemma 2.1. (i) Let d, m ∈ N, c, κ ∈ [1, ∞), p ∈ [2, ∞), let ϕ ∈ C²(R^d, [1, ∞)), μ ∈ C(R^d, R^d), σ ∈
C(R^d, R^{d×m}) satisfy for all x, y ∈ R^d, z ∈ R^d \ {0} that
\[ \max\Bigg\{ \frac{\langle(\nabla\varphi)(x), z\rangle}{(\varphi(x))^{\frac{p-1}{p}} \|z\|},\ \frac{\langle z, (\operatorname{Hess}\varphi(x))z\rangle}{(\varphi(x))^{\frac{p-2}{p}} \|z\|^{2}},\ \frac{c\|x\| + \max\{\|\mu(0)\|, \|\sigma(0)\|_F\}}{(\varphi(x))^{\frac{1}{p}}} \Bigg\} \le c \tag{22} \]
and
\[ \max\{\|\mu(x) - \mu(y)\|, \|\sigma(x) - \sigma(y)\|_F\} \le c\|x - y\|. \tag{23} \]
Then it holds for all x ∈ R^d that
\[ |\langle\nabla(\varphi(x))^{\kappa}, \mu(x)\rangle| + \tfrac{1}{2}\operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}\operatorname{Hess}((\varphi(x))^{\kappa})\big) \le \tfrac{\kappa}{2}\big((\kappa-1)c^{4} + 3c^{3}\big)(\varphi(x))^{\kappa}. \tag{24} \]
(ii) Additionally let T ∈ (0, ∞), let (Ω, F, P, (F_t)_{t∈[0,T]}) be a filtered probability space satisfying the usual
conditions, let W : [0, T] × Ω → R^m be a standard (F_t)_{t∈[0,T]}-Brownian motion, and for all x ∈ R^d, t ∈
[0, T] let X_t^x : [t, T] × Ω → R^d be an (F_s)_{s∈[t,T]}-adapted stochastic process satisfying P-a.s. for all
s ∈ [t, T] that
\[ X_{t,s}^{x} = x + \int_t^s \mu(X_{t,r}^{x})\,dr + \int_t^s \sigma(X_{t,r}^{x})\,dW_r. \tag{25} \]
Then it holds for all t ∈ [0, T], s ∈ [t, T], x ∈ R^d that
\[ \mathbb{E}\big[(\varphi(X_{t,s}^{x}))^{\kappa}\big] \le e^{\frac{\kappa}{2}((\kappa-1)c^{4}+3c^{3})(s-t)} (\varphi(x))^{\kappa}. \tag{26} \]
Proof. Note that for all x ∈ R^d it holds that ∇((ϕ(x))^κ) = κ(ϕ(x))^{κ−1}(∇ϕ)(x) and
\[ \operatorname{Hess}((\varphi(x))^{\kappa}) = \kappa(\kappa-1)(\varphi(x))^{\kappa-2} (\nabla\varphi)(x)[(\nabla\varphi)(x)]^{*} + \kappa(\varphi(x))^{\kappa-1} \operatorname{Hess}\varphi(x). \tag{27} \]
The triangle inequality, (22), and (23) ensure for all x ∈ R^d that
\[ \max\{\|\mu(x)\|, \|\sigma(x)\|_F\} \le c\|x\| + \max\{\|\mu(0)\|, \|\sigma(0)\|_F\} \le c(\varphi(x))^{\frac{1}{p}}. \tag{28} \]
This and (22) ensure for all x ∈ R^d that
\[ \operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}\operatorname{Hess}\varphi(x)\big) \le c\|\sigma(x)\|_F^{2}\,(\varphi(x))^{\frac{p-2}{p}} \le c^{3}\varphi(x). \tag{29} \]
The Cauchy-Schwarz inequality and (22) ensure for all x ∈ R^d that
\[ \operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}(\nabla\varphi)(x)[(\nabla\varphi)(x)]^{*}\big) \le \|\sigma(x)[\sigma(x)]^{*}\|_F\, \|(\nabla\varphi)(x)[(\nabla\varphi)(x)]^{*}\|_F \le \|\sigma(x)\|_F^{2}\, \|(\nabla\varphi)(x)\|^{2} \le c^{4}(\varphi(x))^{2}. \tag{30} \]
It thus holds for all x ∈ R^d that
\[ \begin{aligned} \operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}\operatorname{Hess}((\varphi(x))^{\kappa})\big) &= \kappa(\kappa-1)(\varphi(x))^{\kappa-2}\operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}(\nabla\varphi)(x)[(\nabla\varphi)(x)]^{*}\big) \\ &\quad + \kappa(\varphi(x))^{\kappa-1}\operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}\operatorname{Hess}\varphi(x)\big) \\ &\le \kappa(\kappa-1)c^{4}(\varphi(x))^{\kappa} + \kappa c^{3}(\varphi(x))^{\kappa}. \end{aligned} \tag{31} \]
This and (22) ensure for all x ∈ R^d that
\[ \begin{aligned} |\langle\nabla(\varphi(x))^{\kappa}, \mu(x)\rangle| + \tfrac{1}{2}\operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}\operatorname{Hess}((\varphi(x))^{\kappa})\big) &\le \kappa(\varphi(x))^{\kappa-1}|\langle(\nabla\varphi)(x), \mu(x)\rangle| + \tfrac{1}{2}\kappa\big((\kappa-1)c^{4} + c^{3}\big)(\varphi(x))^{\kappa} \\ &\le c^{2}\kappa(\varphi(x))^{\kappa} + \tfrac{1}{2}\kappa\big((\kappa-1)c^{4} + c^{3}\big)(\varphi(x))^{\kappa} \\ &\le \big(\tfrac{1}{2}\kappa(\kappa-1)c^{4} + \tfrac{3}{2}\kappa c^{3}\big)(\varphi(x))^{\kappa}, \end{aligned} \tag{32} \]
which establishes item (i). This and [9, Lemma 2.2] (applied for every x ∈ R^d, t ∈ [0, T], s ∈ [t, T]
with T ← T − t, O ← R^d, V ← ([0, T − t] × R^d ∋ (s, x) ↦ (ϕ(x))^κ ∈ [1, ∞)), α ← ([0, T − t] ∋ s ↦
(½κ(κ − 1)c⁴ + (3/2)κc³) ∈ [0, ∞)), τ ← s − t, X ← (X_{t,t+r}^x)_{r∈[0,T−t]} in the notation of [9, Lemma 2.2])
proves that for all x ∈ R^d, t ∈ [0, T], s ∈ [t, T] we have that
\[ \mathbb{E}\big[(\varphi(X_{t,s}^{x}))^{\kappa}\big] \le e^{\frac{\kappa}{2}((\kappa-1)c^{4}+3c^{3})(s-t)} (\varphi(x))^{\kappa}. \tag{33} \]
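In one dimension, identity (27) used at the start of the proof reads (ϕ^κ)′′ = κ(κ − 1)ϕ^{κ−2}(ϕ′)² + κϕ^{κ−1}ϕ′′, and it can be sanity-checked numerically; the test function ϕ(x) = 1 + x² and all numerical parameters below are our own illustrative choices:

```python
def hess_identity_gap(phi, dphi, d2phi, kappa, x, h=1e-4):
    """Compare a central second difference of phi^kappa against the right-hand
    side of identity (27) specialized to one dimension."""
    f = lambda y: phi(y) ** kappa
    lhs = (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2
    rhs = (kappa * (kappa - 1) * phi(x) ** (kappa - 2) * dphi(x) ** 2
           + kappa * phi(x) ** (kappa - 1) * d2phi(x))
    return abs(lhs - rhs)

# phi(x) = 1 + x^2 with phi'(x) = 2x and phi''(x) = 2, kappa = 3, at x = 0.7
gap = hess_identity_gap(lambda y: 1 + y * y, lambda y: 2 * y, lambda y: 2.0, 3, 0.7)
# gap is of the order of the finite-difference error, i.e., very small
```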
let u_1, u_2 ∈ C([0, T] × R^d, R), and assume for all i ∈ {1, 2}, t ∈ [0, T], x ∈ R^d that
\[ \sup_{s\in[0,T]} \sup_{y\in\mathbb{R}^d} \bigg(\frac{|u_i(s,y)|}{\varphi(y)}\bigg) + \mathbb{E}\big[|g_i(X_{t,T}^{x,i})|\big] + \int_t^T \mathbb{E}\big[|(F_i(u_i))(s, X_{t,s}^{x,i})|\big]\,ds < \infty, \tag{41} \]
\[ u_i(t,x) = \mathbb{E}\big[g_i(X_{t,T}^{x,i})\big] + \int_t^T \mathbb{E}\big[(F_i(u_i))(s, X_{t,s}^{x,i})\big]\,ds. \tag{42} \]
Lemma 2.3. Assume Setting 2.2 and let x ∈ R^d, t ∈ [0, T], s ∈ [t, T]. Then it holds that
\[ \mathbb{E}\big[|u_2(s, X_{t,s}^{x,2})|^{q}\big] \le 2^{q-1} b^{q} (T + 1)^{q} (\varphi(x))^{\frac{\beta q}{p}}\, e^{\frac{\beta q}{2p}((q-1)c^{4}+3c^{3})(T-t) + c^{q}(T-s)^{q}}. \tag{43} \]
Proof. Display (41) and Lemma 2.1 (ii) (applied with κ ← q in the notation of Lemma 2.1) ensure that
\[ \Big(\mathbb{E}\big[|u_2(s, X_{t,s}^{x,2})|^{q}\big]\Big)^{\frac{1}{q}} \le \Bigg(\mathbb{E}\Bigg[\bigg(\frac{|u_2(s, X_{t,s}^{x,2})|}{\varphi(X_{t,s}^{x,2})}\bigg)^{q} \big(\varphi(X_{t,s}^{x,2})\big)^{q}\Bigg]\Bigg)^{\frac{1}{q}} \le \Bigg[\sup_{r\in[t,T]} \sup_{y\in\mathbb{R}^d} \frac{|u_2(r,y)|}{\varphi(y)}\Bigg] \Big(\mathbb{E}\big[(\varphi(X_{t,s}^{x,2}))^{q}\big]\Big)^{\frac{1}{q}} < \infty. \tag{44} \]
Note that (35), Jensen's inequality, and Lemma 2.1 (ii) (applied with κ ← q in the notation of Lemma
2.1) imply that
\[ \Big(\mathbb{E}\big[|g_2(X_{t,T}^{x,2})|^{q}\big]\Big)^{\frac{1}{q}} \le b\, \Big(\mathbb{E}\big[(\varphi(X_{t,T}^{x,2}))^{\frac{\beta q}{p}}\big]\Big)^{\frac{1}{q}} \le b\, \Big(\mathbb{E}\big[(\varphi(X_{t,T}^{x,2}))^{q}\big]\Big)^{\frac{\beta}{qp}} \le b\, e^{\frac{\beta}{2p}((q-1)c^{4}+3c^{3})(T-t)} (\varphi(x))^{\frac{\beta}{p}}. \tag{45} \]
The triangle inequality, (36), (35), and Lemma 2.1 (ii) (applied with κ ← q in the notation of Lemma
2.1) imply
\[ \begin{aligned} &\bigg(\int_s^T \mathbb{E}\big[|(F_2(u_2))(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}} \\ &\le \bigg(\int_s^T \mathbb{E}\big[|f_2(r, X_{t,r}^{x,2}, u_2(r, X_{t,r}^{x,2})) - f_2(r, X_{t,r}^{x,2}, 0)|^{q}\big]\,dr\bigg)^{\frac{1}{q}} + \bigg(\int_s^T \mathbb{E}\big[|f_2(r, X_{t,r}^{x,2}, 0)|^{q}\big]\,dr\bigg)^{\frac{1}{q}} \\ &\le c\bigg(\int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}} + \frac{b}{T}\bigg(\int_s^T \mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{\frac{\beta q}{p}}\big]\,dr\bigg)^{\frac{1}{q}} \\ &\le c\bigg(\int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}} + \frac{b(T-s)^{\frac{1}{q}}}{T} \sup_{r\in[s,T]} \Big(\mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{\frac{\beta q}{p}}\big]\Big)^{\frac{1}{q}} \\ &\le c\bigg(\int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}} + \frac{b(T-s)^{\frac{1}{q}}}{T}\, e^{\frac{\beta}{2p}((q-1)c^{4}+3c^{3})(T-t)} (\varphi(x))^{\frac{\beta}{p}}. \end{aligned} \tag{46} \]
This, (40), the definition of u_2, the triangle inequality, Jensen's inequality, Fubini's theorem, and (45)
ensure that
\[ \begin{aligned} \mathbb{E}\big[|u_2(s, X_{t,s}^{x,2})|^{q}\big] &= \mathbb{E}\Bigg[\bigg|g_2\big(X_{s,T}^{X_{t,s}^{x,2},2}\big) + \int_s^T (F_2(u_2))\big(r, X_{s,r}^{X_{t,s}^{x,2},2}\big)\,dr\bigg|^{q}\Bigg] \\ &\le \Bigg(\Big(\mathbb{E}\big[|g_2(X_{t,T}^{x,2})|^{q}\big]\Big)^{\frac{1}{q}} + (T-s)^{\frac{q-1}{q}}\bigg(\int_s^T \mathbb{E}\big[|(F_2(u_2))(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}}\Bigg)^{q} \\ &\le \Bigg((T-s)^{\frac{q-1}{q}}\, c \bigg(\int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}} + 2b\, e^{\frac{\beta}{2p}((q-1)c^{4}+3c^{3})(T-t)} (\varphi(x))^{\frac{\beta}{p}}\Bigg)^{q} \\ &\le 2^{q-1}(T-s)^{q-1} c^{q} \int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr + 2^{2q-1} b^{q}\, e^{\frac{\beta q}{2p}((q-1)c^{4}+3c^{3})(T-t)} (\varphi(x))^{\frac{\beta q}{p}}. \end{aligned} \tag{47} \]
Proof. First, [20, Theorem 1.2] (applied for every t ∈ [0, T], s ∈ [t, T], x ∈ R^d with H ← R^d, U ←
R^d, D ← R^d, T ← s − t, (F_r)_{r∈[0,T]} ← (F_{r+t})_{r∈[0,s−t]}, (W_r)_{r∈[0,T]} ← (W_{t+r} − W_t)_{r∈[0,s−t]}, (X_r)_{r∈[0,T]} ←
(X_{t,t+r}^{x,1})_{r∈[0,s−t]}, (Y_r)_{r∈[0,T]} ← (X_{t,t+r}^{x,2})_{r∈[0,s−t]}, μ ← μ_1, σ ← σ_1, (a_r)_{r∈[0,T]} ← (μ_2(X_{t,t+r}^{x,2}))_{r∈[0,s−t]},
(b_r)_{r∈[0,T]} ← (σ_2(X_{t,t+r}^{x,2}))_{r∈[0,s−t]}, τ ← (Ω ∋ ω ↦ s − t ∈ [0, s − t]), p ← 2, q ← ∞, r ← 2, α ←
1, β ← 1, ε ← 1 in the notation of [20, Theorem 1.2]), combined with the Cauchy-Schwarz inequality,
(37), and (38), shows that for all t ∈ [0, T], s ∈ [t, T], x ∈ R^d we have
\[ \begin{aligned} &\Big(\mathbb{E}\big[\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\|^{2}\big]\Big)^{\frac{1}{2}} \\ &\le \Bigg[\sup_{\substack{z,y\in\mathbb{R}^d,\\ z\neq y}} \exp\Bigg(\frac{\big[\langle z - y, \mu_1(z) - \mu_1(y)\rangle + \tfrac{1}{2}\|\sigma_1(z) - \sigma_1(y)\|_F^{2}\big]^{+}}{\|z - y\|^{2}} + \frac{1}{2}\Bigg)\Bigg] \\ &\quad \cdot \Bigg(\bigg(\int_t^s \mathbb{E}\big[\|\mu_2(X_{t,r}^{x,2}) - \mu_1(X_{t,r}^{x,2})\|^{2}\big]\,dr\bigg)^{\frac{1}{2}} + \sqrt{2}\bigg(\int_t^s \mathbb{E}\big[\|\sigma_2(X_{t,r}^{x,2}) - \sigma_1(X_{t,r}^{x,2})\|_F^{2}\big]\,dr\bigg)^{\frac{1}{2}}\Bigg) \\ &\le e^{c+c^{2}+\frac{1}{2}}\, \delta\,(1 + \sqrt{2}) \bigg(\int_t^s \mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{2q}\big]\,dr\bigg)^{\frac{1}{2}} \\ &\le e^{c+c^{2}+\frac{1}{2}}\, \delta\,(1 + \sqrt{2})\, \sqrt{s - t}\, \sup_{r\in[t,s]} \Big(\mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{2q}\big]\Big)^{\frac{1}{2}}. \end{aligned} \tag{50} \]
This combined with Lemma 2.1 (ii) (applied with κ ← 2q in the notation of Lemma 2.1) hence demonstrates for
all x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that
\[ \Big(\mathbb{E}\big[\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\|^{2}\big]\Big)^{\frac{1}{2}} \le \delta\, 2\sqrt{2}\, e^{c^{2}+c+\frac{1}{2}}\, \sqrt{s - t}\, e^{q((2q-1)c^{4}+3c^{3})(s-t)} (\varphi(x))^{q}. \tag{51} \]
This and Hölder's inequality ensure for all x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that
\[ \begin{aligned} \mathbb{E}\Big[\big(\varphi(X_{t,s}^{x,1}) + \varphi(X_{t,s}^{x,2})\big)^{\frac{\beta}{p}} \big\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\big\|\Big] &\le \mathbb{E}\Big[\big(\varphi(X_{t,s}^{x,1}) + \varphi(X_{t,s}^{x,2})\big)^{\frac{1}{2}} \big\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\big\|\Big] \\ &\le \Big(\mathbb{E}\big[\varphi(X_{t,s}^{x,1}) + \varphi(X_{t,s}^{x,2})\big]\Big)^{\frac{1}{2}} \Big(\mathbb{E}\big[\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\|^{2}\big]\Big)^{\frac{1}{2}} \\ &\le \sqrt{2}\, e^{\frac{3}{4}c^{3}(s-t)} (\varphi(x))^{\frac{1}{2}} \Big(\mathbb{E}\big[\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\|^{2}\big]\Big)^{\frac{1}{2}} \\ &\le \delta\, 4\sqrt{s - t}\, e^{(q((2q-1)c^{4}+3c^{3}))(s-t)+c^{2}+c+\frac{1}{2}} (\varphi(x))^{q+\frac{1}{2}}. \end{aligned} \tag{52} \]
Next, the triangle inequality and (40) guarantee for all t ∈ [0, T], s ∈ [t, T], x ∈ R^d that
\[ \begin{aligned} &\mathbb{E}\big[|u_2(s, X_{t,s}^{x,2}) - u_1(s, X_{t,s}^{x,1})|\big] \\ &\le \mathbb{E}\Big[\big|g_2\big(X_{s,T}^{X_{t,s}^{x,2},2}\big) - g_1\big(X_{s,T}^{X_{t,s}^{x,1},1}\big)\big|\Big] + \int_s^T \mathbb{E}\Big[\big|(F_2(u_2))\big(r, X_{s,r}^{X_{t,s}^{x,2},2}\big) - (F_1(u_1))\big(r, X_{s,r}^{X_{t,s}^{x,1},1}\big)\big|\Big]\,dr \\ &= \mathbb{E}\big[|g_2(X_{t,T}^{x,2}) - g_1(X_{t,T}^{x,1})|\big] + \int_s^T \mathbb{E}\big[|(F_2(u_2))(r, X_{t,r}^{x,2}) - (F_1(u_1))(r, X_{t,r}^{x,1})|\big]\,dr \\ &\le \mathbb{E}\big[|g_2(X_{t,T}^{x,2}) - g_1(X_{t,T}^{x,2})|\big] + \int_s^T \mathbb{E}\big[|(F_2(u_2))(r, X_{t,r}^{x,2}) - (F_1(u_2))(r, X_{t,r}^{x,2})|\big]\,dr \\ &\quad + \mathbb{E}\big[|g_1(X_{t,T}^{x,2}) - g_1(X_{t,T}^{x,1})|\big] + \int_s^T \mathbb{E}\big[|(F_1(u_2))(r, X_{t,r}^{x,2}) - (F_1(u_1))(r, X_{t,r}^{x,1})|\big]\,dr. \end{aligned} \tag{53} \]
This, (36), and (38) ensure for all t ∈ [0, T], s ∈ [t, T], x ∈ R^d that
\[ \begin{aligned} &\mathbb{E}\big[|u_2(s, X_{t,s}^{x,2}) - u_1(s, X_{t,s}^{x,1})|\big] \\ &\le T^{-\frac{1}{2}} b\, \mathbb{E}\Big[\big(\varphi(X_{t,T}^{x,1}) + \varphi(X_{t,T}^{x,2})\big)^{\frac{\beta}{p}} \big\|X_{t,T}^{x,1} - X_{t,T}^{x,2}\big\|\Big] + \delta\, \mathbb{E}\big[(\varphi(X_{t,T}^{x,2}))^{q}\big] \\ &\quad + \delta \int_s^T \mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{q} + |u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr + c \int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2}) - u_1(r, X_{t,r}^{x,1})|\big]\,dr \\ &\quad + b\,T^{-\frac{1}{2}} \sup_{r\in[s,T]} \mathbb{E}\Big[\big(\varphi(X_{t,r}^{x,1}) + \varphi(X_{t,r}^{x,2})\big)^{\frac{\beta}{p}} \big\|X_{t,r}^{x,1} - X_{t,r}^{x,2}\big\|\Big] \\ &\le 2b\,T^{-\frac{1}{2}} \sup_{r\in[s,T]} \mathbb{E}\Big[\big(\varphi(X_{t,r}^{x,1}) + \varphi(X_{t,r}^{x,2})\big)^{\frac{\beta}{p}} \big\|X_{t,r}^{x,1} - X_{t,r}^{x,2}\big\|\Big] \\ &\quad + \delta(T - s + 1) \sup_{r\in[s,T]} \mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{q}\big] + \delta(T - s) \sup_{r\in[s,T]} \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big] \\ &\quad + c \int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2}) - u_1(r, X_{t,r}^{x,1})|\big]\,dr. \end{aligned} \tag{54} \]
This, Lemma 2.1 (ii) (applied with κ ← q in the notation of Lemma 2.1), (52), and Lemma 2.3 hence
ensure for all x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that
\[ \begin{aligned} \mathbb{E}\big[|u_2(s, X_{t,s}^{x,2}) - u_1(s, X_{t,s}^{x,1})|\big] &\le \delta\, 8\, e^{(q((2q-1)c^{4}+3c^{3}))(T-t)+c^{2}+c+\frac{1}{2}} (\varphi(x))^{q+\frac{1}{2}} \\ &\quad + \delta(T - s + 1)\, e^{\frac{q}{2}((q-1)c^{4}+3c^{3})(T-t)} (\varphi(x))^{q} \\ &\quad + \delta(T - s)\, 2^{q} b^{q}\, e^{\frac{q}{4}((q-1)c^{4}+3c^{3})(T-t)+2^{q-1}c^{q}(T-t)^{q}} (\varphi(x))^{\frac{q}{2}} \\ &\quad + c \int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2}) - u_1(r, X_{t,r}^{x,1})|\big]\,dr \\ &\le \delta\, 2^{q+2} b^{q} (T + 1)\, e^{(q((2q-1)c^{4}+3c^{3}))(T-t)+2^{q}c^{q}T^{q}+(c+1)^{2}} (\varphi(x))^{q+\frac{1}{2}} \\ &\quad + c \int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2}) - u_1(r, X_{t,r}^{x,1})|\big]\,dr. \end{aligned} \tag{55} \]
This and Gronwall's inequality ensure for all x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that
\[ \mathbb{E}\big[|u_2(s, X_{t,s}^{x,2}) - u_1(s, X_{t,s}^{x,1})|\big] \le \delta\, 2^{q+2} b^{q} (T + 1)\, e^{q(2qc^{4}+3c^{3})(T-t)+2^{q}c^{q}T^{q}+(c+1)^{2}} (\varphi(x))^{q+\frac{1}{2}}. \tag{56} \]
let U_{n,M}^θ : [0, T] × R^d × Ω → R, n, M ∈ Z, θ ∈ Θ, satisfy for all θ ∈ Θ, n ∈ N_0, M ∈ N, t ∈ [0, T], x ∈ R^d
that
\[ \begin{aligned} U_{n,M}^{\theta}(t,x) = {} & \frac{\mathbb{1}_{\mathbb{N}}(n)}{M^{n}} \sum_{i=1}^{M^{n}} g_2\big(Y_{t,T}^{M,(\theta,0,-i),x}\big) \\ & + \sum_{l=0}^{n-1} \frac{T - t}{M^{n-l}} \sum_{i=1}^{M^{n-l}} \Big[F_2\big(U_{l,M}^{(\theta,l,i)}\big) - \mathbb{1}_{\mathbb{N}}(l)\, F_2\big(U_{l-1,M}^{(\theta,-l,i)}\big)\Big]\Big(T_t^{(\theta,l,i)}, Y_{t,T_t^{(\theta,l,i)}}^{M,(\theta,l,i),x}\Big). \end{aligned} \tag{58} \]
Proof. First, [22, Theorem 4.2(iv)] (applied with f ← f_2, g ← g_2, μ ← μ_2, σ ← σ_2 in the notation of [22,
Theorem 4.2(iv)]) ensures for all t ∈ [0, T], x ∈ R^d, n ∈ N_0, M ∈ N that
\[ \Big(\mathbb{E}\big[|U_{n,M}^{0}(t,x) - u_2(t,x)|^{2}\big]\Big)^{\frac{1}{2}} \le \Bigg[\frac{\exp(2ncT + \frac{M}{2})}{M^{\frac{n}{2}}} + \frac{1}{M^{\frac{M}{2}}}\Bigg] 12\, b\, c^{2}\, (\varphi(x))^{\frac{\beta+1}{p}}\, e^{9c^{3}T}. \tag{60} \]
This, the triangle inequality, and Lemma 2.4 ensure for all t ∈ [0, T], x ∈ R^d, n ∈ N_0, M ∈ N that
\[ \begin{aligned} \Big(\mathbb{E}\big[|U_{n,M}^{0}(t,x) - u_1(t,x)|^{2}\big]\Big)^{\frac{1}{2}} &\le \Big(\mathbb{E}\big[|U_{n,M}^{0}(t,x) - u_2(t,x)|^{2}\big]\Big)^{\frac{1}{2}} + |u_2(t,x) - u_1(t,x)| \\ &\le \Bigg[\frac{\exp(2ncT + \frac{M}{2})}{M^{\frac{n}{2}}} + \frac{1}{M^{\frac{M}{2}}}\Bigg] 12\, b\, c^{2}\, (\varphi(x))^{\frac{\beta+1}{p}}\, e^{9c^{3}T} \\ &\quad + \delta\, 2^{q+2} b^{q} (T + 1)\, e^{q(2qc^{4}+3c^{3})(T-t)+2^{q}c^{q}T^{q}+(c+1)^{2}} (\varphi(x))^{q+\frac{1}{2}} \\ &\le 4^{q} b^{q} c^{2} (T + 1)\, e^{q(2qc^{4}+3c^{3})T+2^{q}c^{q}T^{q}+(c+1)^{2}} (\varphi(x))^{q+\frac{1}{2}} \Bigg[\delta + \frac{\exp(2ncT + \frac{M}{2})}{M^{\frac{n}{2}}} + \frac{1}{M^{\frac{M}{2}}}\Bigg]. \end{aligned} \tag{61} \]
where T_1, ..., T_L are affine-linear mappings and A refers to a nonlinear mapping, the so-called activation
function, that is applied componentwise. Note that the last affine-linear mapping T_L is not followed by an
activation function. For every k ∈ {1, ..., L − 1} the composition (A ∘ T_k) is called a hidden layer, while T_L is
called the output layer. Similarly, the input of the function in (62) is called the input layer. We will only consider
compositions that are endowed with the rectified linear unit (ReLU) activation function, i.e., for all d ∈ N
we have A_d : R^d → R^d satisfying for all x = (x_1, ..., x_d) ∈ R^d that A_d(x) = (max{0, x_1}, ..., max{0, x_d}).
In this case, the function in (62) is characterized by the affine-linear mappings involved, which in turn are
of the form T_k(x) = W_k x + B_k, where (W_k, B_k) ∈ R^{l_k×l_{k−1}} × R^{l_k}, x ∈ R^{l_{k−1}}, k ∈ {1, ..., L}, l_0, ..., l_L ∈ N,
and thus the function in (62) is characterized by (W_1, B_1), ..., (W_L, B_L). We refer to the entries of the
matrices W_k, k ∈ {1, ..., L}, as weights, to the entries of the vectors B_k, k ∈ {1, ..., L}, as biases, and to
the collection ((W_1, B_1), ..., (W_L, B_L)) as a neural network. If L ∈ N ∩ [2, ∞), we call the network deep.
The set of all deep neural networks in our consideration can thus be defined as
\[ \mathrm{N} = \bigcup_{H\in\mathbb{N}} \bigcup_{(k_0,\dots,k_{H+1})\in\mathbb{N}^{H+2}} \Bigg[\prod_{n=1}^{H+1} \Big(\mathbb{R}^{k_n\times k_{n-1}} \times \mathbb{R}^{k_n}\Big)\Bigg]. \tag{63} \]
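The operators R (realization) and P (number of parameters) appearing in Theorem 1.1 can be sketched directly from this description; representing a network as a list of (W_k, B_k) pairs of nested Python lists is our own illustrative convention:

```python
def realization(phi, x):
    """R(Phi): evaluate the ReLU network phi = [(W1, B1), ..., (WL, BL)] at x,
    applying the ReLU activation after every layer except the last."""
    for k, (W, B) in enumerate(phi):
        x = [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(W, B)]
        if k < len(phi) - 1:                    # no activation on the output layer
            x = [max(a, 0.0) for a in x]
    return x

def num_params(phi):
    """P(Phi): total number of weights and biases."""
    return sum(len(W) * len(W[0]) + len(B) for W, B in phi)

# A deep network (L = 2) computing |x| = ReLU(x) + ReLU(-x) on R:
phi = [([[1.0], [-1.0]], [0.0, 0.0]),   # W1 in R^{2x1}, B1 in R^2
       ([[1.0, 1.0]], [0.0])]           # W2 in R^{1x2}, B2 in R
value = realization(phi, [-3.0])        # -> [3.0]
```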
For a given Φ ∈ N, we denote its corresponding function by R(Φ).
In practice, DNNs are employed in a host of machine-learning applications. In supervised learning
the goal is to choose Φ ∈ N in such a way that for a given set of data points (x_p, y_p)_{p∈[1,P]∩N} :
[1, P] ∩ N → R^d × R^m, d, m, P ∈ N, we have for all p ∈ [1, P] ∩ N that (R(Φ))(x_p) ≈ y_p. Typically,
one fixes the shape of a network and then aims to minimize a loss function, usually the sum of squared
errors, with respect to the choice of weights and biases. Due to the large number of parameters involved
in even small networks, this minimization problem is high-dimensional and, additionally, highly non-convex.
Therefore, classical minimization techniques such as gradient descent or Newton's method are
not suitable. To overcome this issue, stochastic gradient descent methods are employed. These seem to
work well, although there exist relatively few theoretical results which prove convergence of stochastic
gradient descent methods, see, e.g., [14]. For an overview of deep-learning-based algorithms for PDE
approximations, see [5].
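The training loop described above can be illustrated in miniature with a one-parameter least-squares model (a toy stand-in, chosen by us, for the weights and biases of an actual DNN):

```python
import random

def sgd_least_squares(data, lr=0.1, epochs=200, rng=random.Random(0)):
    """Minimal stochastic gradient descent for the model y = w * x under the
    squared-error loss: each step uses the gradient at a single data point."""
    w = 0.0
    for _ in range(epochs):
        x, y = rng.choice(data)          # sample one data point
        grad = 2.0 * (w * x - y) * x     # d/dw of (w * x - y)^2
        w -= lr * grad                   # stochastic gradient step
    return w

data = [(x, 3.0 * x) for x in (1.0, 2.0, 0.5, 1.5)]
w = sgd_least_squares(data)              # converges toward the true slope 3.0
```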
3.2 Properties of the proposed deep neural networks
For the proof of the main result of this section, Lemma 3.10, we need several auxiliary lemmas, which
we present next. The proofs of Lemmas 3.3-3.5 appear in [23] and are therefore omitted.
Lemma 3.3. ([23, Lemma 3.5]) Assume Setting 3.2, let H, k, l ∈ N, and let α, β ∈ {k} × NH × {l}. Then
it holds that |||α ⊞ β||| ≤ |||α||| + |||β|||.
Lemma 3.4. ([23, Lemma 3.8]) Assume Setting 3.2 and let d1 , d2 , d3 ∈ N, f ∈ C(Rd2 , Rd3 ), g ∈
C(Rd1 , Rd2 ), α, β ∈ D satisfy that f ∈ R({Φ ∈ N : D(Φ) = α}) and g ∈ R({Φ ∈ N : D(Φ) = β}).
Then it holds that (f ◦ g) ∈ R({Φ ∈ N : D(Φ) = α ⊙ β}).
Lemma 3.5. ([23, Lemma 3.9]) Assume Setting 3.2 and let M, H, p, q ∈ N, h_1, h_2, ..., h_M ∈ R, k_i ∈
D, f_i ∈ C(R^p, R^q), i ∈ [1, M] ∩ N, satisfy for all i ∈ [1, M] ∩ N that dim(k_i) = H + 2 and f_i ∈
R({Φ ∈ N : D(Φ) = k_i}). Then it holds that
\[ \sum_{i=1}^{M} h_i f_i \in \mathrm{R}\Big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = \boxplus_{i=1}^{M} k_i\big\}\Big). \tag{72} \]
The following lemma extends [24, Lemma 5.4] as well as [23, Lemma 3.6], and proves the existence of a
DNN of arbitrary length representing the identity function.
Lemma 3.6. Assume Setting 3.2 and let d, H ∈ N. Then it holds that
let φ ∈ N satisfy φ = ((W_1, B_1), (W_2, B_2), ..., (W_H, B_H), (W_{H+1}, B_{H+1})), for every a ∈ R let a⁺ =
max{a, 0}, and let x⁰ = (x⁰_1, ..., x⁰_d) ∈ R^d, x¹, x², ..., x^H ∈ R^{2d} satisfy for all n ∈ N ∩ [1, H] that
and continuing this procedure yields
\[ x^{H} = A_{2d}\big(W_H x^{H-1} + B_H\big) = A_{2d}\Big(\big((x_1^{0})^{+}, (-x_1^{0})^{+}, \dots, (x_d^{0})^{+}, (-x_d^{0})^{+}\big)^{\top}\Big) = \big((x_1^{0})^{+}, (-x_1^{0})^{+}, \dots, (x_d^{0})^{+}, (-x_d^{0})^{+}\big)^{\top}. \tag{79} \]
This combined with the fact that dim(D(Ψ)) = dim(D(Φg )) + 1 completes the proof.
Proof. Note that for all s ∈ [0, T], k ∈ [1, K] ∩ N we have
\[ (s \vee \tau_{k-1}) \wedge \tau_k = \begin{cases} \tau_{k-1}, & s \le \tau_{k-1}, \\ s, & s \in (\tau_{k-1}, \tau_k], \\ \tau_k, & s > \tau_k. \end{cases} \tag{84} \]
Next, for all s ∈ [0, T], k ∈ [1, K] ∩ N let Φ^k_s ∈ C(R^d, R^d) satisfy for all x ∈ R^d that
Note that for all k ∈ [1, K − 1] ∩ N, s ≤ τ_{k−1} it holds that Φ^k_s = Id_{R^d}. This ensures for all k ∈ [1, K − 1] ∩ N, n ∈
[k + 1, K] ∩ N, s ≤ τ_k that
\[ \Psi_s^{k} = \Psi_s^{n}, \tag{88} \]
and in particular Ψ^k_s = Ψ^K_s. Observe that for all k ∈ [1, K] ∩ N, s ≤ τ_k, x ∈ R^d we have Ψ^k_s(x) = g_s(x)
and thus, for all s ∈ [0, T], x ∈ R^d, it holds that Ψ^K_s(x) = g_s(x).
Next, Lemma 3.6 (applied with d ← d, H ← dim(D(Φ_{σ,0})) − 2 in the notation of Lemma 3.6) ensures for
all s ∈ [0, T] that
\[ \operatorname{Id}_{\mathbb{R}^d} \in \mathrm{R}\big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = n_d^{\dim(\mathrm{D}(\Phi_{\sigma,0}))}\big\}\big). \tag{89} \]
Note that Lemma 3.5 (applied for all s ∈ [0, T], k ∈ [1, K] ∩ N with M ← 2, p ← d, q ← d, H ←
dim(D(Φ_{σ,0})), h_1 ← 1, h_2 ← 1, f_1 ← Id_{R^d}, f_2 ← σ(·)[f((s ∨ τ_{k−1}) ∧ τ_k) − f(τ_{k−1})], k_1 ← n_d^{dim(D(Φ_{σ,0}))}, k_2 ←
D(Φ_{σ,0}) in the notation of Lemma 3.5) proves for all s ∈ [0, T], k ∈ [1, K] ∩ N that
\[ \Phi_s^{k} \in \mathrm{R}\big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = n_d^{\dim(\mathrm{D}(\Phi_{\sigma,0}))} \boxplus \mathrm{D}(\Phi_{\sigma,0})\big\}\big). \tag{90} \]
This and Lemma 3.4 (applied for all s ∈ [0, T] with d_1 ← d, d_2 ← d, d_3 ← d, f ← Φ^2_s, g ← Φ^1_s,
α ← n_d^{dim(D(Φ_{σ,0}))} ⊞ D(Φ_{σ,0}), β ← n_d^{dim(D(Φ_{σ,0}))} ⊞ D(Φ_{σ,0}) in the notation of Lemma 3.4) show for all
s ∈ [0, T] that
\[ \Psi_s^{2} = \Phi_s^{2} \circ \Phi_s^{1} \in \mathrm{R}\Big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = \odot_{l=1}^{2}\big[n_d^{\dim(\mathrm{D}(\Phi_{\sigma,0}))} \boxplus \mathrm{D}(\Phi_{\sigma,0})\big]\big\}\Big). \tag{91} \]
This and Lemma 3.4 (applied for all s ∈ [0, T] with d_1 ← d, d_2 ← d, d_3 ← d, f ← Φ^3_s, g ← Ψ^2_s,
α ← n_d^{dim(D(Φ_{σ,0}))} ⊞ D(Φ_{σ,0}), β ← ⊙_{l=1}^{2} [n_d^{dim(D(Φ_{σ,0}))} ⊞ D(Φ_{σ,0})] in the notation of Lemma 3.4) show for
all s ∈ [0, T] that
\[ \Psi_s^{3} = \Phi_s^{3} \circ \Psi_s^{2} \in \mathrm{R}\Big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = \odot_{l=1}^{3}\big[n_d^{\dim(\mathrm{D}(\Phi_{\sigma,0}))} \boxplus \mathrm{D}(\Phi_{\sigma,0})\big]\big\}\Big). \tag{92} \]
Continuing this procedure hence demonstrates that for all s ∈ [0, T], k ∈ [1, K] ∩ N it holds that
\[ \Psi_s^{k} \in \mathrm{R}\Big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = \odot_{l=1}^{k}\big[n_d^{\dim(\mathrm{D}(\Phi_{\sigma,0}))} \boxplus \mathrm{D}(\Phi_{\sigma,0})\big]\big\}\Big). \tag{93} \]
(ii) for all t_1, t_2 ∈ [0, T], s_1 ∈ [t_1, T], s_2 ∈ [t_2, T], θ_1, θ_2 ∈ Θ it holds that D(Y^{θ_1}_{t_1,s_1}) = D(Y^{θ_2}_{t_2,s_2}),
(iii) for all t ∈ [0, T], s ∈ [t, T], θ ∈ Θ it holds that dim(D(Y^θ_{t,s})) = K(max{dim(D(Φ_μ)), dim(D(Φ_{σ,0}))} −
1) + 1,
(iv) for all t ∈ [0, T], s ∈ [t, T], θ ∈ Θ it holds that |||D(Y^θ_{t,s})||| ≤ max{2d, |||D(Φ_μ)|||, |||D(Φ_{σ,0})|||}.
Proof. Without loss of generality assume that dim(D(Φ_μ)) = dim(D(Φ_{σ,0})); otherwise we may refer to
Lemma 3.7. Let Σ ∈ C(R^d, R^{d×(d+1)}) satisfy for all x ∈ R^d, i ∈ {1, ..., d}, j ∈ {1, ..., d + 1} that
\[ (\Sigma(x))_{i,j} = \begin{cases} \mu_i(x), & j = 1, \\ \sigma_{i,j-1}(x), & \text{else}. \end{cases} \tag{96} \]
For all θ ∈ Θ let f^θ : R → R^{d+1} satisfy for all t ∈ [0, T] that
\[ f^{\theta}(t) = \begin{pmatrix} t \\ W_t^{\theta}(\omega) \end{pmatrix}. \tag{99} \]
Observe that for all x ∈ R^d, t ∈ [0, T], θ ∈ Θ it holds that
\[ \Sigma(x)\, f^{\theta}(t) = \mu(x)\,t + \sigma(x)\,W_t^{\theta}(\omega). \tag{100} \]
Hence Lemma 3.8 (applied for all t ∈ [0, T], s ∈ [t, T], θ ∈ Θ with d ↶ d, m ↶ d + 1, K ↶ K, τ_0 ↶ τ_0 ∨ t, τ_1 ↶ τ_1 ∨ t, ..., τ_K ↶ τ_K ∨ t, σ ↶ Σ, f ↶ f^θ, g_s ↶ Y^{θ,·}_{t,s} in the notation of Lemma 3.8) shows for all t ∈ [0, T], s ∈ [t, T], θ ∈ Θ the existence of a Y^θ_{t,s} ∈ N such that

(I) it holds that R(Y^θ_{t,s}) ∈ C(R^d, R^d) and for all x ∈ R^d that (R(Y^θ_{t,s}))(x) = Y^{θ,x}_{t,s}(ω),

(II) it holds that D(Y^θ_{t,s}) = ⊙_{l=1}^{K} [n_d^{dim(D(Φ^{σ,0}))} ⊞ D(Φ^µ) ⊞ D(Φ^{σ,0})].
Observe that D(Y^θ_{t,s}) does not depend on t, s, or θ, which thus implies item (ii). Item (iii) follows from item (II) and the definition of ⊙. Note that for all H_1, H_2, α_0, ..., α_{H_1+1}, β_0, ..., β_{H_2+1} ∈ N, α, β ∈ D satisfying α = (α_0, ..., α_{H_1+1}) and β = (β_0, ..., β_{H_2+1}) with α_0 = β_{H_2+1} it holds that |||α ⊙ β||| ≤ max{|||α|||, |||β|||, 2α_0}. This combined with item (II) implies item (iv). The proof is thus completed.
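The two architecture facts used here can be made concrete. As an illustration (an assumption about the construction, not the paper's formal definitions): if ⊙ composes two architectures through a ReLU identity layer of width 2α_0, as the bound |||α ⊙ β||| ≤ max{|||α|||, |||β|||, 2α_0} suggests, and n_{H+2} denotes the depth-(H + 2) scalar ReLU identity architecture (1, 2, ..., 2, 1), then the bookkeeping reads:

```python
def comp(alpha, beta):
    """Dimension tuple of α ⊙ β: β's output layer (width β_{H2+1} = α_0)
    is routed through a ReLU identity layer of width 2·α_0 into α."""
    assert alpha[0] == beta[-1], "composition needs matching widths"
    return beta[:-1] + (2 * alpha[0],) + alpha[1:]

def identity_tuple(H):
    """n_{H+2}: scalar ReLU identity with H hidden layers of width 2."""
    return (1,) + (2,) * H + (1,)

alpha, beta = (3, 8, 1), (5, 4, 4, 3)
gamma = comp(alpha, beta)
assert gamma == (5, 4, 4, 6, 8, 1)
assert len(gamma) == len(alpha) + len(beta) - 1   # dim(α ⊙ β) = dim(α) + dim(β) − 1
assert max(gamma) <= max(max(alpha), max(beta), 2 * alpha[0])  # the |||·||| bound
assert max(identity_tuple(5)) == 2                # |||n_{H+2}||| = 2
```

With this reading, items (iii) and (iv) are exactly the depth and width bookkeeping checked by the assertions above.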
and let ω ∈ Ω. Then for all n ∈ N_0 there exists a family (Φ^θ_{n,t})_{θ∈Θ, t∈[0,T]} ⊆ N such that

(i) it holds for all t_1, t_2 ∈ [0, T], θ_1, θ_2 ∈ Θ that

    D(Φ^{θ_1}_{n,t_1}) = D(Φ^{θ_2}_{n,t_2}),    (104)
We will prove this lemma via induction on n ∈ N_0. For n = 0 we have for all θ ∈ Θ, x ∈ R^d, t ∈ [0, T] that U^θ_0(t, x, ω) = 0. By choosing all weights and biases to be 0, the 0-function can be represented by deep neural networks of arbitrary shape. Therefore items (i)-(iv) hold true for n = 0. For the induction step n → n + 1 we will construct U^θ_{n+1}(t, x), θ ∈ Θ, t ∈ [0, T], x ∈ R^d, via DNN functions, which we achieve by representing each summand of (103) by a DNN and then, in order to be able to utilize Lemma 3.5, composing each DNN with a suitably sized network representing the identity function on R. Lemma 3.9 yields the existence of a family (Y^θ_{t,s})_{θ∈Θ, s,t∈[0,T]} ⊆ N satisfying for all θ ∈ Θ, t ∈ [0, T], s ∈ [t, T], x ∈ R^d that (R(Y^θ_{t,s}))(x) = Y^{θ,x}_{t,s}(ω), dim(D(Y^θ_{t,s})) = Y, |||D(Y^θ_{t,s})||| ≤ max{2d, |||D(Φ^µ)|||, |||D(Φ^{σ,0})|||}, and D(Y^θ_{t,s}) = D(Y^0_{0,0}). Lemma 3.4 (applied for all θ ∈ Θ, t ∈ [0, T], x ∈ R^d with d_1 ↶ d, d_2 ↶ d, d_3 ↶ 1, α ↶ D(Φ^g), β ↶ D(Y^θ_{t,T}), f ↶ g, g ↶ Y^{θ,x}_{t,T} in the notation of Lemma 3.4) ensures for all θ ∈ Θ, t ∈ [0, T], x ∈ R^d that

    g ◦ Y^{θ,x}_{t,T} ∈ R({Φ ∈ N : D(Φ) = D(Φ^g) ⊙ D(Y^θ_{t,T})}).    (109)
Note that Lemma 3.6 (applied with H ↶ (n + 1)(dim(D(Φ^f)) − 2 + Y) + 1 in the notation of Lemma 3.6) ensures that

    Id_R ∈ R({Φ ∈ N : D(Φ) = n_{(n+1)(dim(D(Φ^f))−2+Y)+1}}).    (110)

This, (109), and Lemma 3.4 (applied for all t ∈ [0, T], θ ∈ Θ, x ∈ R^d with d_1 ↶ d, d_2 ↶ 1, d_3 ↶ 1, f ↶ Id_R, g ↶ g ◦ Y^{θ,x}_{t,T}, α ↶ n_{(n+1)(dim(D(Φ^f))−2+Y)+1}, β ↶ D(Φ^g) ⊙ D(Y^θ_{t,T}) in the notation of Lemma 3.4) show that for all t ∈ [0, T], θ ∈ Θ, x ∈ R^d it holds that

    g ◦ Y^{θ,x}_{t,T} ∈ R({Φ ∈ N : D(Φ) = n_{(n+1)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^g) ⊙ D(Y^θ_{t,T})}).    (111)
Next, the induction hypothesis implies for all l ∈ [0, n] ∩ N_0, t ∈ [0, T], θ ∈ Θ the existence of Φ^θ_{l,t} ∈ N satisfying R(Φ^θ_{l,t}) = U^θ_l(t, ·, ω) and D(Φ^θ_{l,t}) = D(Φ^0_{l,0}). This and Lemma 3.4 (applied for all θ, η ∈ Θ, t ∈ [0, T], l ∈ [0, n] ∩ N_0 with d_1 ↶ d, d_2 ↶ d, d_3 ↶ 1, f ↶ U^η_l(T^θ_t(ω), ·, ω), g ↶ Y^{θ,x}_{t,T^θ_t(ω)}, α ↶ D(Φ^η_{l,T^θ_t(ω)}), β ↶ D(Y^θ_{t,T^θ_t(ω)}) in the notation of Lemma 3.4) show that for all θ, η ∈ Θ, t ∈ [0, T], l ∈ [0, n] ∩ N_0 it holds that

    U^η_l(T^θ_t(ω), Y^{θ,x}_{t,T^θ_t(ω)}(ω), ω) = (R(Φ^η_{l,T^θ_t(ω)}))((R(Y^θ_{t,T^θ_t(ω)}))(x))
        ∈ R({Φ ∈ N : D(Φ) = D(Φ^η_{l,T^θ_t(ω)}) ⊙ D(Y^θ_{t,T^θ_t(ω)})})    (112)
        = R({Φ ∈ N : D(Φ) = D(Φ^0_{l,0}) ⊙ D(Y^0_{0,0})}).
Note that Lemma 3.6 (applied for all l ∈ [1, n − 1] ∩ N with H ↶ (n − l)(dim(D(Φ^f)) − 2 + Y) + 1 in the notation of Lemma 3.6) shows for all l ∈ [1, n − 1] ∩ N that

    Id_R ∈ R({Φ ∈ N : D(Φ) = n_{(n−l)(dim(D(Φ^f))−2+Y)+1}}).    (113)
Furthermore, (112) (applied with l ↶ n) and Lemma 3.4 (applied for all θ, η ∈ Θ, t ∈ [0, T] with d_1 ↶ d, d_2 ↶ 1, d_3 ↶ 1, f ↶ f, g ↶ U^η_n(T^θ_t(ω), Y^{θ,x}_{t,T^θ_t(ω)}(ω), ω), α ↶ D(Φ^f), β ↶ D(Φ^0_{n,0}) ⊙ D(Y^0_{0,0}) in the notation of Lemma 3.4) prove that for all η, θ ∈ Θ, t ∈ [0, T] it holds that

    f ◦ U^η_n(T^θ_t(ω), Y^{θ,x}_{t,T^θ_t(ω)}(ω), ω) ∈ R({Φ ∈ N : D(Φ) = D(Φ^f) ⊙ D(Φ^0_{n,0}) ⊙ D(Y^0_{0,0})}).    (116)
To reiterate what was established so far: for each summand of U^θ_{n+1}, θ ∈ Θ, the existence of a suitable DNN representation was proven. The next step shows that these representing DNNs all have the same depth. Note that Lemma 3.9 implies that for all t ∈ [0, T], s ∈ [t, T], θ ∈ Θ it holds that Y = dim(D(Y^θ_{t,s})). This, the definition of ⊙, and the fact that for all l ∈ [0, n] ∩ N_0 we have that dim(D(Φ^0_{l,0})) = l(dim(D(Φ^f)) − 2) + (l + 1)Y + dim(D(Φ^g)) − 1 imply that

    dim(D(Φ^f) ⊙ D(Φ^0_{n,0}) ⊙ D(Y^0_{0,0}))
    = dim(D(Φ^f)) + dim(D(Φ^0_{n,0})) + dim(D(Y^0_{0,0})) − 2
    = dim(D(Φ^f)) + n(dim(D(Φ^f)) − 2) + (n + 1)Y + dim(D(Φ^g)) − 1 + Y − 2    (119)
    = (n + 1)(dim(D(Φ^f)) − 2) + (n + 2)Y + dim(D(Φ^g)) − 1,

and for all l ∈ [0, n − 1] ∩ N_0 that

    dim(D(Φ^f) ⊙ n_{(n−l)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^0_{l,0}) ⊙ D(Y^0_{0,0}))
    = dim(D(Φ^f)) + dim(n_{(n−l)(dim(D(Φ^f))−2+Y)+1}) + dim(D(Φ^0_{l,0})) + dim(D(Y^0_{0,0})) − 3
    = dim(D(Φ^f)) + (n − l)(dim(D(Φ^f)) − 2 + Y) + 1 + l(dim(D(Φ^f)) − 2) + (l + 1)Y + dim(D(Φ^g)) − 1 + Y − 3    (120)
    = (n + 1)(dim(D(Φ^f)) − 2) + (n + 2)Y + dim(D(Φ^g)) − 1.
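The depth bookkeeping behind (119) and (120) is elementary arithmetic and can be checked mechanically. A small sketch (the sample values for dim(D(Φ^f)), dim(D(Φ^g)), and Y are hypothetical; the depth of Φ^0_{l,0} is taken from the induction hypothesis, cf. (122)):

```python
def depth_phi(l, dim_f, dim_g, Y):
    # depth of Φ^0_{l,0} from the induction hypothesis, cf. (122)
    return l * (dim_f - 2) + (l + 1) * Y + dim_g - 1

def dim_comp(*dims):
    # each application of ⊙ reduces the total depth by one
    return sum(dims) - (len(dims) - 1)

n, dim_f, dim_g, Y = 4, 5, 6, 7
target = (n + 1) * (dim_f - 2) + (n + 2) * Y + dim_g - 1

# summand without identity padding, cf. (119)
assert dim_comp(dim_f, depth_phi(n, dim_f, dim_g, Y), Y) == target

# padded summands with identity network n_{(n-l)(dim_f - 2 + Y) + 1}, cf. (120)
for l in range(0, n):
    pad = (n - l) * (dim_f - 2 + Y) + 1
    assert dim_comp(dim_f, pad, depth_phi(l, dim_f, dim_g, Y), Y) == target
```

All summands hit the same target depth, which is exactly what Lemma 3.5 requires below.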
Since all DNNs that are necessary to construct U^θ_{n+1}, θ ∈ Θ, have the same depth, Lemma 3.5 and (103) imply that there exist (Φ^θ_{n+1,t})_{t∈[0,T], θ∈Θ} ⊆ N satisfying for all θ ∈ Θ, t ∈ [0, T], x ∈ R^d that
    (R(Φ^θ_{n+1,t}))(x)
    = (1/M^{n+1}) Σ_{i=1}^{M^{n+1}} g(Y^{(θ,0,−i),x}_{t,T}(ω))
    + ((T − t)/M) Σ_{i=1}^{M} (f ◦ U^{(θ,n,i)}_n)(T^{(θ,n,i)}_t(ω), Y^{(θ,n,i),x}_{t,T^{(θ,n,i)}_t(ω)}(ω), ω)
    + Σ_{l=0}^{n−1} ((T − t)/M^{n+1−l}) Σ_{i=1}^{M^{n+1−l}} (f ◦ U^{(θ,l,i)}_l)(T^{(θ,l,i)}_t(ω), Y^{(θ,l,i),x}_{t,T^{(θ,l,i)}_t(ω)}(ω), ω)    (121)
    − Σ_{l=1}^{n} ((T − t)/M^{n+1−l}) Σ_{i=1}^{M^{n+1−l}} (f ◦ U^{(θ,−l,i)}_{l−1})(T^{(θ,l,i)}_t(ω), Y^{(θ,l,i),x}_{t,T^{(θ,l,i)}_t(ω)}(ω), ω)
    = U^θ_{n+1}(t, x, ω),

that

    dim(D(Φ^θ_{n+1,t})) = (n + 1)(dim(D(Φ^f)) − 2) + (n + 2)Y + dim(D(Φ^g)) − 1,    (122)
and that

    D(Φ^θ_{n+1,t}) = (⊞_{i=1}^{M^{n+1}} [n_{(n+1)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^g) ⊙ D(Y^0_{0,0})])
    ⊞ (⊞_{i=1}^{M} [D(Φ^f) ⊙ D(Φ^0_{n,0}) ⊙ D(Y^0_{0,0})])
    ⊞ (⊞_{l=0}^{n−1} ⊞_{i=1}^{M^{n+1−l}} [D(Φ^f) ⊙ n_{(n−l)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^0_{l,0}) ⊙ D(Y^0_{0,0})])    (123)
    ⊞ (⊞_{l=1}^{n} ⊞_{i=1}^{M^{n+1−l}} [D(Φ^f) ⊙ n_{(n−l+1)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^0_{l−1,0}) ⊙ D(Y^0_{0,0})]).
Additionally, (123) and Lemma 3.3 prove for all t ∈ [0, T], θ ∈ Θ that

    |||D(Φ^θ_{n+1,t})||| ≤ Σ_{i=1}^{M^{n+1}} |||n_{(n+1)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^g) ⊙ D(Y^0_{0,0})|||
    + Σ_{i=1}^{M} |||D(Φ^f) ⊙ D(Φ^0_{n,0}) ⊙ D(Y^0_{0,0})|||
    + Σ_{l=0}^{n−1} Σ_{i=1}^{M^{n+1−l}} |||D(Φ^f) ⊙ n_{(n−l)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^0_{l,0}) ⊙ D(Y^0_{0,0})|||    (125)
    + Σ_{l=1}^{n} Σ_{i=1}^{M^{n+1−l}} |||D(Φ^f) ⊙ n_{(n−l+1)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^0_{l−1,0}) ⊙ D(Y^0_{0,0})|||.
Note that for all H_1, H_2, α_0, ..., α_{H_1+1}, β_0, ..., β_{H_2+1} ∈ N, α, β ∈ D satisfying α = (α_0, ..., α_{H_1+1}) and β = (β_0, ..., β_{H_2+1}) with α_0 = β_{H_2+1} it holds that |||α ⊙ β||| ≤ max{|||α|||, |||β|||, 2α_0}. This, (125), the fact that for all H ∈ N it holds that |||n_{H+2}||| = 2, and the induction hypothesis applied for all l ∈ [0, n] ∩ N_0 establish (127). Finally, (121), (122), (124), and (127) complete the induction step. By the principle of induction the proof is thus completed.
‖A‖_F = (Σ_{i,j=1}^{d} (a_{i,j})²)^{1/2}, let ⟨·, ·⟩ : (∪_{d∈N} R^d × R^d) → R satisfy for all d ∈ N, x = (x_1, ..., x_d), y = (y_1, ..., y_d) ∈ R^d that ⟨x, y⟩ = Σ_{i=1}^{d} x_i y_i, for all d ∈ N let g_d ∈ C(R^d, R), µ_d = (µ_{d,i})_{i∈{1,...,d}} ∈ C(R^d, R^d), σ_d = (σ_{d,i,j})_{i,j∈{1,...,d}} ∈ C(R^d, R^{d×d}), let f ∈ C(R, R), assume for all v, w ∈ R that |f(v) − f(w)| ≤ c|v − w|, for every ε ∈ (0, 1], d ∈ N, v ∈ R^d let σ̂_{d,ε} ∈ C(R^d, R^{d×d}), g_{d,ε}, µ̃_{d,ε}, σ̃_{d,ε,v} ∈ N, assume for all d ∈ N, x, y, v ∈ R^d, ε ∈ (0, 1] that R(g_{d,ε}) ∈ C(R^d, R), R(µ̃_{d,ε}) ∈ C(R^d, R^d), R(σ̃_{d,ε,v}) ∈ C(R^d, R^d), R(σ̃_{d,ε,v}) = σ̂_{d,ε}(·)v, D(σ̃_{d,ε,v}) = D(σ̃_{d,ε,0}),

    |(R(g_{d,ε}))(x)| ≤ b(d^{2c} + ‖x‖²)^{1/2},    max{‖(R(µ̃_{d,ε}))(0)‖, ‖σ̂_{d,ε}(0)‖_F} ≤ cd^c,    (128)

    max{|g_d(x) − (R(g_{d,ε}))(x)|, ‖µ_d(x) − (R(µ̃_{d,ε}))(x)‖, ‖σ_d(x) − σ̂_{d,ε}(x)‖_F} ≤ εBd^p (d^{2c} + ‖x‖²)^q,    (129)

    |(R(g_{d,ε}))(x) − (R(g_{d,ε}))(y)| ≤ bT^{−1/2} (2d^{2c} + ‖x‖² + ‖y‖²)^{1/2} ‖x − y‖,    (130)

    max{‖(R(µ̃_{d,ε}))(x) − (R(µ̃_{d,ε}))(y)‖, ‖σ̂_{d,ε}(x) − σ̂_{d,ε}(y)‖_F} ≤ c‖x − y‖,    (131)

    max{|||D(g_{d,ε})|||, |||D(µ̃_{d,ε})|||, |||D(σ̃_{d,ε,0})|||} ≤ Bd^p ε^{−α},    (132)

    max{dim(D(g_{d,ε})), dim(D(µ̃_{d,ε})), dim(D(σ̃_{d,ε,0}))} ≤ Bd^p ε^{−β},    (133)

and for all d ∈ N let ν_d be a probability measure on (R^d, B(R^d)) satisfying (∫_{R^d} ‖y‖^{2q+1} ν_d(dy))^{1/(2q+1)} ≤ Bd^p.
Then
(i) for every d ∈ N there exists a unique viscosity solution u_d ∈ {u ∈ C([0, T] × R^d, R) : sup_{s∈[0,T]} sup_{y∈R^d} |u(s,y)|/(1 + ‖y‖) < ∞} of

    (∂/∂t u_d)(t, x) + ⟨µ_d(x), (∇_x u_d)(t, x)⟩ + ½ Tr(σ_d(x)[σ_d(x)]* (Hess_x u_d)(t, x)) + f(u_d(t, x)) = 0    (134)

with u_d(T, x) = g_d(x) for t ∈ (0, T), x ∈ R^d,

(ii) there exist C = (C_γ)_{γ∈(0,1]} : (0, 1] → (0, ∞), η ∈ (0, ∞), and (Ψ_{d,ε})_{d∈N, ε∈(0,1]} ⊆ N such that for all d ∈ N, γ, ε ∈ (0, 1] it holds that R(Ψ_{d,ε}) ∈ C(R^d, R), P(Ψ_{d,ε}) ≤ C_γ d^η ε^{−(6+2α+β+γ)}, and

    (∫_{R^d} |u_d(0, x) − (R(Ψ_{d,ε}))(x)|² ν_d(dx))^{1/2} ≤ ε.    (135)
Note that the triangle inequality, (129), and (131) imply for all d ∈ N, x, y ∈ R^d, ε ∈ (0, 1] that

    ‖µ_d(x) − µ_d(y)‖ ≤ ‖µ_d(x) − (R(µ̃_{d,ε}))(x)‖ + ‖(R(µ̃_{d,ε}))(x) − (R(µ̃_{d,ε}))(y)‖ + ‖(R(µ̃_{d,ε}))(y) − µ_d(y)‖    (137)
    ≤ εBd^p [(d^{2c} + ‖x‖²)^q + (d^{2c} + ‖y‖²)^q] + c‖x − y‖.
In the same manner it follows for all d ∈ N, x, y ∈ R^d that

    |g_d(x) − g_d(y)| ≤ bT^{−1/2} (2d^{2c} + ‖x‖² + ‖y‖²)^{1/2} ‖x − y‖.    (140)
Note that due to the triangle inequality, (128), and (129) it holds for all d ∈ N, ε ∈ (0, 1] that

    max{‖µ_d(0)‖, ‖σ_d(0)‖_F} ≤ max{‖µ_d(0) − (R(µ̃_{d,ε}))(0)‖, ‖σ_d(0) − σ̂_{d,ε}(0)‖_F} + cd^c ≤ εBd^{p+2qc} + cd^c.    (141)
Next, the triangle inequality, (128), and (129) ensure for all d ∈ N, x ∈ R^d, ε ∈ (0, 1] that

    |g_d(x)| ≤ |g_d(x) − (R(g_{d,ε}))(x)| + |(R(g_{d,ε}))(x)| ≤ εBd^p (d^{2c} + ‖x‖²)^q + b(d^{2c} + ‖x‖²)^{1/2}.    (143)
Next, [23, Corollary 3.13] (applied for all ε ∈ (0, 1] with L ↶ c, q ↶ q, f ↶ f, ε ↶ ε in the notation of [23, Corollary 3.13]) ensures the existence of f_ε ∈ N, ε ∈ (0, 1], which satisfy for all v, w ∈ R, ε ∈ (0, 1] that R(f_ε) ∈ C(R, R), |(R(f_ε))(v) − (R(f_ε))(w)| ≤ c|v − w|, |f(v) − (R(f_ε))(v)| ≤ ε(1 + |v|^q), dim(D(f_ε)) = 3, and

    |||D(f_ε)||| ≤ 16 max{1, |c(4c + 2|f(0)|)|^{1/(q−1)}} ε^{−q/(q−1)}.    (145)
Due to the fact that B ≥ 1 + |f(0)| it holds for all ε ∈ (0, 1] that

    max{|f(0)|, |(R(f_ε))(0)|} ≤ b/T + B = (b + BT)/T.    (147)

Next, for all d ∈ N let ϕ_d ∈ C²(R^d, [1, ∞)) satisfy for all x ∈ R^d that ϕ_d(x) = d^{2c} + ‖x‖². Hence for all d ∈ N, x ∈ R^d we have

    (∇_x ϕ_d)(x) = 2x    and    (Hess_x ϕ_d)(x) = 2 Id_{R^d}.    (148)
By the Cauchy-Schwarz inequality it thus holds for all d ∈ N, x ∈ R^d, z ∈ R^d \ {0} that ⟨z, (Hess_x ϕ_d)(x) z⟩/‖z‖² ≤ 2.
Next, Jensen's inequality ensures that for all d ∈ N, x ∈ R^d we have (d^c + ‖x‖)² ≤ 2(d^{2c} + ‖x‖²) = 2ϕ_d(x), which in turn implies that d^c + ‖x‖ ≤ √2 (ϕ_d(x))^{1/2}. This and (141) ensure for all d ∈ N, x ∈ R^d, ε ∈ (0, 1] that

    [c‖x‖ + max{‖µ_d(0)‖, ‖σ_d(0)‖_F, ‖(R(µ̃_{d,ε}))(0)‖, ‖σ̂_{d,ε}(0)‖_F}]/(ϕ_d(x))^{1/2} ≤ c(d^c + ‖x‖)/(ϕ_d(x))^{1/2} ≤ √2 c ≤ 2c.    (151)
Next, note that for all d ∈ N it holds that

    sup_{r∈(0,∞)} [inf_{x∈R^d, ‖x‖≥r} ϕ_d(x)] = ∞    and    inf_{r∈(0,∞)} [sup_{x∈R^d, ‖x‖>r} (|f(0)| + |g_d(x)|)/ϕ_d(x)] = 0.    (152)
Lemma 2.1(i) (applied for all d ∈ N with d ↶ d, m ↶ d, c ↶ 2c, κ ↶ 1, p ↶ 2, ϕ ↶ ϕ_d, µ ↶ µ_d, σ ↶ σ_d in the notation of Lemma 2.1) ensures for all d ∈ N, x ∈ R^d that

    ⟨µ_d(x), (∇ϕ_d)(x)⟩ + ½ Tr(σ_d(x)[σ_d(x)]* (Hess ϕ_d)(x)) ≤ 12c³.    (153)

Analogously we get for all d ∈ N, x ∈ R^d, ε ∈ (0, 1] that

    ⟨(R(µ̃_{d,ε}))(x), (∇ϕ_d)(x)⟩ + ½ Tr(σ̂_{d,ε}(x)[σ̂_{d,ε}(x)]* (Hess ϕ_d)(x)) ≤ 12c³.    (154)
2
Let (Ω, F, P, (F_t)_{t∈[0,T]}) be a filtered probability space satisfying the usual conditions and for all d ∈ N let W^d : [0, T] × Ω → R^d be a standard (F_t)_{t∈[0,T]}-Brownian motion. Then [4, Corollary 3.2] (applied twice: the first application for all d ∈ N with d ↶ d, m ↶ d, T ↶ T, L ↶ c, ρ ↶ 12c³, O ↶ R^d, µ ↶ ([0, T] × R^d ∋ (t, x) ↦ µ_d(x) ∈ R^d), σ ↶ ([0, T] × R^d ∋ (t, x) ↦ σ_d(x) ∈ R^{d×d}), g ↶ g_d, f ↶ ([0, T] × R^d × R ∋ (t, x, v) ↦ f(v) ∈ R), V ↶ ϕ_d in the notation of [4, Corollary 3.2]; the second application for all d ∈ N, ε ∈ (0, 1] with d ↶ d, m ↶ d, T ↶ T, L ↶ c, ρ ↶ 12c³, O ↶ R^d, µ ↶ ([0, T] × R^d ∋ (t, x) ↦ (R(µ̃_{d,ε}))(x) ∈ R^d), σ ↶ ([0, T] × R^d ∋ (t, x) ↦ σ̂_{d,ε}(x) ∈ R^{d×d}), g ↶ R(g_{d,ε}), f ↶ R(f_ε), V ↶ ϕ_d in the notation of [4, Corollary 3.2]) guarantees that
(I) for every d ∈ N there exists a unique viscosity solution u_d ∈ {u ∈ C([0, T] × R^d, R) : lim sup_{r→∞} [sup_{s∈[0,T]} sup_{x∈R^d, ‖x‖>r} |u(s,x)|/ϕ_d(x)] = 0} of

    (∂/∂t u_d)(t, x) + ⟨µ_d(x), (∇_x u_d)(t, x)⟩ + ½ Tr(σ_d(x)[σ_d(x)]* (Hess_x u_d)(t, x)) = −f(u_d(t, x))    (155)

with u_d(T, x) = g_d(x) for t ∈ (0, T), x ∈ R^d,
(II) for every d ∈ N, x ∈ R^d, t ∈ [0, T], ε ∈ (0, 1] there exist up to indistinguishability unique (F_s)_{s∈[t,T]}-adapted stochastic processes X^{x,d}_t = (X^{x,d}_{t,s})_{s∈[t,T]} : [t, T] × Ω → R^d and X^{x,d,ε}_t = (X^{x,d,ε}_{t,s})_{s∈[t,T]} : [t, T] × Ω → R^d with continuous sample paths satisfying that for all s ∈ [t, T] we have P-a.s. that

    X^{x,d}_{t,s} = x + ∫_t^s µ_d(X^{x,d}_{t,r}) dr + ∫_t^s σ_d(X^{x,d}_{t,r}) dW^d_r,
    X^{x,d,ε}_{t,s} = x + ∫_t^s (R(µ̃_{d,ε}))(X^{x,d,ε}_{t,r}) dr + ∫_t^s σ̂_{d,ε}(X^{x,d,ε}_{t,r}) dW^d_r,    (156)
(III) for every d ∈ N, ε ∈ (0, 1] there exist unique v_d ∈ C([0, T] × R^d, R) and v_{d,ε} ∈ C([0, T] × R^d, R) which satisfy for all t ∈ [0, T], x ∈ R^d that

    sup_{s∈[0,T]} sup_{y∈R^d} [|v_d(s, y)|/(ϕ_d(y))^{1/2}] + sup_{s∈[0,T]} sup_{y∈R^d} [|v_{d,ε}(s, y)|/(ϕ_d(y))^{1/2}] + E[|(R(g_{d,ε}))(X^{x,d,ε}_{t,T})|]
    + E[|g_d(X^{x,d}_{t,T})|] + ∫_t^T E[|f(v_d(s, X^{x,d}_{t,s}))|] ds + ∫_t^T E[|(R(f_ε))(v_{d,ε}(s, X^{x,d,ε}_{t,s}))|] ds < ∞,    (157)

    v_d(t, x) = E[g_d(X^{x,d}_{t,T})] + ∫_t^T E[f(v_d(s, X^{x,d}_{t,s}))] ds,    (158)

    v_{d,ε}(t, x) = E[(R(g_{d,ε}))(X^{x,d,ε}_{t,T})] + ∫_t^T E[(R(f_ε))(v_{d,ε}(s, X^{x,d,ε}_{t,s}))] ds.    (159)
Item (II), the Lipschitz continuity of µ_d, σ_d, R(µ̃_{d,ε}), σ̂_{d,ε}, d ∈ N, ε ∈ (0, 1], and, e.g., [26, Lemma 13.6] ensure for all d ∈ N, x ∈ R^d, ε ∈ (0, 1], t ∈ [0, T], s ∈ [t, T], r ∈ [s, T] that

    P(X^{x,d}_{t,r} = X^{X^{x,d}_{t,s},d}_{s,r}) = P(X^{x,d,ε}_{t,r} = X^{X^{x,d,ε}_{t,s},d,ε}_{s,r}) = 1.    (160)
Next, item (III) ensures for all d ∈ N, x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that u_d(s, X^{x,d}_{t,s}) is integrable. This, (160), (144), (147), and Lemma 2.1(ii) (applied for all d ∈ N with d ↶ d, m ↶ d, c ↶ 2c, κ ↶ 1, p ↶ 2, ϕ ↶ ϕ_d, µ ↶ µ_d, σ ↶ σ_d in the notation of Lemma 2.1) prove for all d ∈ N, x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that

    E[|u_d(s, X^{x,d}_{t,s})|] ≤ E[|g_d(X^{X^{x,d}_{t,s},d}_{s,T})|] + ∫_s^T E[|f(u_d(r, X^{X^{x,d}_{t,s},d}_{s,r}))|] dr
    = E[|g_d(X^{x,d}_{t,T})|] + ∫_s^T E[|f(u_d(r, X^{x,d}_{t,r}))|] dr
    ≤ b E[(d^{2c} + ‖X^{x,d}_{t,T}‖²)^{1/2}] + (T − s)|f(0)| + c ∫_s^T E[|u_d(r, X^{x,d}_{t,r})|] dr    (161)
    ≤ b e^{6c³(T−t)} (d^{2c} + ‖x‖²)^{1/2} + b + BT + c ∫_s^T E[|u_d(r, X^{x,d}_{t,r})|] dr.
This and Gronwall's inequality ensure for all d ∈ N, x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that

    E[|u_d(s, X^{x,d}_{t,s})|] ≤ 2(b + BT) e^{7c³(T−t)} (d^{2c} + ‖x‖²)^{1/2}.    (162)
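The Gronwall step from (161) to (162) follows the standard backward pattern: a bound of the form u(s) ≤ a + c ∫_s^T u(r) dr yields u(s) ≤ a e^{c(T−s)}. A toy numerical illustration (the constants a, c, T below are hypothetical and unrelated to the paper's):

```python
import math

T, a, c, n = 1.0, 2.0, 0.7, 10_000
h = T / n

# build the extremal u solving u(s) = a + c * ∫_s^T u(r) dr on a backward grid
u = [0.0] * (n + 1)
u[n] = a
integral = 0.0
for i in range(n - 1, -1, -1):
    integral += h * u[i + 1]       # Riemann sum of ∫_{s_i}^{T} u(r) dr
    u[i] = a + c * integral

# the discrete solution equals a * (1 + c*h)^n, which sits below a * e^{cT}
assert a < u[0] <= a * math.exp(c * T)
```

The same mechanism, with a = b e^{6c³(T−t)}(d^{2c} + ‖x‖²)^{1/2} + b + BT, gives the constant in (162).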
In order to prove item (ii), we will combine Lemma 3.10 with Corollary 2.5, for which we still need a suitable MLP setting.
Let Θ = ∪_{n∈N} Z^n, let t^θ : Ω → [0, 1], θ ∈ Θ, be i.i.d. random variables, assume for all t ∈ (0, 1) that P(t^0 ≤ t) = t, let T^θ : [0, T] × Ω → [0, T] satisfy for all θ ∈ Θ, t ∈ [0, T] that T^θ_t = t + (T − t)t^θ, let W^{θ,d} : [0, T] × Ω → R^d, θ ∈ Θ, d ∈ N, be i.i.d. standard (F_t)_{t∈[0,T]}-Brownian motions, assume that (t^θ)_{θ∈Θ} and (W^{θ,d})_{θ∈Θ,d∈N} are independent, for every δ ∈ (0, 1], d, N ∈ N, θ ∈ Θ, x ∈ R^d, t ∈ [0, T] let
Y^{N,d,θ,x,δ}_t = (Y^{N,d,θ,x,δ}_{t,s})_{s∈[t,T]} : [t, T] × Ω → R^d satisfy for all n ∈ {0, 1, ..., N − 1}, s ∈ [nT/N, (n+1)T/N] ∩ [t, T] that Y^{N,d,θ,x,δ}_{t,t} = x and

    Y^{N,d,θ,x,δ}_{t,s} − Y^{N,d,θ,x,δ}_{t,max{t,nT/N}}
    = (R(µ̃_{d,δ}))(Y^{N,d,θ,x,δ}_{t,max{t,nT/N}}) (s − max{t, nT/N}) + σ̂_{d,δ}(Y^{N,d,θ,x,δ}_{t,max{t,nT/N}}) (W^{θ,d}_s − W^{θ,d}_{max{t,nT/N}}),    (165)
θ
let U^θ_{n,M,d,δ} : [0, T] × R^d × Ω → R, n, M ∈ Z, d ∈ N, θ ∈ Θ, δ ∈ (0, 1], satisfy for all d ∈ N, θ ∈ Θ, n ∈ N_0, t ∈ [0, T], x ∈ R^d, δ ∈ (0, 1] that

    U^θ_{n,M,d,δ}(t, x) = (1_N(n)/M^n) Σ_{i=1}^{M^n} (R(g_{d,δ}))(Y^{M^M,d,(θ,0,−i),x,δ}_{t,T})
    + Σ_{l=0}^{n−1} ((T − t)/M^{n−l}) Σ_{i=1}^{M^{n−l}} (R(f_δ) ◦ U^{(θ,l,i)}_{l,M,d,δ} − 1_N(l) R(f_δ) ◦ U^{(θ,−l,i)}_{l−1,M,d,δ})(T^{(θ,l,i)}_t, Y^{M^M,d,(θ,l,i),x,δ}_{t,T^{(θ,l,i)}_t}),    (166)
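The scheme (165)-(166) can be sketched in code. The following toy one-dimensional illustration replaces the network realizations R(µ̃_{d,δ}), σ̂_{d,δ}, R(g_{d,δ}), R(f_δ) by hand-picked functions (all of these choices are hypothetical), draws fresh randomness per call in place of the θ-indexed families (t^θ)_{θ∈Θ} and (W^{θ,d})_{θ∈Θ,d∈N}, and uses K = M^M Euler steps as in the application below:

```python
import math, random

random.seed(0)

T = 1.0
mu = lambda x: -0.5 * x        # stand-in for R(µ̃_{d,δ})
sig = lambda x: 1.0            # stand-in for σ̂_{d,δ}
g = lambda x: math.cos(x)      # stand-in for R(g_{d,δ})
f = lambda v: math.sin(v)      # stand-in for R(f_δ), Lipschitz

def euler(t, x, s, K):
    """Euler-Maruyama approximation of X_{t,s} started at x, K steps, cf. (165)."""
    h = (s - t) / K
    for _ in range(K):
        x += mu(x) * h + sig(x) * math.sqrt(h) * random.gauss(0.0, 1.0)
    return x

def U(n, M, t, x):
    """MLP approximation U_{n,M} of u(t, x), cf. (166)."""
    if n == 0:
        return 0.0
    K = M ** M
    # terminal-condition Monte Carlo term
    out = sum(g(euler(t, x, T, K)) for _ in range(M ** n)) / M ** n
    # multilevel correction terms, times sampled via R = t + (T - t) * t^θ
    for l in range(n):
        for _ in range(M ** (n - l)):
            r = t + (T - t) * random.random()
            y = euler(t, x, r, K)
            d = f(U(l, M, r, y)) - (f(U(l - 1, M, r, y)) if l > 0 else 0.0)
            out += (T - t) / M ** (n - l) * d
    return out

print(U(3, 2, 0.0, 0.0))
```

This is only a sketch of the recursion's structure; the convergence analysis in the paper of course depends on the precise coupling and integrability assumptions, not on this toy setup.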
and for all d ∈ N, ε ∈ (0, 1] let δ_{d,ε} = ε/(4Bd^p c_d). Observe that for all γ ∈ (0, 1] it holds that

    C_γ ≤ sup_{n∈N∩[2,∞)} [(3n)^{3n+1} ((e^{(4cT+1/2)(n−1)} + 1)/(n−1)^{(n−1)/2})^{6+γ}]
    ≤ sup_{n∈N∩[2,∞)} [(3n)^{3n+1} (e^{(4cT+3/2)(n−1)}/(n−1)^{(n−1)/2})^{6+γ}]
    = sup_{n∈N∩[2,∞)} [3^{3n+1} n^4 e^{(4cT+3/2)(n−1)(6+γ)}/(n−1)^{γ(n−1)/2} · (n/(n−1))^{3(n−1)}]    (171)
    ≤ [sup_{n∈N∩[2,∞)} 3^{3n+1} n^4 e^{(4cT+3/2)(6+γ)(n−1)}/(n−1)^{γ(n−1)/2}] [sup_{n∈N∩[2,∞)} (n/(n−1))^{3(n−1)}]
    < ∞.
Note that the definition of (c_d)_{d∈N} implies that there exists a c̃ ∈ [1, ∞) such that for all d ∈ N it holds that

    c_d = c̃ d^{(c+p)(q+1)}.    (172)
Corollary 2.5 (applied for every d, M ∈ N, n ∈ N_0, x ∈ R^d, δ ∈ (0, 1] with δ ↶ δBd^p, c ↶ 2c, b ↶ b + BT, p ↶ 2, q ↶ q, ϕ ↶ ϕ_d, f_1 ↶ ([0, T] × R^d × R ∋ (t, x, v) ↦ f(v) ∈ R), f_2 ↶ ([0, T] × R^d × R ∋ (t, x, v) ↦ (R(f_δ))(v) ∈ R), g_1 ↶ g_d, g_2 ↶ R(g_{d,δ}), µ_1 ↶ µ_d, µ_2 ↶ R(µ̃_{d,δ}), σ_1 ↶ σ_d, σ_2 ↶ σ̂_{d,δ}, u_1 ↶ u_d in the notation of Corollary 2.5) shows that for all d, M ∈ N, n ∈ N_0, x ∈ R^d, δ ∈ (0, 1] it holds that

    (E[|U^0_{n,M,d,δ}(0, x) − u_d(0, x)|²])^{1/2} ≤ 4^{q+1} b^q c² (T + 1) e^{q(32qc⁴+24c³)T + (4cT)^q + (2c+1)²}
    · (d^{2c} + ‖x‖²)^{(q+1)/2} [δBd^p + e^{4ncT + M/2}/M^{n/2} + 1/M^{M/2}].    (173)
This, the definitions of (N_{d,ε})_{d∈N,ε∈(0,1]} and (δ_{d,ε})_{d∈N,ε∈(0,1]}, and Fubini's theorem show for all d ∈ N, ε ∈ (0, 1] that

    ∫_{R^d} E[|U^0_{N_{d,ε},N_{d,ε},d,δ_{d,ε}}(0, x) − u_d(0, x)|²] ν_d(dx)
    = E[∫_{R^d} |U^0_{N_{d,ε},N_{d,ε},d,δ_{d,ε}}(0, x) − u_d(0, x)|² ν_d(dx)] ≤ ε²/4 + ε²/2 ≤ ε².    (175)
This shows that for all d ∈ N, ε ∈ (0, 1] there exists an ω_{d,ε} ∈ Ω such that

    ∫_{R^d} |U^0_{N_{d,ε},N_{d,ε},d,δ_{d,ε}}(0, x, ω_{d,ε}) − u_d(0, x)|² ν_d(dx) ≤ ε².    (176)
This and Lemma 3.10 (applied for all d ∈ N, v ∈ R^d, ε ∈ (0, 1] with c ↶ k_{d,δ_{d,ε}}, d ↶ d, K ↶ (N_{d,ε})^{N_{d,ε}}, M ↶ N_{d,ε}, T ↶ T, Φ^µ ↶ µ̃_{d,δ_{d,ε}}, Φ^{σ,v} ↶ σ̃_{d,δ_{d,ε},v}, Φ^f ↶ f_{δ_{d,ε}}, Φ^g ↶ g_{d,δ_{d,ε}} in the notation of Lemma 3.10) imply that for all d ∈ N, ε ∈ (0, 1] there exists a deep neural network Ψ_{d,ε} ∈ N such that for all x ∈ R^d it holds that R(Ψ_{d,ε}) ∈ C(R^d, R), (R(Ψ_{d,ε}))(x) = U^0_{N_{d,ε},N_{d,ε},d,δ_{d,ε}}(0, x, ω_{d,ε}), |||D(Ψ_{d,ε})||| ≤ k_{d,δ_{d,ε}} (3N_{d,ε})^{N_{d,ε}}, and
    dim(D(Ψ_{d,ε})) ≤ (N_{d,ε} + 1)(N_{d,ε})^{N_{d,ε}} [max{dim(D(µ̃_{d,δ_{d,ε}})), dim(D(σ̃_{d,δ_{d,ε},0}))} − 1] + N_{d,ε} + dim(D(g_{d,δ_{d,ε}})) − 1
    ≤ N_{d,ε} + Bd^p (δ_{d,ε})^{−β} + (N_{d,ε} + 1)(N_{d,ε})^{N_{d,ε}} (Bd^p (δ_{d,ε})^{−β} − 1) − 1    (177)
    ≤ N_{d,ε} Bd^p (δ_{d,ε})^{−β} + (N_{d,ε} + 1)^{N_{d,ε}+1} Bd^p (δ_{d,ε})^{−β}
    ≤ 2(N_{d,ε} + 1)^{N_{d,ε}+1} Bd^p (δ_{d,ε})^{−β}.
Hence we get for all d ∈ N, ε ∈ (0, 1] that

    P(Ψ_{d,ε}) ≤ Σ_{j=1}^{dim(D(Ψ_{d,ε}))} k_{d,δ_{d,ε}} (3N_{d,ε})^{N_{d,ε}} (k_{d,δ_{d,ε}} (3N_{d,ε})^{N_{d,ε}} + 1)
    = dim(D(Ψ_{d,ε})) [(k_{d,δ_{d,ε}})² (3N_{d,ε})^{2N_{d,ε}} + k_{d,δ_{d,ε}} (3N_{d,ε})^{N_{d,ε}}]    (178)
    ≤ 2(N_{d,ε} + 1)^{N_{d,ε}+1} Bd^p (δ_{d,ε})^{−β} · 2(k_{d,δ_{d,ε}})² (3N_{d,ε})^{2N_{d,ε}}
    ≤ 4Bd^p (δ_{d,ε})^{−β} (k_{d,δ_{d,ε}})² (3N_{d,ε})^{3N_{d,ε}+1}.
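The first line of (178) is the usual layerwise parameter count: for an architecture D = (D_0, ..., D_L) one has P(Φ) = Σ_{j=1}^{L} D_j(D_{j−1} + 1) (a standard convention assumed here), and each summand is at most |||D|||(|||D||| + 1). A minimal sketch with a hypothetical tuple:

```python
def params(D):
    """Weights D_j × D_{j-1} plus biases D_j for each affine layer j."""
    return sum(D[j] * (D[j - 1] + 1) for j in range(1, len(D)))

D = (3, 10, 7, 1)          # hypothetical architecture
w = max(D)                  # plays the role of |||D(Ψ)|||
assert params(D) == 125
assert params(D) <= len(D) * w * (w + 1)   # the coarse bound mirrored in (178)
```

Summing the crude per-layer bound over the depth is exactly how the product dim(D(Ψ_{d,ε})) · k²(3N)^{2N} arises above.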
Note that the definition of (k_{d,ε})_{d∈N,ε∈(0,1]} implies for all d ∈ N, ε ∈ (0, 1] that

    k_{d,ε} = max{|||D(f_ε)|||, |||D(g_{d,ε})|||, |||D(µ̃_{d,ε})|||, |||D(σ̃_{d,ε,0})|||, 2}
    ≤ max{Bε^{−2}, Bd^p ε^{−α}, 2}    (179)
    ≤ max{Bε^{−α}, Bd^p ε^{−α}, 2} ≤ Bd^p ε^{−α}.
This shows that for all d ∈ N, ε ∈ (0, 1] it holds that

    P(Ψ_{d,ε}) ≤ 4Bd^p (δ_{d,ε})^{−β} (k_{d,δ_{d,ε}})² (3N_{d,ε})^{3N_{d,ε}+1} ≤ 4B³ d^{3p} (δ_{d,ε})^{−2α−β} (3N_{d,ε})^{3N_{d,ε}+1}.    (180)
Next, note that the definition of (N_{d,ε})_{d∈N,ε∈(0,1]} implies that for all d ∈ N, ε ∈ (0, 1] it holds that

    ε ≤ 2c_d [exp((4cT + 1/2)(N_{d,ε} − 1)) + 1]/(N_{d,ε} − 1)^{(N_{d,ε}−1)/2}.    (181)
This and the definition of (C_γ)_{γ∈(0,1]} imply for all d ∈ N, γ, ε ∈ (0, 1] that

    P(Ψ_{d,ε}) ≤ 4B³ d^{3p} (δ_{d,ε})^{−2α−β} (3N_{d,ε})^{3N_{d,ε}+1} ε^{6+γ} ε^{−6−γ}
    ≤ 4B³ d^{3p} (δ_{d,ε})^{−2α−β} ε^{−6−γ} (3N_{d,ε})^{3N_{d,ε}+1} (2c_d [exp((4cT + 1/2)(N_{d,ε} − 1)) + 1]/(N_{d,ε} − 1)^{(N_{d,ε}−1)/2})^{6+γ}    (182)
    ≤ 2^{8+γ} B³ d^{3p} (δ_{d,ε})^{−2α−β} ε^{−6−γ} (c_d)^{6+γ} sup_{n∈N∩[2,∞)} [(3n)^{3n+1} ((e^{(4cT+1/2)(n−1)} + 1)/(n−1)^{(n−1)/2})^{6+γ}]
4.2 Deep neural network approximations with positive polynomial convergence rates

This corollary is an application of Theorem 4.1, where for all d ∈ N we set the probability measure ν_d to be the d-dimensional Lebesgue measure restricted to the d-dimensional unit cube [0, 1]^d.
Corollary 4.2. Assume Setting 3.1, let c, T ∈ (0, ∞), let ‖·‖_F : (∪_{d∈N} R^{d×d}) → [0, ∞) satisfy for all d ∈ N, A = (a_{i,j})_{i,j∈{1,...,d}} ∈ R^{d×d} that ‖A‖_F = (Σ_{i,j=1}^{d} (a_{i,j})²)^{1/2}, let ⟨·, ·⟩ : (∪_{d∈N} R^d × R^d) → R satisfy for all d ∈ N, x = (x_1, ..., x_d), y = (y_1, ..., y_d) ∈ R^d that ⟨x, y⟩ = Σ_{i=1}^{d} x_i y_i, let f ∈ C(R, R), for all d ∈ N let g_d ∈ C(R^d, R), u_d ∈ C^{1,2}([0, T] × R^d, R), µ_d = (µ_{d,i})_{i∈{1,...,d}} ∈ C(R^d, R^d), σ_d = (σ_{d,i,j})_{i,j∈{1,...,d}} ∈ C(R^d, R^{d×d}) satisfy for all t ∈ [0, T], x = (x_1, ..., x_d), y = (y_1, ..., y_d) ∈ R^d that u_d(T, x) = g_d(x),

    |f(x_1) − f(y_1)| ≤ c|x_1 − y_1|,    |u_d(t, x)|² ≤ c[d^c + ‖x‖²],    (183)

    (∂/∂t u_d)(t, x) + ⟨(∇_x u_d)(t, x), µ_d(x)⟩ + ½ Tr(σ_d(x)[σ_d(x)]* (Hess_x u_d)(t, x)) + f(u_d(t, x)) = 0,    (184)

for all d ∈ N, v ∈ R^d, ε ∈ (0, 1] let σ̂_{d,ε} ∈ C(R^d, R^{d×d}), g_{d,ε}, µ̃_{d,ε}, σ̃_{d,ε,v} ∈ N satisfying for all d ∈ N, x, y, v ∈ R^d, ε ∈ (0, 1] that R(g_{d,ε}) ∈ C(R^d, R), R(µ̃_{d,ε}) ∈ C(R^d, R^d), R(σ̃_{d,ε,v}) ∈ C(R^d, R^d), R(σ̃_{d,ε,v}) = σ̂_{d,ε}(·)v, D(σ̃_{d,ε,v}) = D(σ̃_{d,ε,0}),

    |(R(g_{d,ε}))(x)|² + max_{i,j∈{1,...,d}} (|[(R(µ̃_{d,ε}))(0)]_i| + |(σ̂_{d,ε}(0))_{i,j}|) ≤ c[d^c + ‖x‖²],    (185)

    max{|g_d(x) − (R(g_{d,ε}))(x)|, ‖µ_d(x) − (R(µ̃_{d,ε}))(x)‖, ‖σ_d(x) − σ̂_{d,ε}(x)‖_F} ≤ εcd^c (1 + ‖x‖^c),    (186)

    max{|(R(g_{d,ε}))(x) − (R(g_{d,ε}))(y)|, ‖(R(µ̃_{d,ε}))(x) − (R(µ̃_{d,ε}))(y)‖, ‖σ̂_{d,ε}(x) − σ̂_{d,ε}(y)‖_F} ≤ c‖x − y‖,    (187)

    max{P(g_{d,ε}), P(µ̃_{d,ε}), P(σ̃_{d,ε,v})} ≤ cd^c ε^{−c}.    (188)
Then there exist (Ψ_{d,ε})_{d∈N,ε∈(0,1]} ⊆ N and η ∈ (0, ∞) such that for all d ∈ N, ε ∈ (0, 1] it holds that R(Ψ_{d,ε}) ∈ C(R^d, R), P(Ψ_{d,ε}) ≤ ηd^η ε^{−η}, and

    [∫_{[0,1]^d} |u_d(0, x) − (R(Ψ_{d,ε}))(x)|² dx]^{1/2} ≤ ε.    (189)
Proof. Throughout the proof assume without loss of generality that c ∈ [2, ∞) (otherwise adapt the application of Theorem 4.1 below accordingly) and let B = 2c.
Next, (185) ensures for all d ∈ N, x ∈ R^d, ε ∈ (0, 1] that

    |(R(g_{d,ε}))(x)| ≤ √c (d^c + ‖x‖²)^{1/2} ≤ (√c + √c T)(d^{2c} + ‖x‖²)^{1/2},    (193)

    ‖(R(µ̃_{d,ε}))(0)‖ ≤ [d² max_{i∈{1,...,d}} |[(R(µ̃_{d,ε}))(0)]_i|²]^{1/2} ≤ cd^{c+1},    (194)

and that

    ‖σ̂_{d,ε}(0)‖_F ≤ [d² max_{i,j∈{1,...,d}} |(σ̂_{d,ε}(0))_{i,j}|²]^{1/2} ≤ cd^{c+1}.    (195)
References
[1] Ankudinova, J., and Ehrhardt, M. On the numerical solution of nonlinear Black–Scholes
equations. Computers & Mathematics with Applications 56, 3 (2008), 799–812. Mathematical Models
in Life Sciences & Engineering.
[2] Beck, C., Becker, S., Cheridito, P., Jentzen, A., and Neufeld, A. Deep splitting method
for parabolic PDEs, 2021, arXiv: 1907.03452.
[3] Beck, C., E, W., and Jentzen, A. Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. Journal of Nonlinear Science 29, 4 (Jan 2019), 1563–1619.
[4] Beck, C., Hutzenthaler, M., and Jentzen, A. On nonlinear Feynman–Kac formulas for
viscosity solutions of semilinear parabolic partial differential equations. Stochastics and Dynamics
(Jul 2021), 2150048.
[5] Beck, C., Hutzenthaler, M., Jentzen, A., and Kuckuck, B. An overview on deep learning-based approximation methods for partial differential equations, 2021, arXiv: 2012.12348.
[6] Beck, C., Hutzenthaler, M., Jentzen, A., and Magnani, E. Full history recursive multilevel
Picard approximations for ordinary differential equations with expectations, 2021, arXiv: 2103.02350.
[7] Bergman, Y. Option pricing with differential interest rates. Review of Financial Studies 8 (04
1995), 475–500.
[8] Berner, J., Grohs, P., and Jentzen, A. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM Journal on Mathematics of Data Science 2, 3 (Jan 2020), 631–657.
[9] Cox, S., Hutzenthaler, M., and Jentzen, A. Local Lipschitz continuity in the initial value and strong completeness for nonlinear stochastic differential equations, 2021, arXiv: 1309.5595.
[10] Crandall, M., Ishii, H., and Lions, P.-L. User’s guide to viscosity solutions of second order
partial differential equations. Bulletin of the American Mathematical Society 27 (07 1992).
[11] E, W., Han, J., and Jentzen, A. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics 5, 4 (Nov 2017), 349–380.
[12] E, W., Hutzenthaler, M., Jentzen, A., and Kruse, T. Multilevel Picard iterations for solving
smooth semilinear parabolic heat equations, 2019, arXiv: 1607.03295.
[13] E, W., Hutzenthaler, M., Jentzen, A., and Kruse, T. On multilevel Picard numerical approximations for high-dimensional nonlinear parabolic partial differential equations and high-dimensional nonlinear backward stochastic differential equations. Journal of Scientific Computing 79, 3 (Mar 2019), 1534–1571.
[14] Fehrman, B., Gess, B., and Jentzen, A. Convergence rates for the stochastic gradient descent
method for non-convex objective functions, 2019, arXiv: 1904.01517.
[15] Grohs, P., and Herrmann, L. Deep neural network approximation for high-dimensional elliptic
PDEs with boundary conditions, 2020, arXiv: 2007.05384.
[16] Grohs, P., and Herrmann, L. Deep neural network approximation for high-dimensional parabolic
Hamilton-Jacobi-Bellman equations, 2021, arXiv: 2103.05744.
[17] Grohs, P., Hornung, F., Jentzen, A., and von Wurstemberger, P. A proof that artificial
neural networks overcome the curse of dimensionality in the numerical approximation of Black-
Scholes partial differential equations, 2018, arXiv: 1809.02362.
[18] Grohs, P., Hornung, F., Jentzen, A., and Zimmermann, P. Space-time error estimates for
deep neural network approximations for differential equations, 2019, arXiv: 1908.03833.
[19] Han, J., Jentzen, A., and E, W. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115, 34 (2018), 8505–8510, https://www.pnas.org/content/115/34/8505.full.pdf.
[20] Hutzenthaler, M., and Jentzen, A. On a perturbation theory and on strong convergence rates
for stochastic ordinary and partial differential equations with nonglobally monotone coefficients. The
Annals of Probability 48, 1 (Jan 2020).
[21] Hutzenthaler, M., Jentzen, A., Kruse, T., Anh Nguyen, T., and von Wurstemberger, P. Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 476, 2244 (Dec 2020), 20190630.
[22] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. Multilevel Picard approximations for high-dimensional semilinear second-order PDEs with Lipschitz nonlinearities, 2020, arXiv: 2009.02484.
[23] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. A proof that rectified deep
neural networks overcome the curse of dimensionality in the numerical approximation of semilinear
heat equations. SN Partial Differential Equations and Applications 1, 2 (Apr 2020).
[24] Jentzen, A., Salimova, D., and Welti, T. A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients, 2019, arXiv: 1809.07321.
[25] Reisinger, C., and Zhang, Y. Rectified deep neural networks overcome the curse of dimensionality
for nonsmooth value functions in zero-sum games of nonlinear stiff systems. Analysis and Applications
18 (05 2020).
[26] Rogers, L. C. G., and Williams, D. Diffusions, Markov Processes and Martingales, Volume 2: Itô Calculus, 2nd ed. Cambridge Mathematical Library. Cambridge University Press, Cambridge, 2000.
[27] Takahashi, A., and Yamada, T. Asymptotic Expansion and Deep Neural Networks Overcome the Curse of Dimensionality in the Numerical Approximation of Kolmogorov Partial Differential Equations with Nonlinear Coefficients. CIRJE F-Series CIRJE-F-1167, CIRJE, Faculty of Economics, University of Tokyo, May 2021.
[28] Walter, W. Ordinary Differential Equations. Graduate Texts in Mathematics. Springer, 1998.