2205.14398 Deep Neural Networks Overcome The Curse
1 Institute of Mathematics, University of Kassel, cioica-licht@mathematik.uni-kassel.de
2 Faculty of Mathematics, University of Duisburg-Essen, martin.hutzenthaler@uni-due.de
3 Institute of Mathematics, University of Kassel, twerner@mathematik.uni-kassel.de
Abstract
We prove that deep neural networks are capable of approximating solutions of semilinear Kolmogorov
PDEs in the case of gradient-independent, Lipschitz-continuous nonlinearities, while the required
number of parameters in the networks grows at most polynomially in both the PDE dimension d ∈ N and
the reciprocal of the prescribed accuracy ε. Previously, this had only been proven in the case of semilinear
heat equations.
Contents
1 Introduction
1.1 An Introduction to MLP-approximations
1 Introduction
Deep-learning-based algorithms are employed in a wide range of real-world applications and are nowadays
the standard approach for most machine-learning problems. They are used extensively in face and
speech recognition, fraud detection, function approximation, and the numerical solution of partial differential equations
(PDEs). The latter are utilized to model numerous phenomena in nature, medicine, economics, and
physics. Often, these PDEs are nonlinear and high-dimensional. For instance, in the famous Black-Scholes
model the PDE dimension d ∈ N corresponds to the number of stocks considered in the model. Relaxing
the rather unrealistic assumptions made in the model results in a loss of linearity in the corresponding
Black-Scholes PDE, cf., e.g., [1, 7]. These PDEs, however, cannot in general be solved explicitly and
therefore need to be approximated numerically. Classical grid-based approaches such as finite difference
or finite element methods suffer from the curse of dimensionality in the sense that the computational
effort grows exponentially in the PDE dimension d ∈ N and are therefore not suitable. Among
the most prominent and promising approaches for approximating high-dimensional PDEs are deep-learning
algorithms, which seem to handle high-dimensional problems well, cf., e.g., [2, 3, 5, 11, 19]. However,
compared to the vast range of applications these techniques are successfully applied to nowadays, there
exist only a few theoretical results proving that deep neural networks do not suffer from the curse of
dimensionality in the numerical approximation of partial differential equations, see [8, 15-17, 23-25, 27].
The key contribution of this article is a rigorous proof that deep neural networks do overcome the
curse of dimensionality in the numerical approximation of semilinear partial differential equations, where
the nonlinearity, the coefficients determining the linear part, and the initial condition of the PDEs are
Lipschitz continuous. In particular, Theorem 1.1 below proves that the number of parameters in the
approximating deep neural network grows at most polynomially in both the reciprocal of the prescribed
approximation accuracy ε ∈ (0, 1] and the PDE dimension d ∈ N:
Theorem 1.1. Let ‖·‖ : (∪_{d∈N} R^d) → [0, ∞) and A_d : R^d → R^d, d ∈ N, satisfy for all d ∈ N, x = (x_1, ..., x_d) ∈ R^d that ‖x‖ = (∑_{i=1}^d (x_i)²)^{1/2} and A_d(x) = (max{x_1, 0}, ..., max{x_d, 0}), let ‖·‖_F : (∪_{d∈N} R^{d×d}) → [0, ∞) satisfy for all d ∈ N, A = (a_{i,j})_{i,j∈{1,...,d}} ∈ R^{d×d} that ‖A‖_F = (∑_{i,j=1}^d (a_{i,j})²)^{1/2}, let ⟨·,·⟩ : (∪_{d∈N} R^d × R^d) → R satisfy for all d ∈ N, x = (x_1, ..., x_d), y = (y_1, ..., y_d) ∈ R^d that ⟨x, y⟩ = ∑_{i=1}^d x_i y_i, let N = ∪_{H∈N} ∪_{(k_0,...,k_{H+1})∈N^{H+2}} [∏_{n=1}^{H+1} (R^{k_n×k_{n−1}} × R^{k_n})], let R : N → (∪_{k,l∈N} C(R^k, R^l)), D : N → ∪_{H∈N} N^{H+2}, and P : N → N satisfy for all H ∈ N, k_0, k_1, ..., k_H, k_{H+1} ∈ N, Φ = ((W_1, B_1), ..., (W_{H+1}, B_{H+1})) ∈ ∏_{n=1}^{H+1} (R^{k_n×k_{n−1}} × R^{k_n}), x_0 ∈ R^{k_0}, ..., x_H ∈ R^{k_H} with ∀ n ∈ N ∩ [1, H] : x_n = A_{k_n}(W_n x_{n−1} + B_n) that R(Φ) ∈ C(R^{k_0}, R^{k_{H+1}}), (R(Φ))(x_0) = W_{H+1} x_H + B_{H+1}, D(Φ) = (k_0, k_1, ..., k_{H+1}), and P(Φ) = ∑_{n=1}^{H+1} k_n(k_{n−1} + 1), let c, T ∈ (0, ∞), f ∈ C(R, R), for all d ∈ N let g_d ∈ C(R^d, R), u_d ∈ C^{1,2}([0, T] × R^d, R), μ_d = (μ_{d,i})_{i∈{1,...,d}} ∈ C(R^d, R^d), σ_d = (σ_{d,i,j})_{i,j∈{1,...,d}} ∈ C(R^d, R^{d×d}) satisfy for all t ∈ [0, T], x = (x_1, ..., x_d), y = (y_1, ..., y_d) ∈ R^d that u_d(0, x) = g_d(x),
\[ |f(x_1) - f(y_1)| \le c|x_1 - y_1|, \qquad |u_d(t,x)|^2 \le c\big[d^c + \|x\|^2\big], \tag{3} \]
\[ \tfrac{\partial}{\partial t} u_d(t,x) = \big\langle (\nabla_x u_d)(t,x), \mu_d(x)\big\rangle + \tfrac{1}{2}\operatorname{Tr}\big(\sigma_d(x)[\sigma_d(x)]^{*} (\operatorname{Hess}_x u_d)(t,x)\big) + f(u_d(t,x)), \tag{4} \]
for all d ∈ N, v ∈ R^d, ε ∈ (0, 1] let σ̂_{d,ε} ∈ C(R^d, R^{d×d}), g_{d,ε}, μ̃_{d,ε}, σ̃_{d,ε,v} ∈ N satisfy for all d ∈ N, x, y, v ∈ R^d, ε ∈ (0, 1] that R(g_{d,ε}) ∈ C(R^d, R), R(μ̃_{d,ε}) ∈ C(R^d, R^d), R(σ̃_{d,ε,v}) ∈ C(R^d, R^d), (R(σ̃_{d,ε,v}))(x) = σ̂_{d,ε}(x)v, D(σ̃_{d,ε,v}) = D(σ̃_{d,ε,0}),
\[ |(\mathrm{R}(g_{d,\varepsilon}))(x)|^2 + \max_{i,j\in\{1,\dots,d\}}\big(|[(\mathrm{R}(\tilde{\mu}_{d,\varepsilon}))(0)]_i| + |(\hat{\sigma}_{d,\varepsilon}(0))_{i,j}|\big)^2 \le c\big[d^c + \|x\|^2\big], \tag{5} \]
\[ \max\big\{|g_d(x) - (\mathrm{R}(g_{d,\varepsilon}))(x)|,\ \|\mu_d(x) - (\mathrm{R}(\tilde{\mu}_{d,\varepsilon}))(x)\|,\ \|\sigma_d(x) - \hat{\sigma}_{d,\varepsilon}(x)\|_F\big\} \le \varepsilon c d^c (1 + \|x\|^c), \tag{6} \]
\[ \max\big\{|(\mathrm{R}(g_{d,\varepsilon}))(x) - (\mathrm{R}(g_{d,\varepsilon}))(y)|,\ \|(\mathrm{R}(\tilde{\mu}_{d,\varepsilon}))(x) - (\mathrm{R}(\tilde{\mu}_{d,\varepsilon}))(y)\|,\ \|\hat{\sigma}_{d,\varepsilon}(x) - \hat{\sigma}_{d,\varepsilon}(y)\|_F\big\} \le c\|x - y\|, \tag{7} \]
\[ \max\big\{\mathrm{P}(g_{d,\varepsilon}),\ \mathrm{P}(\tilde{\mu}_{d,\varepsilon}),\ \mathrm{P}(\tilde{\sigma}_{d,\varepsilon,v})\big\} \le c d^c \varepsilon^{-c}. \tag{8} \]
Then there exists η ∈ (0, ∞) such that for all d ∈ N, ε ∈ (0, 1] there exists a Ψ_{d,ε} ∈ N satisfying R(Ψ_{d,ε}) ∈ C(R^d, R), P(Ψ_{d,ε}) ≤ η d^η ε^{−η}, and
\[ \bigg[\int_{[0,1]^d} |u_d(T,x) - (\mathrm{R}(\Psi_{d,\varepsilon}))(x)|^2\,dx\bigg]^{1/2} \le \varepsilon. \tag{9} \]
Theorem 1.1 follows from Corollary 4.2 (applied with u_d(t, x) ← u_d(T − t, x) for t ∈ [0, T], x ∈ R^d in the
notation of Corollary 4.2), which in turn is a special case of Theorem 4.1, the main result of this article.
Let us comment on the mathematical objects appearing in Theorem 1.1. By N in Theorem 1.1 above
we denote the set of all artificial feed-forward deep neural networks (DNNs), by R in Theorem 1.1 above
we denote the operator that maps each DNN to its corresponding function, by P in Theorem 1.1 above
we denote the function that maps a DNN to its number of parameters. For all d ∈ N, the function
A_d : R^d → R^d in Theorem 1.1 above refers to the componentwise applied rectified linear unit (ReLU)
activation function. The functions μ_d : R^d → R^d and σ_d : R^d → R^{d×d} in Theorem 1.1 above specify the
linear part of the PDEs in (4). The functions gd : Rd → R, d ∈ N, describe the initial condition, while the
function f : R → R specifies the nonlinearity of the PDEs in (4) above. The real number T ∈ (0, ∞) de-
scribes the time horizon of the PDEs in (4) in Theorem 1.1 above. The real number c ∈ (0, ∞) in Theorem
1.1 above is used to formulate a growth condition and a Lipschitz continuity condition of the functions
ud : Rd → R, R(µ̃d,ε ) : Rd → Rd , σ̂d,ε : Rd → Rd×d , R(gd,ε ) : Rd → R, d ∈ N, ε ∈ (0, 1], and f : R → R.
Furthermore, the real number c ∈ (0, ∞) in Theorem 1.1 above formulates an approximation condition
between the functions μ_d, σ_d, g_d and R(μ̃_{d,ε}), σ̂_{d,ε}, R(g_{d,ε}), d ∈ N, ε ∈ (0, 1], as well as an upper
bound on the number of parameters of the DNNs g_{d,ε}, μ̃_{d,ε}, σ̃_{d,ε,v}, d ∈ N, v ∈ R^d, ε ∈ (0, 1].
These assumptions guarantee the existence of unique viscosity solutions of the PDEs in (4) in Theorem 1.1
above (see Theorem 4.1 and cf., e.g., [10] for the notion of viscosity solution) and ensure that the functions
μ_d, σ_d, g_d, d ∈ N, can be approximated by DNNs without the curse of dimensionality. The
latter is the key assumption in Theorem 1.1 above; very roughly speaking, it reduces Theorem 1.1 to the
statement that if DNNs are capable of approximating the initial condition and the linear part of the
PDEs in (4) without the curse of dimensionality, then they are also capable of approximating their
solutions without the curse of dimensionality.
Note that the statement of Theorem 1.1 is purely deterministic. The proof of Theorem 4.1, however,
heavily relies on probabilistic tools. It is influenced by the method employed in [23] and is built on the
theory of full history recursive multilevel Picard approximations (MLP), which are numerical approxi-
mation schemes that have been proven to overcome the curse of dimensionality in numerous settings, cf.,
e.g., [6, 12, 13, 21, 22]. In particular, in [22] an MLP approximation scheme is introduced that overcomes
the curse of dimensionality in the numerical approximation of the semilinear PDEs in (4). The central
idea of the proof is, roughly speaking, that these MLP approximations can be well represented by DNNs,
if the coefficients determining the linear part, the initial condition, and the nonlinearity are themselves
represented by DNNs (cf. Lemma 3.10). This explains why we need to assume that the coefficients of the
PDEs in (4) can be approximated by means of DNNs without the curse of dimensionality. In Lemma 2.4
below we establish strong approximation error bounds between solutions of the PDEs in (4) and solutions
of the PDEs whose coefficients are specified by corresponding DNN approximations. In Lemma 2.5 below
we construct an artificial probability space and employ a result from [22] to estimate strong error bounds
for the numerical approximation of the approximating PDEs by means of MLP approximations on that
space. In our main result, Theorem 4.1 below, we combine these results and prove the existence of a DNN
realization on the artificial probability space that achieves the prescribed approximation accuracy ε ∈ (0, 1]
between the solution of the PDEs in (4) and the constructed DNN, which eventually completes the proof
of Theorem 1.1 above.
The remainder of this article is organized as follows: In Section 2 we establish a full error bound for
MLP approximations in which every involved function is approximated by means of DNNs. Section
3 provides a mathematical framework for deep neural networks and establishes the connection between
DNNs and MLP approximations, see Lemma 3.10. In Section 4 we combine the findings of Sections 2 and
3 to prove our main result, Theorem 4.1, as well as Corollary 4.2.
and
\[ u(t,x) = \mathbb{E}\big[g(X_{t,T}^{x})\big] + \int_t^T \mathbb{E}\big[f(u(r, X_{t,r}^{x}))\big]\,dr. \tag{12} \]
The reverse direction does not hold in general, because the function in (12) need not be differentiable;
this would require sufficient smoothness of μ, σ, f, and g. Fortunately, one can prove
under more relaxed conditions that (12) yields a unique, so-called viscosity solution of the PDE in (10),
which is a generalized notion of classical solutions. For an introduction to and a review of the theory of
viscosity solutions we refer to [10]. In particular, every classical solution of (10) is also a viscosity solution
of (10); hence we get a one-to-one correspondence between both representations. Having established this
connection, we may want to employ a numerical approximation of (12) by means of Monte Carlo methods.
The second term on the right-hand side of (12), however, is problematic since we need to know the very
thing that we wish to approximate, u. A similar problem arises in the field of ordinary differential
1 Note that we say that a filtered probability space (Ω, F, P, (F_t)_{t∈[0,T]}) satisfies the usual conditions if and only if it
holds for all t ∈ [0, T) that {A ∈ F : P(A) = 0} ⊆ F_t = (∩_{s∈(t,T]} F_s).
equations: let y : [0, T] → R satisfy y(0) = y_0 and for all t ∈ (0, T) that y′(t) = f(y(t)), where f ∈ C(R, R).
This can be reformulated as
\[ y(t) = y_0 + \int_0^t f(y(s))\,ds. \tag{13} \]
If we assume the function f to satisfy a (local) Lipschitz condition, it is ensured that for sufficiently small
t ∈ (0, T], the Picard iterations
\[ y^{[0]}(t) = y_0, \qquad y^{[n+1]}(t) = y_0 + \int_0^t f(y^{[n]}(s))\,ds, \quad n \in \mathbb{N}_0, \tag{14} \]
converge to a unique solution of (13), cf., e.g., [28, §6]. Following this idea, we reformulate (12) as a
stochastic fixed point equation:
\[ \Phi_0(t,x) = \mathbb{E}\big[g(X_{t,T}^{x})\big], \qquad \Phi_{n+1}(t,x) = \mathbb{E}\big[g(X_{t,T}^{x})\big] + \int_t^T \mathbb{E}\big[f(\Phi_n(s, X_{t,s}^{x}))\big]\,ds, \quad n \in \mathbb{N}_0. \tag{15} \]
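To make (13) and (14) concrete, the Picard iteration can be sketched numerically; the left-endpoint quadrature rule and the test problem f(y) = y with y_0 = 1 (exact solution y(t) = e^t) are our own illustrative choices, not taken from the article:

```python
import math

def picard_iterates(f, y0, t, n_iter, n_grid=1000):
    """Approximate the Picard iterates y^[n] of y(t) = y0 + int_0^t f(y(s)) ds
    on a uniform grid, using the left-endpoint Riemann rule for the integral."""
    h = t / n_grid
    y = [y0] * (n_grid + 1)           # y^[0] is identically y0
    for _ in range(n_iter):
        integral, y_new = 0.0, [y0]
        for k in range(n_grid):
            integral += f(y[k]) * h   # running value of int_0^{t_{k+1}} f(y^[n](s)) ds
            y_new.append(y0 + integral)
        y = y_new
    return y[-1]                      # the current iterate evaluated at time t

# f(y) = y, y0 = 1: the unique fixed point is y(t) = e^t.
approx = picard_iterates(lambda y: y, 1.0, 1.0, n_iter=25)
# approx converges toward e = 2.71828... as n_iter and n_grid grow
```

For Lipschitz f the iterates contract at factorial speed, which is exactly the property exploited by the stochastic analogue (15).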
Indeed, the proof of [22, Proposition 2.2] provides a Banach space (V, ‖·‖_V), V ⊂ R^{[0,T]×R^d}, in which the
mapping Φ : V → V satisfying for all t ∈ [0, T], x ∈ R^d, v ∈ V that
\[ (\Phi(v))(t,x) = \mathbb{E}\big[g(X_{t,T}^{x})\big] + \int_t^T \mathbb{E}\big[f(v(s, X_{t,s}^{x}))\big]\,ds \tag{16} \]
is, under suitable assumptions, a contraction. Banach's fixed point theorem thus provides the existence
of a unique fixed point u ∈ V with Φ(u) = u and proves convergence of (15). In order to derive an
approximation scheme for the right-hand side of (15), we employ a telescoping sum argument and define
Φ_{−1} ≡ 0, which gives us for all n ∈ N, t ∈ [0, T], x ∈ R^d
\[ \begin{aligned} \Phi_{n+1}(t,x) &= \mathbb{E}\big[g(X_{t,T}^{x})\big] + \int_t^T \mathbb{E}\big[f(\Phi_n(s, X_{t,s}^{x}))\big]\,ds \\ &= \mathbb{E}\big[g(X_{t,T}^{x})\big] + \int_t^T \mathbb{E}\bigg[f(\Phi_0(s, X_{t,s}^{x})) + \sum_{l=1}^{n} \big(f(\Phi_l(s, X_{t,s}^{x})) - f(\Phi_{l-1}(s, X_{t,s}^{x}))\big)\bigg]\,ds \\ &= \mathbb{E}\big[g(X_{t,T}^{x})\big] + \sum_{l=0}^{n} \int_t^T \mathbb{E}\big[f(\Phi_l(s, X_{t,s}^{x})) - \mathbb{1}_{\mathbb{N}}(l)\, f(\Phi_{l-1}(s, X_{t,s}^{x}))\big]\,ds. \end{aligned} \tag{17} \]
With a standard Monte Carlo approach we can approximate E[g(X_{t,T}^x)] efficiently. For the sum on the
right-hand side of the above equation, note that
a) one can, for all t ∈ [0, T], h ∈ L¹([t, T], B([t, T]), λ|_{[t,T]}; R), approximate ∫_t^T h(s) dλ(s) via Monte Carlo
averages by considering the probability space ([t, T], B([t, T]), (1/(T−t)) λ|_{[t,T]}) and sampling independent,
continuous uniformly distributed τ_1, ..., τ_M : Ω → [t, T], M ∈ N. This yields
\[ \int_t^T h(s)\,d\lambda(s) = (T-t) \int_t^T h(s)\,d\big(\tfrac{1}{T-t}\lambda|_{[t,T]}\big)(s) \approx \frac{T-t}{M} \sum_{k=1}^{M} h(\tau_k). \tag{18} \]
b) due to the recursive structure of (17), summands of large l are costly to compute but, thanks to
convergence, tend to have small variance. In order to achieve a prescribed approximation accuracy,
these summands will only need relatively few samples for a suitable Monte Carlo average. On the
other hand, summands of small l are cheap to compute but have large variance, implying the need for
relatively many samples for a Monte Carlo average that satisfies a prescribed approximation accuracy.
This is the key insight, which motivates the multilevel approach.
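Item a) can be sketched in a few lines; the integrand s ↦ s², whose integral over [0, 1] is 1/3, is a hypothetical example of our own choosing:

```python
import random

def mc_time_integral(h, t, T, M, rng=random.Random(0)):
    """Monte Carlo approximation of int_t^T h(s) ds as in (18): average h over
    M independent uniform samples on [t, T] and rescale by the length T - t."""
    return (T - t) / M * sum(h(rng.uniform(t, T)) for _ in range(M))

est = mc_time_integral(lambda s: s * s, 0.0, 1.0, M=200_000)
# est is close to 1/3, with a statistical error of order M^(-1/2)
```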
To present the multilevel approach, we need a larger index set. Let Θ = ∪_{n∈N} Z^n. Let Y_t^{θ,x} : [t, T] × Ω →
R^d, θ ∈ Θ, t ∈ [0, T], x ∈ R^d, be independent Euler-Maruyama approximations of X_t^x (see (57) below
for a concise definition). Let T_t^θ : Ω → [t, T], θ ∈ Θ, t ∈ [0, T], be independent, continuous uniformly
distributed random variables. In particular, let (Y_t^{θ,x})_{θ∈Θ} and (T_t^θ)_{θ∈Θ} be independent. Combining this
with (17), a), and b) gives us the desired multilevel Picard approximations (see (58) below for a concise
definition):
\[ \begin{aligned} U_{n,M}^{\theta}(t,x) = {} & \frac{\mathbb{1}_{\mathbb{N}}(n)}{M^{n}} \sum_{i=1}^{M^{n}} g\big(Y_{t,T}^{(\theta,0,-i),x}\big) \\ & + \sum_{l=0}^{n-1} \frac{T-t}{M^{n-l}} \sum_{i=1}^{M^{n-l}} \Big[f \circ U_{l,M}^{(\theta,l,i)} - \mathbb{1}_{\mathbb{N}}(l)\, f \circ U_{l-1,M}^{(\theta,-l,i)}\Big]\Big(T_t^{(\theta,l,i)}, Y_{t,T_t^{(\theta,l,i)}}^{(\theta,l,i),x}\Big). \end{aligned} \tag{19} \]
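The recursion (19) can be sketched in code under strong simplifications (scalar state, a fixed number of Euler-Maruyama steps per path, and caller-supplied coefficient functions); this is an illustrative sketch of the scheme, not the construction analyzed in the article:

```python
import math
import random

rng = random.Random(0)

def euler_maruyama(mu, sigma, t, s, x, n_steps=20):
    """One independent Euler-Maruyama path approximating X_{t,s}^x for
    the SDE dX = mu(X) dr + sigma(X) dW."""
    h = (s - t) / n_steps
    y = x
    for _ in range(n_steps):
        y += mu(y) * h + sigma(y) * math.sqrt(h) * rng.gauss(0.0, 1.0)
    return y

def mlp(n, M, t, x, T, mu, sigma, f, g):
    """One realization of the MLP approximation U_{n,M}(t, x) from (19)."""
    if n == 0:
        return 0.0
    # Monte Carlo average of g along M^n independent approximate paths
    acc = sum(g(euler_maruyama(mu, sigma, t, T, x)) for _ in range(M**n)) / M**n
    # multilevel correction sums with uniformly sampled time points
    for l in range(n):
        for _ in range(M ** (n - l)):
            s = rng.uniform(t, T)
            y = euler_maruyama(mu, sigma, t, s, x)
            term = f(mlp(l, M, s, y, T, mu, sigma, f, g))
            if l > 0:
                term -= f(mlp(l - 1, M, s, y, T, mu, sigma, f, g))
            acc += (T - t) / M ** (n - l) * term
    return acc

# Sanity check with f = 0, mu = 0, sigma = 1, g(x) = x: then
# u(t, x) = E[x + W_{T-t}] = x, so U_{3,4}(0, 1) should be close to 1.
u_hat = mlp(3, 4, 0.0, 1.0, 1.0,
            lambda x: 0.0, lambda x: 1.0, lambda u: 0.0, lambda x: x)
```

Note how the inner levels l are called recursively with far fewer samples than the cheap outer levels, which is item b) in action.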
In Section 3 we will show that deep neural networks are capable of representing MLP approximations if
μ, σ, f, and g are DNN functions. For this reason, we assume in Section 2 that for a given PDE such as
(10) with coefficient functions μ_1, σ_1, f_1, and g_1, we can approximate its coefficients by DNN functions
μ_2, σ_2, f_2, and g_2, respectively. Thus we have to take two sources of error into account:
(i) the perturbation error of the PDE. We will utilize stochastic tools to obtain an estimate of |u_1(t, x) −
u_2(t, x)|, where for each i ∈ {1, 2} we have
\[ u_i(t,x) = \mathbb{E}\big[g_i(X_{t,T}^{x,i})\big] + \int_t^T \mathbb{E}\big[f_i(u_i(s, X_{t,s}^{x,i}))\big]\,ds, \tag{20} \]
with X_t^{x,i} solving the stochastic differential equation (SDE) with drift μ_i and diffusion σ_i with respect
to W, starting in (t, x) ∈ [0, T] × R^d (cf. (11), (12), and Lemma 2.4 below).
(ii) the MLP error (E[|U_{n,M}^0(t, x) − u_2(t, x)|²])^{1/2}, where U_{n,M}^θ, θ ∈ Θ, are the MLP approximations
built from the coefficient functions μ_2, σ_2, f_2, and g_2.
With that, the triangle inequality enables us to bound the full error:
\[ \Big(\mathbb{E}\big[|U_{n,M}^{0}(t,x) - u_1(t,x)|^2\big]\Big)^{1/2} \le \Big(\mathbb{E}\big[|U_{n,M}^{0}(t,x) - u_2(t,x)|^2\big]\Big)^{1/2} + |u_2(t,x) - u_1(t,x)|. \tag{21} \]
Lemma 2.1. (i) Let d, m ∈ N, c, κ ∈ [1, ∞), p ∈ [2, ∞), let ϕ ∈ C²(R^d, [1, ∞)), μ ∈ C(R^d, R^d), σ ∈
C(R^d, R^{d×m}) satisfy for all x, y ∈ R^d, z ∈ R^d \ {0} that
\[ \max\Bigg\{ \frac{\langle(\nabla\varphi)(x), z\rangle}{(\varphi(x))^{\frac{p-1}{p}} \|z\|},\ \frac{\langle z, (\operatorname{Hess}\varphi(x))z\rangle}{(\varphi(x))^{\frac{p-2}{p}} \|z\|^{2}},\ \frac{c\|x\| + \max\{\|\mu(0)\|, \|\sigma(0)\|_F\}}{(\varphi(x))^{\frac{1}{p}}} \Bigg\} \le c \tag{22} \]
and
\[ \max\{\|\mu(x) - \mu(y)\|, \|\sigma(x) - \sigma(y)\|_F\} \le c\|x - y\|. \tag{23} \]
Then it holds for all x ∈ R^d that
\[ |\langle\nabla(\varphi(x))^{\kappa}, \mu(x)\rangle| + \tfrac{1}{2}\operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}\operatorname{Hess}((\varphi(x))^{\kappa})\big) \le \tfrac{\kappa}{2}\big((\kappa-1)c^{4} + 3c^{3}\big)(\varphi(x))^{\kappa}. \tag{24} \]
(ii) Additionally let T ∈ (0, ∞), let (Ω, F, P, (F_t)_{t∈[0,T]}) be a filtered probability space satisfying the usual
conditions, let W : [0, T] × Ω → R^m be a standard (F_t)_{t∈[0,T]}-Brownian motion, and for all x ∈ R^d, t ∈
[0, T] let X_t^x : [t, T] × Ω → R^d be an (F_s)_{s∈[t,T]}-adapted stochastic process satisfying P-a.s. for all
s ∈ [t, T] that
\[ X_{t,s}^{x} = x + \int_t^s \mu(X_{t,r}^{x})\,dr + \int_t^s \sigma(X_{t,r}^{x})\,dW_r. \tag{25} \]
Then it holds for all t ∈ [0, T], s ∈ [t, T], x ∈ R^d that
\[ \mathbb{E}\big[(\varphi(X_{t,s}^{x}))^{\kappa}\big] \le e^{\frac{\kappa}{2}((\kappa-1)c^{4}+3c^{3})(s-t)} (\varphi(x))^{\kappa}. \tag{26} \]
Proof. Note that for all x ∈ R^d it holds that ∇((ϕ(x))^κ) = κ(ϕ(x))^{κ−1}(∇ϕ)(x) and
\[ \operatorname{Hess}((\varphi(x))^{\kappa}) = \kappa(\kappa-1)(\varphi(x))^{\kappa-2} (\nabla\varphi)(x)[(\nabla\varphi)(x)]^{*} + \kappa(\varphi(x))^{\kappa-1} \operatorname{Hess}\varphi(x). \tag{27} \]
The triangle inequality, (22), and (23) ensure for all x ∈ R^d that
\[ \max\{\|\mu(x)\|, \|\sigma(x)\|_F\} \le c\|x\| + \max\{\|\mu(0)\|, \|\sigma(0)\|_F\} \le c(\varphi(x))^{\frac{1}{p}}. \tag{28} \]
This and (22) ensure for all x ∈ R^d that
\[ \operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}\operatorname{Hess}\varphi(x)\big) \le c\|\sigma(x)\|_F^{2}\,(\varphi(x))^{\frac{p-2}{p}} \le c^{3}\varphi(x). \tag{29} \]
The Cauchy-Schwarz inequality and (22) ensure for all x ∈ R^d that
\[ \operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}(\nabla\varphi)(x)[(\nabla\varphi)(x)]^{*}\big) \le \|\sigma(x)[\sigma(x)]^{*}\|_F\, \|(\nabla\varphi)(x)[(\nabla\varphi)(x)]^{*}\|_F \le \|\sigma(x)\|_F^{2}\, \|(\nabla\varphi)(x)\|^{2} \le c^{4}(\varphi(x))^{2}. \tag{30} \]
It thus holds for all x ∈ R^d that
\[ \begin{aligned} \operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}\operatorname{Hess}((\varphi(x))^{\kappa})\big) &= \kappa(\kappa-1)(\varphi(x))^{\kappa-2}\operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}(\nabla\varphi)(x)[(\nabla\varphi)(x)]^{*}\big) \\ &\quad + \kappa(\varphi(x))^{\kappa-1}\operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}\operatorname{Hess}\varphi(x)\big) \\ &\le \kappa(\kappa-1)c^{4}(\varphi(x))^{\kappa} + \kappa c^{3}(\varphi(x))^{\kappa}. \end{aligned} \tag{31} \]
This and (22) ensure for all x ∈ R^d that
\[ \begin{aligned} |\langle\nabla(\varphi(x))^{\kappa}, \mu(x)\rangle| + \tfrac{1}{2}\operatorname{Tr}\big(\sigma(x)[\sigma(x)]^{*}\operatorname{Hess}((\varphi(x))^{\kappa})\big) &\le \kappa(\varphi(x))^{\kappa-1}|\langle(\nabla\varphi)(x), \mu(x)\rangle| + \tfrac{1}{2}\kappa\big((\kappa-1)c^{4} + c^{3}\big)(\varphi(x))^{\kappa} \\ &\le c^{2}\kappa(\varphi(x))^{\kappa} + \tfrac{1}{2}\kappa\big((\kappa-1)c^{4} + c^{3}\big)(\varphi(x))^{\kappa} \\ &\le \big(\tfrac{1}{2}\kappa(\kappa-1)c^{4} + \tfrac{3}{2}\kappa c^{3}\big)(\varphi(x))^{\kappa}, \end{aligned} \tag{32} \]
which establishes item (i). This and [9, Lemma 2.2] (applied for every x ∈ R^d, t ∈ [0, T], s ∈ [t, T]
with T ← T − t, O ← R^d, V ← ([0, T − t] × R^d ∋ (s, x) ↦ (ϕ(x))^κ ∈ [1, ∞)), α ← ([0, T − t] ∋ s ↦
(½κ(κ − 1)c⁴ + (3/2)κc³) ∈ [0, ∞)), τ ← s − t, X ← (X_{t,t+r}^x)_{r∈[0,T−t]} in the notation of [9, Lemma 2.2])
proves that for all x ∈ R^d, t ∈ [0, T], s ∈ [t, T] we have that
\[ \mathbb{E}\big[(\varphi(X_{t,s}^{x}))^{\kappa}\big] \le e^{\frac{\kappa}{2}((\kappa-1)c^{4}+3c^{3})(s-t)} (\varphi(x))^{\kappa}. \tag{33} \]
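In one dimension, identity (27) used at the start of the proof reads (ϕ^κ)′′ = κ(κ − 1)ϕ^{κ−2}(ϕ′)² + κϕ^{κ−1}ϕ′′, and it can be sanity-checked numerically; the test function ϕ(x) = 1 + x² and all numerical parameters below are our own illustrative choices:

```python
def hess_identity_gap(phi, dphi, d2phi, kappa, x, h=1e-4):
    """Compare a central second difference of phi^kappa against the right-hand
    side of identity (27) specialized to one dimension."""
    f = lambda y: phi(y) ** kappa
    lhs = (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2
    rhs = (kappa * (kappa - 1) * phi(x) ** (kappa - 2) * dphi(x) ** 2
           + kappa * phi(x) ** (kappa - 1) * d2phi(x))
    return abs(lhs - rhs)

# phi(x) = 1 + x^2 with phi'(x) = 2x and phi''(x) = 2, kappa = 3, at x = 0.7
gap = hess_identity_gap(lambda y: 1 + y * y, lambda y: 2 * y, lambda y: 2.0, 3, 0.7)
# gap is of the order of the finite-difference error, i.e., very small
```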
let u_1, u_2 ∈ C([0, T] × R^d, R), and assume for all i ∈ {1, 2}, t ∈ [0, T], x ∈ R^d that
\[ \sup_{s\in[0,T]} \sup_{y\in\mathbb{R}^d} \bigg(\frac{|u_i(s,y)|}{\varphi(y)}\bigg) + \mathbb{E}\big[|g_i(X_{t,T}^{x,i})|\big] + \int_t^T \mathbb{E}\big[|(F_i(u_i))(s, X_{t,s}^{x,i})|\big]\,ds < \infty, \tag{41} \]
\[ u_i(t,x) = \mathbb{E}\big[g_i(X_{t,T}^{x,i})\big] + \int_t^T \mathbb{E}\big[(F_i(u_i))(s, X_{t,s}^{x,i})\big]\,ds. \tag{42} \]
Lemma 2.3. Assume Setting 2.2 and let x ∈ R^d, t ∈ [0, T], s ∈ [t, T]. Then it holds that
\[ \mathbb{E}\big[|u_2(s, X_{t,s}^{x,2})|^{q}\big] \le 2^{q-1} b^{q} (T + 1)^{q} (\varphi(x))^{\frac{\beta q}{p}}\, e^{\frac{\beta q}{2p}((q-1)c^{4}+3c^{3})(T-t) + c^{q}(T-s)^{q}}. \tag{43} \]
Proof. Display (41) and Lemma 2.1 (ii) (applied with κ ← q in the notation of Lemma 2.1) ensure that
\[ \Big(\mathbb{E}\big[|u_2(s, X_{t,s}^{x,2})|^{q}\big]\Big)^{\frac{1}{q}} \le \Bigg(\mathbb{E}\Bigg[\bigg(\frac{|u_2(s, X_{t,s}^{x,2})|}{\varphi(X_{t,s}^{x,2})}\bigg)^{q} \big(\varphi(X_{t,s}^{x,2})\big)^{q}\Bigg]\Bigg)^{\frac{1}{q}} \le \Bigg[\sup_{r\in[t,T]} \sup_{y\in\mathbb{R}^d} \frac{|u_2(r,y)|}{\varphi(y)}\Bigg] \Big(\mathbb{E}\big[(\varphi(X_{t,s}^{x,2}))^{q}\big]\Big)^{\frac{1}{q}} < \infty. \tag{44} \]
Note that (35), Jensen's inequality, and Lemma 2.1 (ii) (applied with κ ← q in the notation of Lemma
2.1) imply that
\[ \Big(\mathbb{E}\big[|g_2(X_{t,T}^{x,2})|^{q}\big]\Big)^{\frac{1}{q}} \le b\, \Big(\mathbb{E}\big[(\varphi(X_{t,T}^{x,2}))^{\frac{\beta q}{p}}\big]\Big)^{\frac{1}{q}} \le b\, \Big(\mathbb{E}\big[(\varphi(X_{t,T}^{x,2}))^{q}\big]\Big)^{\frac{\beta}{qp}} \le b\, e^{\frac{\beta}{2p}((q-1)c^{4}+3c^{3})(T-t)} (\varphi(x))^{\frac{\beta}{p}}. \tag{45} \]
The triangle inequality, (36), (35), and Lemma 2.1 (ii) (applied with κ ← q in the notation of Lemma
2.1) imply
\[ \begin{aligned} &\bigg(\int_s^T \mathbb{E}\big[|(F_2(u_2))(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}} \\ &\le \bigg(\int_s^T \mathbb{E}\big[|f_2(r, X_{t,r}^{x,2}, u_2(r, X_{t,r}^{x,2})) - f_2(r, X_{t,r}^{x,2}, 0)|^{q}\big]\,dr\bigg)^{\frac{1}{q}} + \bigg(\int_s^T \mathbb{E}\big[|f_2(r, X_{t,r}^{x,2}, 0)|^{q}\big]\,dr\bigg)^{\frac{1}{q}} \\ &\le c\bigg(\int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}} + \frac{b}{T}\bigg(\int_s^T \mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{\frac{\beta q}{p}}\big]\,dr\bigg)^{\frac{1}{q}} \\ &\le c\bigg(\int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}} + \frac{b(T-s)^{\frac{1}{q}}}{T} \sup_{r\in[s,T]} \Big(\mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{\frac{\beta q}{p}}\big]\Big)^{\frac{1}{q}} \\ &\le c\bigg(\int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}} + \frac{b(T-s)^{\frac{1}{q}}}{T}\, e^{\frac{\beta}{2p}((q-1)c^{4}+3c^{3})(T-t)} (\varphi(x))^{\frac{\beta}{p}}. \end{aligned} \tag{46} \]
This, (40), the definition of u_2, the triangle inequality, Jensen's inequality, Fubini's theorem, and (45)
ensure that
\[ \begin{aligned} \mathbb{E}\big[|u_2(s, X_{t,s}^{x,2})|^{q}\big] &= \mathbb{E}\Bigg[\bigg|g_2\big(X_{s,T}^{X_{t,s}^{x,2},2}\big) + \int_s^T (F_2(u_2))\big(r, X_{s,r}^{X_{t,s}^{x,2},2}\big)\,dr\bigg|^{q}\Bigg] \\ &\le \Bigg(\Big(\mathbb{E}\big[|g_2(X_{t,T}^{x,2})|^{q}\big]\Big)^{\frac{1}{q}} + (T-s)^{\frac{q-1}{q}}\bigg(\int_s^T \mathbb{E}\big[|(F_2(u_2))(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}}\Bigg)^{q} \\ &\le \Bigg((T-s)^{\frac{q-1}{q}}\, c \bigg(\int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr\bigg)^{\frac{1}{q}} + 2b\, e^{\frac{\beta}{2p}((q-1)c^{4}+3c^{3})(T-t)} (\varphi(x))^{\frac{\beta}{p}}\Bigg)^{q} \\ &\le 2^{q-1}(T-s)^{q-1} c^{q} \int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr + 2^{2q-1} b^{q}\, e^{\frac{\beta q}{2p}((q-1)c^{4}+3c^{3})(T-t)} (\varphi(x))^{\frac{\beta q}{p}}. \end{aligned} \tag{47} \]
Proof. First, [20, Theorem 1.2] (applied for every t ∈ [0, T], s ∈ [t, T], x ∈ R^d with H ← R^d, U ←
R^d, D ← R^d, T ← s − t, (F_r)_{r∈[0,T]} ← (F_{r+t})_{r∈[0,s−t]}, (W_r)_{r∈[0,T]} ← (W_{t+r} − W_t)_{r∈[0,s−t]}, (X_r)_{r∈[0,T]} ←
(X_{t,t+r}^{x,1})_{r∈[0,s−t]}, (Y_r)_{r∈[0,T]} ← (X_{t,t+r}^{x,2})_{r∈[0,s−t]}, μ ← μ_1, σ ← σ_1, (a_r)_{r∈[0,T]} ← (μ_2(X_{t,t+r}^{x,2}))_{r∈[0,s−t]},
(b_r)_{r∈[0,T]} ← (σ_2(X_{t,t+r}^{x,2}))_{r∈[0,s−t]}, τ ← (Ω ∋ ω ↦ s − t ∈ [0, s − t]), p ← 2, q ← ∞, r ← 2, α ←
1, β ← 1, ε ← 1 in the notation of [20, Theorem 1.2]), combined with the Cauchy-Schwarz inequality,
(37), and (38), shows that for all t ∈ [0, T], s ∈ [t, T], x ∈ R^d we have
\[ \begin{aligned} &\Big(\mathbb{E}\big[\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\|^{2}\big]\Big)^{\frac{1}{2}} \\ &\le \Bigg[\sup_{\substack{z,y\in\mathbb{R}^d,\\ z\neq y}} \exp\Bigg(\frac{\big[\langle z - y, \mu_1(z) - \mu_1(y)\rangle + \tfrac{1}{2}\|\sigma_1(z) - \sigma_1(y)\|_F^{2}\big]^{+}}{\|z - y\|^{2}} + \frac{1}{2}\Bigg)\Bigg] \\ &\quad \cdot \Bigg(\bigg(\int_t^s \mathbb{E}\big[\|\mu_2(X_{t,r}^{x,2}) - \mu_1(X_{t,r}^{x,2})\|^{2}\big]\,dr\bigg)^{\frac{1}{2}} + \sqrt{2}\bigg(\int_t^s \mathbb{E}\big[\|\sigma_2(X_{t,r}^{x,2}) - \sigma_1(X_{t,r}^{x,2})\|_F^{2}\big]\,dr\bigg)^{\frac{1}{2}}\Bigg) \\ &\le e^{c+c^{2}+\frac{1}{2}}\, \delta\,(1 + \sqrt{2}) \bigg(\int_t^s \mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{2q}\big]\,dr\bigg)^{\frac{1}{2}} \\ &\le e^{c+c^{2}+\frac{1}{2}}\, \delta\,(1 + \sqrt{2})\, \sqrt{s - t}\, \sup_{r\in[t,s]} \Big(\mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{2q}\big]\Big)^{\frac{1}{2}}. \end{aligned} \tag{50} \]
This combined with Lemma 2.1 (ii) (applied with κ ← 2q in the notation of Lemma 2.1) hence demonstrates for
all x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that
\[ \Big(\mathbb{E}\big[\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\|^{2}\big]\Big)^{\frac{1}{2}} \le \delta\, 2\sqrt{2}\, e^{c^{2}+c+\frac{1}{2}}\, \sqrt{s - t}\, e^{q((2q-1)c^{4}+3c^{3})(s-t)} (\varphi(x))^{q}. \tag{51} \]
This and Hölder's inequality ensure for all x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that
\[ \begin{aligned} \mathbb{E}\Big[\big(\varphi(X_{t,s}^{x,1}) + \varphi(X_{t,s}^{x,2})\big)^{\frac{\beta}{p}} \big\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\big\|\Big] &\le \mathbb{E}\Big[\big(\varphi(X_{t,s}^{x,1}) + \varphi(X_{t,s}^{x,2})\big)^{\frac{1}{2}} \big\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\big\|\Big] \\ &\le \Big(\mathbb{E}\big[\varphi(X_{t,s}^{x,1}) + \varphi(X_{t,s}^{x,2})\big]\Big)^{\frac{1}{2}} \Big(\mathbb{E}\big[\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\|^{2}\big]\Big)^{\frac{1}{2}} \\ &\le \sqrt{2}\, e^{\frac{3}{4}c^{3}(s-t)} (\varphi(x))^{\frac{1}{2}} \Big(\mathbb{E}\big[\|X_{t,s}^{x,1} - X_{t,s}^{x,2}\|^{2}\big]\Big)^{\frac{1}{2}} \\ &\le \delta\, 4\sqrt{s - t}\, e^{(q((2q-1)c^{4}+3c^{3}))(s-t)+c^{2}+c+\frac{1}{2}} (\varphi(x))^{q+\frac{1}{2}}. \end{aligned} \tag{52} \]
Next, the triangle inequality and (40) guarantee for all t ∈ [0, T], s ∈ [t, T], x ∈ R^d that
\[ \begin{aligned} &\mathbb{E}\big[|u_2(s, X_{t,s}^{x,2}) - u_1(s, X_{t,s}^{x,1})|\big] \\ &\le \mathbb{E}\Big[\big|g_2\big(X_{s,T}^{X_{t,s}^{x,2},2}\big) - g_1\big(X_{s,T}^{X_{t,s}^{x,1},1}\big)\big|\Big] + \int_s^T \mathbb{E}\Big[\big|(F_2(u_2))\big(r, X_{s,r}^{X_{t,s}^{x,2},2}\big) - (F_1(u_1))\big(r, X_{s,r}^{X_{t,s}^{x,1},1}\big)\big|\Big]\,dr \\ &= \mathbb{E}\big[|g_2(X_{t,T}^{x,2}) - g_1(X_{t,T}^{x,1})|\big] + \int_s^T \mathbb{E}\big[|(F_2(u_2))(r, X_{t,r}^{x,2}) - (F_1(u_1))(r, X_{t,r}^{x,1})|\big]\,dr \\ &\le \mathbb{E}\big[|g_2(X_{t,T}^{x,2}) - g_1(X_{t,T}^{x,2})|\big] + \int_s^T \mathbb{E}\big[|(F_2(u_2))(r, X_{t,r}^{x,2}) - (F_1(u_2))(r, X_{t,r}^{x,2})|\big]\,dr \\ &\quad + \mathbb{E}\big[|g_1(X_{t,T}^{x,2}) - g_1(X_{t,T}^{x,1})|\big] + \int_s^T \mathbb{E}\big[|(F_1(u_2))(r, X_{t,r}^{x,2}) - (F_1(u_1))(r, X_{t,r}^{x,1})|\big]\,dr. \end{aligned} \tag{53} \]
This, (36), and (38) ensure for all t ∈ [0, T], s ∈ [t, T], x ∈ R^d that
\[ \begin{aligned} &\mathbb{E}\big[|u_2(s, X_{t,s}^{x,2}) - u_1(s, X_{t,s}^{x,1})|\big] \\ &\le T^{-\frac{1}{2}} b\, \mathbb{E}\Big[\big(\varphi(X_{t,T}^{x,1}) + \varphi(X_{t,T}^{x,2})\big)^{\frac{\beta}{p}} \big\|X_{t,T}^{x,1} - X_{t,T}^{x,2}\big\|\Big] + \delta\, \mathbb{E}\big[(\varphi(X_{t,T}^{x,2}))^{q}\big] \\ &\quad + \delta \int_s^T \mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{q} + |u_2(r, X_{t,r}^{x,2})|^{q}\big]\,dr + c \int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2}) - u_1(r, X_{t,r}^{x,1})|\big]\,dr \\ &\quad + b\,T^{-\frac{1}{2}} \sup_{r\in[s,T]} \mathbb{E}\Big[\big(\varphi(X_{t,r}^{x,1}) + \varphi(X_{t,r}^{x,2})\big)^{\frac{\beta}{p}} \big\|X_{t,r}^{x,1} - X_{t,r}^{x,2}\big\|\Big] \\ &\le 2b\,T^{-\frac{1}{2}} \sup_{r\in[s,T]} \mathbb{E}\Big[\big(\varphi(X_{t,r}^{x,1}) + \varphi(X_{t,r}^{x,2})\big)^{\frac{\beta}{p}} \big\|X_{t,r}^{x,1} - X_{t,r}^{x,2}\big\|\Big] \\ &\quad + \delta(T - s + 1) \sup_{r\in[s,T]} \mathbb{E}\big[(\varphi(X_{t,r}^{x,2}))^{q}\big] + \delta(T - s) \sup_{r\in[s,T]} \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2})|^{q}\big] \\ &\quad + c \int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2}) - u_1(r, X_{t,r}^{x,1})|\big]\,dr. \end{aligned} \tag{54} \]
This, Lemma 2.1 (ii) (applied with κ ← q in the notation of Lemma 2.1), (52), and Lemma 2.3 hence
ensure for all x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that
\[ \begin{aligned} \mathbb{E}\big[|u_2(s, X_{t,s}^{x,2}) - u_1(s, X_{t,s}^{x,1})|\big] &\le \delta\, 8\, e^{(q((2q-1)c^{4}+3c^{3}))(T-t)+c^{2}+c+\frac{1}{2}} (\varphi(x))^{q+\frac{1}{2}} \\ &\quad + \delta(T - s + 1)\, e^{\frac{q}{2}((q-1)c^{4}+3c^{3})(T-t)} (\varphi(x))^{q} \\ &\quad + \delta(T - s)\, 2^{q} b^{q}\, e^{\frac{q}{4}((q-1)c^{4}+3c^{3})(T-t)+2^{q-1}c^{q}(T-t)^{q}} (\varphi(x))^{\frac{q}{2}} \\ &\quad + c \int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2}) - u_1(r, X_{t,r}^{x,1})|\big]\,dr \\ &\le \delta\, 2^{q+2} b^{q} (T + 1)\, e^{(q((2q-1)c^{4}+3c^{3}))(T-t)+2^{q}c^{q}T^{q}+(c+1)^{2}} (\varphi(x))^{q+\frac{1}{2}} \\ &\quad + c \int_s^T \mathbb{E}\big[|u_2(r, X_{t,r}^{x,2}) - u_1(r, X_{t,r}^{x,1})|\big]\,dr. \end{aligned} \tag{55} \]
This and Gronwall's inequality ensure for all x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that
\[ \mathbb{E}\big[|u_2(s, X_{t,s}^{x,2}) - u_1(s, X_{t,s}^{x,1})|\big] \le \delta\, 2^{q+2} b^{q} (T + 1)\, e^{q(2qc^{4}+3c^{3})(T-t)+2^{q}c^{q}T^{q}+(c+1)^{2}} (\varphi(x))^{q+\frac{1}{2}}. \tag{56} \]
let U_{n,M}^θ : [0, T] × R^d × Ω → R, n, M ∈ Z, θ ∈ Θ, satisfy for all θ ∈ Θ, n ∈ N_0, M ∈ N, t ∈ [0, T], x ∈ R^d
that
\[ \begin{aligned} U_{n,M}^{\theta}(t,x) = {} & \frac{\mathbb{1}_{\mathbb{N}}(n)}{M^{n}} \sum_{i=1}^{M^{n}} g_2\big(Y_{t,T}^{M,(\theta,0,-i),x}\big) \\ & + \sum_{l=0}^{n-1} \frac{T - t}{M^{n-l}} \sum_{i=1}^{M^{n-l}} \Big[F_2\big(U_{l,M}^{(\theta,l,i)}\big) - \mathbb{1}_{\mathbb{N}}(l)\, F_2\big(U_{l-1,M}^{(\theta,-l,i)}\big)\Big]\Big(T_t^{(\theta,l,i)}, Y_{t,T_t^{(\theta,l,i)}}^{M,(\theta,l,i),x}\Big). \end{aligned} \tag{58} \]
Proof. First, [22, Theorem 4.2(iv)] (applied with f ← f_2, g ← g_2, μ ← μ_2, σ ← σ_2 in the notation of [22,
Theorem 4.2(iv)]) ensures for all t ∈ [0, T], x ∈ R^d, n ∈ N_0, M ∈ N that
\[ \Big(\mathbb{E}\big[|U_{n,M}^{0}(t,x) - u_2(t,x)|^{2}\big]\Big)^{\frac{1}{2}} \le \Bigg[\frac{\exp(2ncT + \frac{M}{2})}{M^{\frac{n}{2}}} + \frac{1}{M^{\frac{M}{2}}}\Bigg] 12\, b\, c^{2}\, (\varphi(x))^{\frac{\beta+1}{p}}\, e^{9c^{3}T}. \tag{60} \]
This, the triangle inequality, and Lemma 2.4 ensure for all t ∈ [0, T], x ∈ R^d, n ∈ N_0, M ∈ N that
\[ \begin{aligned} \Big(\mathbb{E}\big[|U_{n,M}^{0}(t,x) - u_1(t,x)|^{2}\big]\Big)^{\frac{1}{2}} &\le \Big(\mathbb{E}\big[|U_{n,M}^{0}(t,x) - u_2(t,x)|^{2}\big]\Big)^{\frac{1}{2}} + |u_2(t,x) - u_1(t,x)| \\ &\le \Bigg[\frac{\exp(2ncT + \frac{M}{2})}{M^{\frac{n}{2}}} + \frac{1}{M^{\frac{M}{2}}}\Bigg] 12\, b\, c^{2}\, (\varphi(x))^{\frac{\beta+1}{p}}\, e^{9c^{3}T} \\ &\quad + \delta\, 2^{q+2} b^{q} (T + 1)\, e^{q(2qc^{4}+3c^{3})(T-t)+2^{q}c^{q}T^{q}+(c+1)^{2}} (\varphi(x))^{q+\frac{1}{2}} \\ &\le 4^{q} b^{q} c^{2} (T + 1)\, e^{q(2qc^{4}+3c^{3})T+2^{q}c^{q}T^{q}+(c+1)^{2}} (\varphi(x))^{q+\frac{1}{2}} \Bigg[\delta + \frac{\exp(2ncT + \frac{M}{2})}{M^{\frac{n}{2}}} + \frac{1}{M^{\frac{M}{2}}}\Bigg]. \end{aligned} \tag{61} \]
where T_1, ..., T_L are affine-linear mappings and A refers to a nonlinear mapping, the so-called activation
function, that is applied componentwise. Note that the last affine-linear mapping T_L is not followed by an
activation function. For every k ∈ {1, ..., L − 1} the composition (A ∘ T_k) is called a hidden layer, while T_L is
called the output layer. Similarly, the input of the function in (62) is called the input layer. We will only consider
compositions that are endowed with the rectified linear unit (ReLU) activation function, i.e., for all d ∈ N
we have A_d : R^d → R^d satisfying for all x = (x_1, ..., x_d) ∈ R^d that A_d(x) = (max{0, x_1}, ..., max{0, x_d}).
In this case, the function in (62) is characterized by the affine-linear mappings involved, which in turn are
of the form T_k(x) = W_k x + B_k, where (W_k, B_k) ∈ R^{l_k×l_{k−1}} × R^{l_k}, x ∈ R^{l_{k−1}}, k ∈ {1, ..., L}, l_0, ..., l_L ∈ N,
and thus the function in (62) is characterized by (W_1, B_1), ..., (W_L, B_L). We refer to the entries of the
matrices W_k, k ∈ {1, ..., L}, as weights, to the entries of the vectors B_k, k ∈ {1, ..., L}, as biases, and to
the collection ((W_1, B_1), ..., (W_L, B_L)) as a neural network. If L ∈ N ∩ [2, ∞), we call the network deep.
The set of all deep neural networks in our consideration can thus be defined as
\[ \mathrm{N} = \bigcup_{H\in\mathbb{N}} \bigcup_{(k_0,\dots,k_{H+1})\in\mathbb{N}^{H+2}} \Bigg[\prod_{n=1}^{H+1} \Big(\mathbb{R}^{k_n\times k_{n-1}} \times \mathbb{R}^{k_n}\Big)\Bigg]. \tag{63} \]
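The operators R (realization) and P (number of parameters) appearing in Theorem 1.1 can be sketched directly from this description; representing a network as a list of (W_k, B_k) pairs of nested Python lists is our own illustrative convention:

```python
def realization(phi, x):
    """R(Phi): evaluate the ReLU network phi = [(W1, B1), ..., (WL, BL)] at x,
    applying the ReLU activation after every layer except the last."""
    for k, (W, B) in enumerate(phi):
        x = [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(W, B)]
        if k < len(phi) - 1:                    # no activation on the output layer
            x = [max(a, 0.0) for a in x]
    return x

def num_params(phi):
    """P(Phi): total number of weights and biases."""
    return sum(len(W) * len(W[0]) + len(B) for W, B in phi)

# A deep network (L = 2) computing |x| = ReLU(x) + ReLU(-x) on R:
phi = [([[1.0], [-1.0]], [0.0, 0.0]),   # W1 in R^{2x1}, B1 in R^2
       ([[1.0, 1.0]], [0.0])]           # W2 in R^{1x2}, B2 in R
value = realization(phi, [-3.0])        # -> [3.0]
```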
For a given Φ ∈ N, we denote its corresponding function by R(Φ).
In practice, DNNs are employed in a host of machine-learning applications. In supervised learning
the goal is to choose Φ ∈ N in such a way that for a given set of data points (x_p, y_p)_{p∈[1,P]∩N} :
[1, P] ∩ N → R^d × R^m, d, m, P ∈ N, we have for all p ∈ [1, P] ∩ N that (R(Φ))(x_p) ≈ y_p. Typically,
one fixes the shape of a network and then aims to minimize a loss function, usually the sum of squared
errors, with respect to the choice of weights and biases. Due to the large number of parameters involved
in even small networks, this minimization problem is high-dimensional and, additionally, highly non-convex.
Therefore, classical minimization techniques such as gradient descent or Newton's method are
not suitable. To overcome this issue, stochastic gradient descent methods are employed. These seem to
work well, although there exist relatively few theoretical results which prove convergence of stochastic
gradient descent methods, see, e.g., [14]. For an overview of deep-learning-based algorithms for PDE
approximations, see [5].
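The training loop described above can be illustrated in miniature with a one-parameter least-squares model (a toy stand-in, chosen by us, for the weights and biases of an actual DNN):

```python
import random

def sgd_least_squares(data, lr=0.1, epochs=200, rng=random.Random(0)):
    """Minimal stochastic gradient descent for the model y = w * x under the
    squared-error loss: each step uses the gradient at a single data point."""
    w = 0.0
    for _ in range(epochs):
        x, y = rng.choice(data)          # sample one data point
        grad = 2.0 * (w * x - y) * x     # d/dw of (w * x - y)^2
        w -= lr * grad                   # stochastic gradient step
    return w

data = [(x, 3.0 * x) for x in (1.0, 2.0, 0.5, 1.5)]
w = sgd_least_squares(data)              # converges toward the true slope 3.0
```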
3.2 Properties of the proposed deep neural networks
For the proof of the main result of this section, Lemma 3.10, we need several auxiliary lemmas, which
we present next. The proofs of Lemmas 3.3-3.5 appear in [23] and are therefore omitted.
Lemma 3.3. ([23, Lemma 3.5]) Assume Setting 3.2, let H, k, l ∈ N, and let α, β ∈ {k} × NH × {l}. Then
it holds that |||α ⊞ β||| ≤ |||α||| + |||β|||.
Lemma 3.4. ([23, Lemma 3.8]) Assume Setting 3.2 and let d1 , d2 , d3 ∈ N, f ∈ C(Rd2 , Rd3 ), g ∈
C(Rd1 , Rd2 ), α, β ∈ D satisfy that f ∈ R({Φ ∈ N : D(Φ) = α}) and g ∈ R({Φ ∈ N : D(Φ) = β}).
Then it holds that (f ◦ g) ∈ R({Φ ∈ N : D(Φ) = α ⊙ β}).
Lemma 3.5. ([23, Lemma 3.9]) Assume Setting 3.2 and let M, H, p, q ∈ N, h_1, h_2, ..., h_M ∈ R, k_i ∈
D, f_i ∈ C(R^p, R^q), i ∈ [1, M] ∩ N, satisfy for all i ∈ [1, M] ∩ N that dim(k_i) = H + 2 and f_i ∈
R({Φ ∈ N : D(Φ) = k_i}). Then it holds that
\[ \sum_{i=1}^{M} h_i f_i \in \mathrm{R}\Big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = \boxplus_{i=1}^{M} k_i\big\}\Big). \tag{72} \]
The following lemma extends [24, Lemma 5.4] as well as [23, Lemma 3.6], and proves the existence of a
DNN of arbitrary length representing the identity function.
Lemma 3.6. Assume Setting 3.2 and let d, H ∈ N. Then it holds that
let φ ∈ N satisfy φ = ((W_1, B_1), (W_2, B_2), ..., (W_H, B_H), (W_{H+1}, B_{H+1})), for every a ∈ R let a⁺ =
max{a, 0}, and let x⁰ = (x⁰_1, ..., x⁰_d) ∈ R^d, x¹, x², ..., x^H ∈ R^{2d} satisfy for all n ∈ N ∩ [1, H] that
and continuing this procedure yields
\[ x^{H} = A_{2d}\big(W_H x^{H-1} + B_H\big) = A_{2d}\Big(\big((x_1^{0})^{+}, (-x_1^{0})^{+}, \dots, (x_d^{0})^{+}, (-x_d^{0})^{+}\big)^{\top}\Big) = \big((x_1^{0})^{+}, (-x_1^{0})^{+}, \dots, (x_d^{0})^{+}, (-x_d^{0})^{+}\big)^{\top}. \tag{79} \]
This combined with the fact that dim(D(Ψ)) = dim(D(Φg )) + 1 completes the proof.
Proof. Note that for all s ∈ [0, T], k ∈ [1, K] ∩ N we have
\[ (s \vee \tau_{k-1}) \wedge \tau_k = \begin{cases} \tau_{k-1}, & s \le \tau_{k-1}, \\ s, & s \in (\tau_{k-1}, \tau_k], \\ \tau_k, & s > \tau_k. \end{cases} \tag{84} \]
Next, for all s ∈ [0, T], k ∈ [1, K] ∩ N let Φ^k_s ∈ C(R^d, R^d) satisfy for all x ∈ R^d that
Note that for all k ∈ [1, K − 1] ∩ N, s ≤ τ_{k−1} it holds that Φ^k_s = Id_{R^d}. This ensures for all k ∈ [1, K − 1] ∩ N, n ∈
[k + 1, K] ∩ N, s ≤ τ_k that
\[ \Psi_s^{k} = \Psi_s^{n}, \tag{88} \]
and in particular Ψ^k_s = Ψ^K_s. Observe that for all k ∈ [1, K] ∩ N, s ≤ τ_k, x ∈ R^d we have Ψ^k_s(x) = g_s(x)
and thus, for all s ∈ [0, T], x ∈ R^d, it holds that Ψ^K_s(x) = g_s(x).
Next, Lemma 3.6 (applied with d ← d, H ← dim(D(Φ_{σ,0})) − 2 in the notation of Lemma 3.6) ensures for
all s ∈ [0, T] that
\[ \operatorname{Id}_{\mathbb{R}^d} \in \mathrm{R}\big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = n_d^{\dim(\mathrm{D}(\Phi_{\sigma,0}))}\big\}\big). \tag{89} \]
Note that Lemma 3.5 (applied for all s ∈ [0, T], k ∈ [1, K] ∩ N with M ← 2, p ← d, q ← d, H ←
dim(D(Φ_{σ,0})), h_1 ← 1, h_2 ← 1, f_1 ← Id_{R^d}, f_2 ← σ(·)[f((s ∨ τ_{k−1}) ∧ τ_k) − f(τ_{k−1})], k_1 ← n_d^{dim(D(Φ_{σ,0}))}, k_2 ←
D(Φ_{σ,0}) in the notation of Lemma 3.5) proves for all s ∈ [0, T], k ∈ [1, K] ∩ N that
\[ \Phi_s^{k} \in \mathrm{R}\big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = n_d^{\dim(\mathrm{D}(\Phi_{\sigma,0}))} \boxplus \mathrm{D}(\Phi_{\sigma,0})\big\}\big). \tag{90} \]
This and Lemma 3.4 (applied for all s ∈ [0, T] with d_1 ← d, d_2 ← d, d_3 ← d, f ← Φ^2_s, g ← Φ^1_s,
α ← n_d^{dim(D(Φ_{σ,0}))} ⊞ D(Φ_{σ,0}), β ← n_d^{dim(D(Φ_{σ,0}))} ⊞ D(Φ_{σ,0}) in the notation of Lemma 3.4) show for all
s ∈ [0, T] that
\[ \Psi_s^{2} = \Phi_s^{2} \circ \Phi_s^{1} \in \mathrm{R}\Big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = \odot_{l=1}^{2}\big[n_d^{\dim(\mathrm{D}(\Phi_{\sigma,0}))} \boxplus \mathrm{D}(\Phi_{\sigma,0})\big]\big\}\Big). \tag{91} \]
This and Lemma 3.4 (applied for all s ∈ [0, T] with d_1 ← d, d_2 ← d, d_3 ← d, f ← Φ^3_s, g ← Ψ^2_s,
α ← n_d^{dim(D(Φ_{σ,0}))} ⊞ D(Φ_{σ,0}), β ← ⊙_{l=1}^{2} [n_d^{dim(D(Φ_{σ,0}))} ⊞ D(Φ_{σ,0})] in the notation of Lemma 3.4) show for
all s ∈ [0, T] that
\[ \Psi_s^{3} = \Phi_s^{3} \circ \Psi_s^{2} \in \mathrm{R}\Big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = \odot_{l=1}^{3}\big[n_d^{\dim(\mathrm{D}(\Phi_{\sigma,0}))} \boxplus \mathrm{D}(\Phi_{\sigma,0})\big]\big\}\Big). \tag{92} \]
Continuing this procedure hence demonstrates that for all s ∈ [0, T], k ∈ [1, K] ∩ N it holds that
\[ \Psi_s^{k} \in \mathrm{R}\Big(\big\{\Phi \in \mathrm{N} : \mathrm{D}(\Phi) = \odot_{l=1}^{k}\big[n_d^{\dim(\mathrm{D}(\Phi_{\sigma,0}))} \boxplus \mathrm{D}(\Phi_{\sigma,0})\big]\big\}\Big). \tag{93} \]
(ii) for all t_1, t_2 ∈ [0, T], s_1 ∈ [t_1, T], s_2 ∈ [t_2, T], θ_1, θ_2 ∈ Θ it holds that D(Y^{θ_1}_{t_1,s_1}) = D(Y^{θ_2}_{t_2,s_2}),
(iii) for all t ∈ [0, T], s ∈ [t, T], θ ∈ Θ it holds that dim(D(Y^θ_{t,s})) = K(max{dim(D(Φ_μ)), dim(D(Φ_{σ,0}))} −
1) + 1,
(iv) for all t ∈ [0, T], s ∈ [t, T], θ ∈ Θ it holds that |||D(Y^θ_{t,s})||| ≤ max{2d, |||D(Φ_μ)|||, |||D(Φ_{σ,0})|||}.
Proof. Without loss of generality assume that dim(D(Φ_μ)) = dim(D(Φ_{σ,0})); otherwise we may refer to
Lemma 3.7. Let Σ ∈ C(R^d, R^{d×(d+1)}) satisfy for all x ∈ R^d, i ∈ {1, ..., d}, j ∈ {1, ..., d + 1} that
\[ (\Sigma(x))_{i,j} = \begin{cases} \mu_i(x), & j = 1, \\ \sigma_{i,j-1}(x), & \text{else}. \end{cases} \tag{96} \]
For all θ ∈ Θ let f^θ : R → R^{d+1} satisfy for all t ∈ [0, T] that
\[ f^{\theta}(t) = \begin{pmatrix} t \\ W_t^{\theta}(\omega) \end{pmatrix}. \tag{99} \]
Observe that for all x ∈ R^d, t ∈ [0, T], θ ∈ Θ it holds that
\[ \Sigma(x)\, f^{\theta}(t) = \mu(x)\,t + \sigma(x)\,W_t^{\theta}(\omega). \tag{100} \]
Hence Lemma 3.8 (applied for all t ∈ [0, T], s ∈ [t, T], θ ∈ Θ with d ↶ d, m ↶ d + 1, K ↶ K, τ_0 ↶ τ_0 ∨ t, τ_1 ↶ τ_1 ∨ t, ..., τ_K ↶ τ_K ∨ t, σ ↶ Σ, f ↶ f^θ, g_s ↶ Y^{θ,·}_{t,s} in the notation of Lemma 3.8) shows for all t ∈ [0, T], s ∈ [t, T], θ ∈ Θ the existence of a Y^θ_{t,s} ∈ N such that

(I) it holds that R(Y^θ_{t,s}) ∈ C(R^d, R^d) and for all x ∈ R^d that (R(Y^θ_{t,s}))(x) = Y^{θ,x}_{t,s}(ω),

(II) it holds that D(Y^θ_{t,s}) = ⊙_{l=1}^{K} [n_d^{dim(D(Φ^{σ,0}))} ⊞ D(Φ^µ) ⊞ D(Φ^{σ,0})].
Observe that D(Y^θ_{t,s}) does not depend on t, s, or θ, which thus implies item (ii). Item (iii) follows from item (II) and the definition of ⊙. Note that for all H_1, H_2, α_0, ..., α_{H_1+1}, β_0, ..., β_{H_2+1} ∈ N, α, β ∈ D satisfying α = (α_0, ..., α_{H_1+1}) and β = (β_0, ..., β_{H_2+1}) with α_0 = β_{H_2+1} it holds that |||α ⊙ β||| ≤ max{|||α|||, |||β|||, 2α_0}. This combined with item (II) implies item (iv). The proof is thus completed.
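The two architecture facts used here can be made concrete. As an illustration (an assumption about the construction, not the paper's formal definitions): if ⊙ composes two architectures through a ReLU identity layer of width 2α_0, as the bound |||α ⊙ β||| ≤ max{|||α|||, |||β|||, 2α_0} suggests, and n_{H+2} denotes the depth-(H + 2) scalar ReLU identity architecture (1, 2, ..., 2, 1), then the bookkeeping reads:

```python
def comp(alpha, beta):
    """Dimension tuple of α ⊙ β: β's output layer (width β_{H2+1} = α_0)
    is routed through a ReLU identity layer of width 2·α_0 into α."""
    assert alpha[0] == beta[-1], "composition needs matching widths"
    return beta[:-1] + (2 * alpha[0],) + alpha[1:]

def identity_tuple(H):
    """n_{H+2}: scalar ReLU identity with H hidden layers of width 2."""
    return (1,) + (2,) * H + (1,)

alpha, beta = (3, 8, 1), (5, 4, 4, 3)
gamma = comp(alpha, beta)
assert gamma == (5, 4, 4, 6, 8, 1)
assert len(gamma) == len(alpha) + len(beta) - 1   # dim(α ⊙ β) = dim(α) + dim(β) − 1
assert max(gamma) <= max(max(alpha), max(beta), 2 * alpha[0])  # the |||·||| bound
assert max(identity_tuple(5)) == 2                # |||n_{H+2}||| = 2
```

With this reading, items (iii) and (iv) are exactly the depth and width bookkeeping checked by the assertions above.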
and let ω ∈ Ω. Then for all n ∈ N_0 there exists a family (Φ^θ_{n,t})_{θ∈Θ, t∈[0,T]} ⊆ N such that

(i) it holds for all t_1, t_2 ∈ [0, T], θ_1, θ_2 ∈ Θ that

    D(Φ^{θ_1}_{n,t_1}) = D(Φ^{θ_2}_{n,t_2}),    (104)
We will prove this lemma via induction on n ∈ N_0. For n = 0 we have for all θ ∈ Θ, x ∈ R^d, t ∈ [0, T] that U^θ_0(t, x, ω) = 0. By choosing all weights and biases to be 0, the 0-function can be represented by deep neural networks of arbitrary shape. Therefore items (i)-(iv) hold true for n = 0. For the induction step n → n + 1 we will construct U^θ_{n+1}(t, x), θ ∈ Θ, t ∈ [0, T], x ∈ R^d, via DNN functions, which we achieve by representing each summand of (103) by a DNN and then, in order to be able to utilize Lemma 3.5, composing each DNN with a suitably sized network representing the identity function on R. Lemma 3.9 yields the existence of a family (Y^θ_{t,s})_{θ∈Θ, s,t∈[0,T]} ⊆ N satisfying for all θ ∈ Θ, t ∈ [0, T], s ∈ [t, T], x ∈ R^d that (R(Y^θ_{t,s}))(x) = Y^{θ,x}_{t,s}(ω), dim(D(Y^θ_{t,s})) = Y, |||D(Y^θ_{t,s})||| ≤ max{2d, |||D(Φ^µ)|||, |||D(Φ^{σ,0})|||}, and D(Y^θ_{t,s}) = D(Y^0_{0,0}). Lemma 3.4 (applied for all θ ∈ Θ, t ∈ [0, T], x ∈ R^d with d_1 ↶ d, d_2 ↶ d, d_3 ↶ 1, α ↶ D(Φ^g), β ↶ D(Y^θ_{t,T}), f ↶ g, g ↶ Y^{θ,x}_{t,T} in the notation of Lemma 3.4) ensures for all θ ∈ Θ, t ∈ [0, T], x ∈ R^d that

    g ◦ Y^{θ,x}_{t,T} ∈ R({Φ ∈ N : D(Φ) = D(Φ^g) ⊙ D(Y^θ_{t,T})}).    (109)
Note that Lemma 3.6 (applied with H ↶ (n + 1)(dim(D(Φ^f)) − 2 + Y) + 1 in the notation of Lemma 3.6) ensures that

    Id_R ∈ R({Φ ∈ N : D(Φ) = n_{(n+1)(dim(D(Φ^f))−2+Y)+1}}).    (110)

This, (109), and Lemma 3.4 (applied for all t ∈ [0, T], θ ∈ Θ, x ∈ R^d with d_1 ↶ d, d_2 ↶ 1, d_3 ↶ 1, f ↶ Id_R, g ↶ g ◦ Y^{θ,x}_{t,T}, α ↶ n_{(n+1)(dim(D(Φ^f))−2+Y)+1}, β ↶ D(Φ^g) ⊙ D(Y^θ_{t,T}) in the notation of Lemma 3.4) show that for all t ∈ [0, T], θ ∈ Θ, x ∈ R^d it holds that

    g ◦ Y^{θ,x}_{t,T} ∈ R({Φ ∈ N : D(Φ) = n_{(n+1)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^g) ⊙ D(Y^θ_{t,T})}).    (111)
Next, the induction hypothesis implies for all l ∈ [0, n] ∩ N_0, t ∈ [0, T], θ ∈ Θ the existence of Φ^θ_{l,t} ∈ N satisfying R(Φ^θ_{l,t}) = U^θ_l(t, ·, ω) and D(Φ^θ_{l,t}) = D(Φ^0_{l,0}). This and Lemma 3.4 (applied for all θ, η ∈ Θ, t ∈ [0, T], l ∈ [0, n] ∩ N_0 with d_1 ↶ d, d_2 ↶ d, d_3 ↶ 1, f ↶ U^η_l(T^θ_t(ω), ·, ω), g ↶ Y^{θ,x}_{t,T^θ_t(ω)}, α ↶ D(Φ^η_{l,T^θ_t(ω)}), β ↶ D(Y^θ_{t,T^θ_t(ω)}) in the notation of Lemma 3.4) show that for all θ, η ∈ Θ, t ∈ [0, T], l ∈ [0, n] ∩ N_0 it holds that

    U^η_l(T^θ_t(ω), Y^{θ,x}_{t,T^θ_t(ω)}(ω), ω) = (R(Φ^η_{l,T^θ_t(ω)}))((R(Y^θ_{t,T^θ_t(ω)}))(x))
        ∈ R({Φ ∈ N : D(Φ) = D(Φ^η_{l,T^θ_t(ω)}) ⊙ D(Y^θ_{t,T^θ_t(ω)})})    (112)
        = R({Φ ∈ N : D(Φ) = D(Φ^0_{l,0}) ⊙ D(Y^0_{0,0})}).
Note that Lemma 3.6 (applied for all l ∈ [1, n − 1] ∩ N with H ↶ (n − l)(dim(D(Φ^f)) − 2 + Y) + 1 in the notation of Lemma 3.6) shows for all l ∈ [1, n − 1] ∩ N that

    Id_R ∈ R({Φ ∈ N : D(Φ) = n_{(n−l)(dim(D(Φ^f))−2+Y)+1}}).    (113)
Furthermore, (112) (applied with l ↶ n) and Lemma 3.4 (applied for all θ, η ∈ Θ, t ∈ [0, T] with d_1 ↶ d, d_2 ↶ 1, d_3 ↶ 1, f ↶ f, g ↶ U^η_n(T^θ_t(ω), Y^{θ,x}_{t,T^θ_t(ω)}(ω), ω), α ↶ D(Φ^f), β ↶ D(Φ^0_{n,0}) ⊙ D(Y^0_{0,0}) in the notation of Lemma 3.4) prove that for all η, θ ∈ Θ, t ∈ [0, T] it holds that

    f ◦ U^η_n(T^θ_t(ω), Y^{θ,x}_{t,T^θ_t(ω)}(ω), ω) ∈ R({Φ ∈ N : D(Φ) = D(Φ^f) ⊙ D(Φ^0_{n,0}) ⊙ D(Y^0_{0,0})}).    (116)
To reiterate what was established so far: for each summand of U^θ_{n+1}, θ ∈ Θ, the existence of a suitable DNN representation was proven. The next step shows that these representing DNNs all have the same depth. Note that Lemma 3.9 implies that for all t ∈ [0, T], s ∈ [t, T], θ ∈ Θ it holds that Y = dim(D(Y^θ_{t,s})). This, the definition of ⊙, and the fact that for all l ∈ [0, n] ∩ N_0 we have that dim(D(Φ^0_{l,0})) = l(dim(D(Φ^f)) − 2) + (l + 1)Y + dim(D(Φ^g)) − 1 imply that

    dim(D(Φ^f) ⊙ D(Φ^0_{n,0}) ⊙ D(Y^0_{0,0}))
    = dim(D(Φ^f)) + dim(D(Φ^0_{n,0})) + dim(D(Y^0_{0,0})) − 2
    = dim(D(Φ^f)) + n(dim(D(Φ^f)) − 2) + (n + 1)Y + dim(D(Φ^g)) − 1 + Y − 2    (119)
    = (n + 1)(dim(D(Φ^f)) − 2) + (n + 2)Y + dim(D(Φ^g)) − 1,

and for all l ∈ [0, n − 1] ∩ N_0 that

    dim(D(Φ^f) ⊙ n_{(n−l)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^0_{l,0}) ⊙ D(Y^0_{0,0}))
    = dim(D(Φ^f)) + dim(n_{(n−l)(dim(D(Φ^f))−2+Y)+1}) + dim(D(Φ^0_{l,0})) + dim(D(Y^0_{0,0})) − 3
    = dim(D(Φ^f)) + (n − l)(dim(D(Φ^f)) − 2 + Y) + 1 + l(dim(D(Φ^f)) − 2) + (l + 1)Y + dim(D(Φ^g)) − 1 + Y − 3    (120)
    = (n + 1)(dim(D(Φ^f)) − 2) + (n + 2)Y + dim(D(Φ^g)) − 1.
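The depth bookkeeping behind (119) and (120) is elementary arithmetic and can be checked mechanically. A small sketch (the sample values for dim(D(Φ^f)), dim(D(Φ^g)), and Y are hypothetical; the depth of Φ^0_{l,0} is taken from the induction hypothesis, cf. (122)):

```python
def depth_phi(l, dim_f, dim_g, Y):
    # depth of Φ^0_{l,0} from the induction hypothesis, cf. (122)
    return l * (dim_f - 2) + (l + 1) * Y + dim_g - 1

def dim_comp(*dims):
    # each application of ⊙ reduces the total depth by one
    return sum(dims) - (len(dims) - 1)

n, dim_f, dim_g, Y = 4, 5, 6, 7
target = (n + 1) * (dim_f - 2) + (n + 2) * Y + dim_g - 1

# summand without identity padding, cf. (119)
assert dim_comp(dim_f, depth_phi(n, dim_f, dim_g, Y), Y) == target

# padded summands with identity network n_{(n-l)(dim_f - 2 + Y) + 1}, cf. (120)
for l in range(0, n):
    pad = (n - l) * (dim_f - 2 + Y) + 1
    assert dim_comp(dim_f, pad, depth_phi(l, dim_f, dim_g, Y), Y) == target
```

All summands hit the same target depth, which is exactly what Lemma 3.5 requires below.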
Since all DNNs that are necessary to construct U^θ_{n+1}, θ ∈ Θ, have the same depth, Lemma 3.5 and (103) imply that there exist (Φ^θ_{n+1,t})_{t∈[0,T], θ∈Θ} ⊆ N satisfying for all θ ∈ Θ, t ∈ [0, T], x ∈ R^d that
    (R(Φ^θ_{n+1,t}))(x)
    = (1/M^{n+1}) Σ_{i=1}^{M^{n+1}} g(Y^{(θ,0,−i),x}_{t,T}(ω))
    + ((T − t)/M) Σ_{i=1}^{M} (f ◦ U^{(θ,n,i)}_n)(T^{(θ,n,i)}_t(ω), Y^{(θ,n,i),x}_{t,T^{(θ,n,i)}_t(ω)}(ω), ω)
    + Σ_{l=0}^{n−1} ((T − t)/M^{n+1−l}) Σ_{i=1}^{M^{n+1−l}} (f ◦ U^{(θ,l,i)}_l)(T^{(θ,l,i)}_t(ω), Y^{(θ,l,i),x}_{t,T^{(θ,l,i)}_t(ω)}(ω), ω)    (121)
    − Σ_{l=1}^{n} ((T − t)/M^{n+1−l}) Σ_{i=1}^{M^{n+1−l}} (f ◦ U^{(θ,−l,i)}_{l−1})(T^{(θ,l,i)}_t(ω), Y^{(θ,l,i),x}_{t,T^{(θ,l,i)}_t(ω)}(ω), ω)
    = U^θ_{n+1}(t, x, ω),

that

    dim(D(Φ^θ_{n+1,t})) = (n + 1)(dim(D(Φ^f)) − 2) + (n + 2)Y + dim(D(Φ^g)) − 1,    (122)
and that

    D(Φ^θ_{n+1,t}) = (⊞_{i=1}^{M^{n+1}} [n_{(n+1)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^g) ⊙ D(Y^0_{0,0})])
    ⊞ (⊞_{i=1}^{M} [D(Φ^f) ⊙ D(Φ^0_{n,0}) ⊙ D(Y^0_{0,0})])
    ⊞ (⊞_{l=0}^{n−1} ⊞_{i=1}^{M^{n+1−l}} [D(Φ^f) ⊙ n_{(n−l)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^0_{l,0}) ⊙ D(Y^0_{0,0})])    (123)
    ⊞ (⊞_{l=1}^{n} ⊞_{i=1}^{M^{n+1−l}} [D(Φ^f) ⊙ n_{(n−l+1)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^0_{l−1,0}) ⊙ D(Y^0_{0,0})]).
Additionally, (123) and Lemma 3.3 prove for all t ∈ [0, T], θ ∈ Θ that

    |||D(Φ^θ_{n+1,t})||| ≤ Σ_{i=1}^{M^{n+1}} |||n_{(n+1)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^g) ⊙ D(Y^0_{0,0})|||
    + Σ_{i=1}^{M} |||D(Φ^f) ⊙ D(Φ^0_{n,0}) ⊙ D(Y^0_{0,0})|||
    + Σ_{l=0}^{n−1} Σ_{i=1}^{M^{n+1−l}} |||D(Φ^f) ⊙ n_{(n−l)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^0_{l,0}) ⊙ D(Y^0_{0,0})|||    (125)
    + Σ_{l=1}^{n} Σ_{i=1}^{M^{n+1−l}} |||D(Φ^f) ⊙ n_{(n−l+1)(dim(D(Φ^f))−2+Y)+1} ⊙ D(Φ^0_{l−1,0}) ⊙ D(Y^0_{0,0})|||.
Note that for all H_1, H_2, α_0, ..., α_{H_1+1}, β_0, ..., β_{H_2+1} ∈ N, α, β ∈ D satisfying α = (α_0, ..., α_{H_1+1}) and β = (β_0, ..., β_{H_2+1}) with α_0 = β_{H_2+1} it holds that |||α ⊙ β||| ≤ max{|||α|||, |||β|||, 2α_0}. This, (125), the fact that for all H ∈ N it holds that |||n_{H+2}||| = 2, and the induction hypothesis applied for all l ∈ [0, n] ∩ N_0 establish (127). Finally, (121), (122), (124), and (127) complete the induction step. By the principle of induction the proof is thus completed.
‖A‖_F = (Σ_{i,j=1}^{d} (a_{i,j})²)^{1/2}, let ⟨·, ·⟩ : (∪_{d∈N} R^d × R^d) → R satisfy for all d ∈ N, x = (x_1, ..., x_d), y = (y_1, ..., y_d) ∈ R^d that ⟨x, y⟩ = Σ_{i=1}^{d} x_i y_i, for all d ∈ N let g_d ∈ C(R^d, R), µ_d = (µ_{d,i})_{i∈{1,...,d}} ∈ C(R^d, R^d), σ_d = (σ_{d,i,j})_{i,j∈{1,...,d}} ∈ C(R^d, R^{d×d}), let f ∈ C(R, R), assume for all v, w ∈ R that |f(v) − f(w)| ≤ c|v − w|, for every ε ∈ (0, 1], d ∈ N, v ∈ R^d let σ̂_{d,ε} ∈ C(R^d, R^{d×d}), g_{d,ε}, µ̃_{d,ε}, σ̃_{d,ε,v} ∈ N, assume for all d ∈ N, x, y, v ∈ R^d, ε ∈ (0, 1] that R(g_{d,ε}) ∈ C(R^d, R), R(µ̃_{d,ε}) ∈ C(R^d, R^d), R(σ̃_{d,ε,v}) ∈ C(R^d, R^d), R(σ̃_{d,ε,v}) = σ̂_{d,ε}(·)v, D(σ̃_{d,ε,v}) = D(σ̃_{d,ε,0}),

    |(R(g_{d,ε}))(x)| ≤ b(d^{2c} + ‖x‖²)^{1/2},    max{‖(R(µ̃_{d,ε}))(0)‖, ‖σ̂_{d,ε}(0)‖_F} ≤ cd^c,    (128)

    max{|g_d(x) − (R(g_{d,ε}))(x)|, ‖µ_d(x) − (R(µ̃_{d,ε}))(x)‖, ‖σ_d(x) − σ̂_{d,ε}(x)‖_F} ≤ εBd^p (d^{2c} + ‖x‖²)^q,    (129)

    |(R(g_{d,ε}))(x) − (R(g_{d,ε}))(y)| ≤ bT^{−1/2} (2d^{2c} + ‖x‖² + ‖y‖²)^{1/2} ‖x − y‖,    (130)

    max{‖(R(µ̃_{d,ε}))(x) − (R(µ̃_{d,ε}))(y)‖, ‖σ̂_{d,ε}(x) − σ̂_{d,ε}(y)‖_F} ≤ c‖x − y‖,    (131)

    max{|||D(g_{d,ε})|||, |||D(µ̃_{d,ε})|||, |||D(σ̃_{d,ε,0})|||} ≤ Bd^p ε^{−α},    (132)

    max{dim(D(g_{d,ε})), dim(D(µ̃_{d,ε})), dim(D(σ̃_{d,ε,0}))} ≤ Bd^p ε^{−β},    (133)

and for all d ∈ N let ν_d be a probability measure on (R^d, B(R^d)) satisfying (∫_{R^d} ‖y‖^{2q+1} ν_d(dy))^{1/(2q+1)} ≤ Bd^p.
Then
(i) for every d ∈ N there exists a unique viscosity solution u_d ∈ {u ∈ C([0, T] × R^d, R) : sup_{s∈[0,T]} sup_{y∈R^d} |u(s,y)|/(1 + ‖y‖) < ∞} of

    (∂/∂t u_d)(t, x) + ⟨µ_d(x), (∇_x u_d)(t, x)⟩ + ½ Tr(σ_d(x)[σ_d(x)]* (Hess_x u_d)(t, x)) + f(u_d(t, x)) = 0    (134)

with u_d(T, x) = g_d(x) for t ∈ (0, T), x ∈ R^d,

(ii) there exist C = (C_γ)_{γ∈(0,1]} : (0, 1] → (0, ∞), η ∈ (0, ∞), and (Ψ_{d,ε})_{d∈N, ε∈(0,1]} ⊆ N such that for all d ∈ N, γ, ε ∈ (0, 1] it holds that R(Ψ_{d,ε}) ∈ C(R^d, R), P(Ψ_{d,ε}) ≤ C_γ d^η ε^{−(6+2α+β+γ)}, and

    (∫_{R^d} |u_d(0, x) − (R(Ψ_{d,ε}))(x)|² ν_d(dx))^{1/2} ≤ ε.    (135)
Note that the triangle inequality, (129), and (131) imply for all d ∈ N, x, y ∈ R^d, ε ∈ (0, 1] that

    ‖µ_d(x) − µ_d(y)‖ ≤ ‖µ_d(x) − (R(µ̃_{d,ε}))(x)‖ + ‖(R(µ̃_{d,ε}))(x) − (R(µ̃_{d,ε}))(y)‖ + ‖(R(µ̃_{d,ε}))(y) − µ_d(y)‖    (137)
    ≤ εBd^p [(d^{2c} + ‖x‖²)^q + (d^{2c} + ‖y‖²)^q] + c‖x − y‖.
In the same manner it follows for all d ∈ N, x, y ∈ R^d that

    |g_d(x) − g_d(y)| ≤ bT^{−1/2} (2d^{2c} + ‖x‖² + ‖y‖²)^{1/2} ‖x − y‖.    (140)
Note that due to the triangle inequality, (128), and (129) it holds for all d ∈ N, ε ∈ (0, 1] that

    max{‖µ_d(0)‖, ‖σ_d(0)‖_F} ≤ max{‖µ_d(0) − (R(µ̃_{d,ε}))(0)‖, ‖σ_d(0) − σ̂_{d,ε}(0)‖_F} + cd^c ≤ εBd^{p+2qc} + cd^c.    (141)
Next, the triangle inequality, (128), and (129) ensure for all d ∈ N, x ∈ R^d, ε ∈ (0, 1] that

    |g_d(x)| ≤ |g_d(x) − (R(g_{d,ε}))(x)| + |(R(g_{d,ε}))(x)| ≤ εBd^p (d^{2c} + ‖x‖²)^q + b(d^{2c} + ‖x‖²)^{1/2}.    (143)
Next, [23, Corollary 3.13] (applied for all ε ∈ (0, 1] with L ↶ c, q ↶ q, f ↶ f, ε ↶ ε in the notation of [23, Corollary 3.13]) ensures the existence of f_ε ∈ N, ε ∈ (0, 1], which satisfy for all v, w ∈ R, ε ∈ (0, 1] that R(f_ε) ∈ C(R, R), |(R(f_ε))(v) − (R(f_ε))(w)| ≤ c|v − w|, |f(v) − (R(f_ε))(v)| ≤ ε(1 + |v|^q), dim(D(f_ε)) = 3, and

    |||D(f_ε)||| ≤ 16 max{1, |c(4c + 2|f(0)|)|^{1/(q−1)}} ε^{−q/(q−1)}.    (145)
Due to the fact that B ≥ 1 + |f(0)| it holds for all ε ∈ (0, 1] that

    max{|f(0)|, |(R(f_ε))(0)|} ≤ b/T + B = (b + BT)/T.    (147)

Next, for all d ∈ N let ϕ_d ∈ C²(R^d, [1, ∞)) satisfy for all x ∈ R^d that ϕ_d(x) = d^{2c} + ‖x‖². Hence for all d ∈ N, x ∈ R^d we have

    (∇_x ϕ_d)(x) = 2x    and    (Hess_x ϕ_d)(x) = 2 Id_{R^d}.    (148)
By the Cauchy-Schwarz inequality it thus holds for all d ∈ N, x ∈ R^d, z ∈ R^d \ {0} that ⟨z, (Hess_x ϕ_d)(x) z⟩/‖z‖² ≤ 2.
Next, Jensen's inequality ensures that for all d ∈ N, x ∈ R^d we have (d^c + ‖x‖)² ≤ 2(d^{2c} + ‖x‖²) = 2ϕ_d(x), which in turn implies that d^c + ‖x‖ ≤ √2 (ϕ_d(x))^{1/2}. This and (141) ensure for all d ∈ N, x ∈ R^d, ε ∈ (0, 1] that

    [c‖x‖ + max{‖µ_d(0)‖, ‖σ_d(0)‖_F, ‖(R(µ̃_{d,ε}))(0)‖, ‖σ̂_{d,ε}(0)‖_F}]/(ϕ_d(x))^{1/2} ≤ c(d^c + ‖x‖)/(ϕ_d(x))^{1/2} ≤ √2 c ≤ 2c.    (151)
Next, note that for all d ∈ N it holds that

    sup_{r∈(0,∞)} [inf_{x∈R^d, ‖x‖≥r} ϕ_d(x)] = ∞    and    inf_{r∈(0,∞)} [sup_{x∈R^d, ‖x‖>r} (|f(0)| + |g_d(x)|)/ϕ_d(x)] = 0.    (152)
Lemma 2.1(i) (applied for all d ∈ N with d ↶ d, m ↶ d, c ↶ 2c, κ ↶ 1, p ↶ 2, ϕ ↶ ϕ_d, µ ↶ µ_d, σ ↶ σ_d in the notation of Lemma 2.1) ensures for all d ∈ N, x ∈ R^d that

    ⟨µ_d(x), (∇ϕ_d)(x)⟩ + ½ Tr(σ_d(x)[σ_d(x)]* (Hess ϕ_d)(x)) ≤ 12c³.    (153)

Analogously we get for all d ∈ N, x ∈ R^d, ε ∈ (0, 1] that

    ⟨(R(µ̃_{d,ε}))(x), (∇ϕ_d)(x)⟩ + ½ Tr(σ̂_{d,ε}(x)[σ̂_{d,ε}(x)]* (Hess ϕ_d)(x)) ≤ 12c³.    (154)
2
Let (Ω, F, P, (F_t)_{t∈[0,T]}) be a filtered probability space satisfying the usual conditions and for all d ∈ N let W^d : [0, T] × Ω → R^d be a standard (F_t)_{t∈[0,T]}-Brownian motion. Then [4, Corollary 3.2] (applied twice: the first application for all d ∈ N with d ↶ d, m ↶ d, T ↶ T, L ↶ c, ρ ↶ 12c³, O ↶ R^d, µ ↶ ([0, T] × R^d ∋ (t, x) ↦ µ_d(x) ∈ R^d), σ ↶ ([0, T] × R^d ∋ (t, x) ↦ σ_d(x) ∈ R^{d×d}), g ↶ g_d, f ↶ ([0, T] × R^d × R ∋ (t, x, v) ↦ f(v) ∈ R), V ↶ ϕ_d in the notation of [4, Corollary 3.2]; the second application for all d ∈ N, ε ∈ (0, 1] with d ↶ d, m ↶ d, T ↶ T, L ↶ c, ρ ↶ 12c³, O ↶ R^d, µ ↶ ([0, T] × R^d ∋ (t, x) ↦ (R(µ̃_{d,ε}))(x) ∈ R^d), σ ↶ ([0, T] × R^d ∋ (t, x) ↦ σ̂_{d,ε}(x) ∈ R^{d×d}), g ↶ R(g_{d,ε}), f ↶ R(f_ε), V ↶ ϕ_d in the notation of [4, Corollary 3.2]) guarantees that
(I) for every d ∈ N there exists a unique viscosity solution u_d ∈ {u ∈ C([0, T] × R^d, R) : lim sup_{r→∞} [sup_{s∈[0,T]} sup_{x∈R^d, ‖x‖>r} |u(s,x)|/ϕ_d(x)] = 0} of

    (∂/∂t u_d)(t, x) + ⟨µ_d(x), (∇_x u_d)(t, x)⟩ + ½ Tr(σ_d(x)[σ_d(x)]* (Hess_x u_d)(t, x)) = −f(u_d(t, x))    (155)

with u_d(T, x) = g_d(x) for t ∈ (0, T), x ∈ R^d,
(II) for every d ∈ N, x ∈ R^d, t ∈ [0, T], ε ∈ (0, 1] there exist up to indistinguishability unique (F_s)_{s∈[t,T]}-adapted stochastic processes X^{x,d}_t = (X^{x,d}_{t,s})_{s∈[t,T]} : [t, T] × Ω → R^d and X^{x,d,ε}_t = (X^{x,d,ε}_{t,s})_{s∈[t,T]} : [t, T] × Ω → R^d with continuous sample paths satisfying that for all s ∈ [t, T] we have P-a.s. that

    X^{x,d}_{t,s} = x + ∫_t^s µ_d(X^{x,d}_{t,r}) dr + ∫_t^s σ_d(X^{x,d}_{t,r}) dW^d_r,
    X^{x,d,ε}_{t,s} = x + ∫_t^s (R(µ̃_{d,ε}))(X^{x,d,ε}_{t,r}) dr + ∫_t^s σ̂_{d,ε}(X^{x,d,ε}_{t,r}) dW^d_r,    (156)
(III) for every d ∈ N, ε ∈ (0, 1] there exist unique v_d ∈ C([0, T] × R^d, R) and v_{d,ε} ∈ C([0, T] × R^d, R) which satisfy for all t ∈ [0, T], x ∈ R^d that

    sup_{s∈[0,T]} sup_{y∈R^d} [|v_d(s, y)|/(ϕ_d(y))^{1/2}] + sup_{s∈[0,T]} sup_{y∈R^d} [|v_{d,ε}(s, y)|/(ϕ_d(y))^{1/2}] + E[|(R(g_{d,ε}))(X^{x,d,ε}_{t,T})|]
    + E[|g_d(X^{x,d}_{t,T})|] + ∫_t^T E[|f(v_d(s, X^{x,d}_{t,s}))|] ds + ∫_t^T E[|(R(f_ε))(v_{d,ε}(s, X^{x,d,ε}_{t,s}))|] ds < ∞,    (157)

    v_d(t, x) = E[g_d(X^{x,d}_{t,T})] + ∫_t^T E[f(v_d(s, X^{x,d}_{t,s}))] ds,    (158)

    v_{d,ε}(t, x) = E[(R(g_{d,ε}))(X^{x,d,ε}_{t,T})] + ∫_t^T E[(R(f_ε))(v_{d,ε}(s, X^{x,d,ε}_{t,s}))] ds.    (159)
Item (II), the Lipschitz continuity of µ_d, σ_d, R(µ̃_{d,ε}), σ̂_{d,ε}, d ∈ N, ε ∈ (0, 1], and, e.g., [26, Lemma 13.6] ensure for all d ∈ N, x ∈ R^d, ε ∈ (0, 1], t ∈ [0, T], s ∈ [t, T], r ∈ [s, T] that

    P(X^{x,d}_{t,r} = X^{X^{x,d}_{t,s},d}_{s,r}) = P(X^{x,d,ε}_{t,r} = X^{X^{x,d,ε}_{t,s},d,ε}_{s,r}) = 1.    (160)
Next, item (III) ensures for all d ∈ N, x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that u_d(s, X^{x,d}_{t,s}) is integrable. This, (160), (144), (147), and Lemma 2.1(ii) (applied for all d ∈ N with d ↶ d, m ↶ d, c ↶ 2c, κ ↶ 1, p ↶ 2, ϕ ↶ ϕ_d, µ ↶ µ_d, σ ↶ σ_d in the notation of Lemma 2.1) prove for all d ∈ N, x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that

    E[|u_d(s, X^{x,d}_{t,s})|] ≤ E[|g_d(X^{X^{x,d}_{t,s},d}_{s,T})|] + ∫_s^T E[|f(u_d(r, X^{X^{x,d}_{t,s},d}_{s,r}))|] dr
    = E[|g_d(X^{x,d}_{t,T})|] + ∫_s^T E[|f(u_d(r, X^{x,d}_{t,r}))|] dr
    ≤ b E[(d^{2c} + ‖X^{x,d}_{t,T}‖²)^{1/2}] + (T − s)|f(0)| + c ∫_s^T E[|u_d(r, X^{x,d}_{t,r})|] dr    (161)
    ≤ b e^{6c³(T−t)} (d^{2c} + ‖x‖²)^{1/2} + b + BT + c ∫_s^T E[|u_d(r, X^{x,d}_{t,r})|] dr.
This and Gronwall's inequality ensure for all d ∈ N, x ∈ R^d, t ∈ [0, T], s ∈ [t, T] that

    E[|u_d(s, X^{x,d}_{t,s})|] ≤ 2(b + BT) e^{7c³(T−t)} (d^{2c} + ‖x‖²)^{1/2}.    (162)
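The Gronwall step from (161) to (162) follows the standard backward pattern: a bound of the form u(s) ≤ a + c ∫_s^T u(r) dr yields u(s) ≤ a e^{c(T−s)}. A toy numerical illustration (the constants a, c, T below are hypothetical and unrelated to the paper's):

```python
import math

T, a, c, n = 1.0, 2.0, 0.7, 10_000
h = T / n

# build the extremal u solving u(s) = a + c * ∫_s^T u(r) dr on a backward grid
u = [0.0] * (n + 1)
u[n] = a
integral = 0.0
for i in range(n - 1, -1, -1):
    integral += h * u[i + 1]       # Riemann sum of ∫_{s_i}^{T} u(r) dr
    u[i] = a + c * integral

# the discrete solution equals a * (1 + c*h)^n, which sits below a * e^{cT}
assert a < u[0] <= a * math.exp(c * T)
```

The same mechanism, with a = b e^{6c³(T−t)}(d^{2c} + ‖x‖²)^{1/2} + b + BT, gives the constant in (162).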
In order to prove item (ii), we will combine Lemma 3.10 with Corollary 2.5, for which we still need a suitable MLP setting.
Let Θ = ∪_{n∈N} Z^n, let t^θ : Ω → [0, 1], θ ∈ Θ, be i.i.d. random variables, assume for all t ∈ (0, 1) that P(t^0 ≤ t) = t, let T^θ : [0, T] × Ω → [0, T] satisfy for all θ ∈ Θ, t ∈ [0, T] that T^θ_t = t + (T − t)t^θ, let W^{θ,d} : [0, T] × Ω → R^d, θ ∈ Θ, d ∈ N, be i.i.d. standard (F_t)_{t∈[0,T]}-Brownian motions, assume that (t^θ)_{θ∈Θ} and (W^{θ,d})_{θ∈Θ,d∈N} are independent, for every δ ∈ (0, 1], d, N ∈ N, θ ∈ Θ, x ∈ R^d, t ∈ [0, T] let
Y^{N,d,θ,x,δ}_t = (Y^{N,d,θ,x,δ}_{t,s})_{s∈[t,T]} : [t, T] × Ω → R^d satisfy for all n ∈ {0, 1, ..., N − 1}, s ∈ [nT/N, (n+1)T/N] ∩ [t, T] that Y^{N,d,θ,x,δ}_{t,t} = x and

    Y^{N,d,θ,x,δ}_{t,s} − Y^{N,d,θ,x,δ}_{t,max{t,nT/N}}
    = (R(µ̃_{d,δ}))(Y^{N,d,θ,x,δ}_{t,max{t,nT/N}}) (s − max{t, nT/N}) + σ̂_{d,δ}(Y^{N,d,θ,x,δ}_{t,max{t,nT/N}}) (W^{θ,d}_s − W^{θ,d}_{max{t,nT/N}}),    (165)
θ
let U^θ_{n,M,d,δ} : [0, T] × R^d × Ω → R, n, M ∈ Z, d ∈ N, θ ∈ Θ, δ ∈ (0, 1], satisfy for all d ∈ N, θ ∈ Θ, n ∈ N_0, t ∈ [0, T], x ∈ R^d, δ ∈ (0, 1] that

    U^θ_{n,M,d,δ}(t, x) = (1_N(n)/M^n) Σ_{i=1}^{M^n} (R(g_{d,δ}))(Y^{M^M,d,(θ,0,−i),x,δ}_{t,T})
    + Σ_{l=0}^{n−1} ((T − t)/M^{n−l}) Σ_{i=1}^{M^{n−l}} (R(f_δ) ◦ U^{(θ,l,i)}_{l,M,d,δ} − 1_N(l) R(f_δ) ◦ U^{(θ,−l,i)}_{l−1,M,d,δ})(T^{(θ,l,i)}_t, Y^{M^M,d,(θ,l,i),x,δ}_{t,T^{(θ,l,i)}_t}),    (166)
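The scheme (165)-(166) can be sketched in code. The following toy one-dimensional illustration replaces the network realizations R(µ̃_{d,δ}), σ̂_{d,δ}, R(g_{d,δ}), R(f_δ) by hand-picked functions (all of these choices are hypothetical), draws fresh randomness per call in place of the θ-indexed families (t^θ)_{θ∈Θ} and (W^{θ,d})_{θ∈Θ,d∈N}, and uses K = M^M Euler steps as in the application below:

```python
import math, random

random.seed(0)

T = 1.0
mu = lambda x: -0.5 * x        # stand-in for R(µ̃_{d,δ})
sig = lambda x: 1.0            # stand-in for σ̂_{d,δ}
g = lambda x: math.cos(x)      # stand-in for R(g_{d,δ})
f = lambda v: math.sin(v)      # stand-in for R(f_δ), Lipschitz

def euler(t, x, s, K):
    """Euler-Maruyama approximation of X_{t,s} started at x, K steps, cf. (165)."""
    h = (s - t) / K
    for _ in range(K):
        x += mu(x) * h + sig(x) * math.sqrt(h) * random.gauss(0.0, 1.0)
    return x

def U(n, M, t, x):
    """MLP approximation U_{n,M} of u(t, x), cf. (166)."""
    if n == 0:
        return 0.0
    K = M ** M
    # terminal-condition Monte Carlo term
    out = sum(g(euler(t, x, T, K)) for _ in range(M ** n)) / M ** n
    # multilevel correction terms, times sampled via R = t + (T - t) * t^θ
    for l in range(n):
        for _ in range(M ** (n - l)):
            r = t + (T - t) * random.random()
            y = euler(t, x, r, K)
            d = f(U(l, M, r, y)) - (f(U(l - 1, M, r, y)) if l > 0 else 0.0)
            out += (T - t) / M ** (n - l) * d
    return out

print(U(3, 2, 0.0, 0.0))
```

This is only a sketch of the recursion's structure; the convergence analysis in the paper of course depends on the precise coupling and integrability assumptions, not on this toy setup.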
and for all d ∈ N, ε ∈ (0, 1] let δ_{d,ε} = ε/(4Bd^p c_d). Observe that for all γ ∈ (0, 1] it holds that

    C_γ ≤ sup_{n∈N∩[2,∞)} [(3n)^{3n+1} ((e^{(4cT+1/2)(n−1)} + 1)/(n−1)^{(n−1)/2})^{6+γ}]
    ≤ sup_{n∈N∩[2,∞)} [(3n)^{3n+1} (e^{(4cT+3/2)(n−1)}/(n−1)^{(n−1)/2})^{6+γ}]
    = sup_{n∈N∩[2,∞)} [3^{3n+1} n^4 e^{(4cT+3/2)(n−1)(6+γ)}/(n−1)^{γ(n−1)/2} · (n/(n−1))^{3(n−1)}]    (171)
    ≤ [sup_{n∈N∩[2,∞)} 3^{3n+1} n^4 e^{(4cT+3/2)(6+γ)(n−1)}/(n−1)^{γ(n−1)/2}] [sup_{n∈N∩[2,∞)} (n/(n−1))^{3(n−1)}]
    < ∞.
Note that the definition of (c_d)_{d∈N} implies that there exists a c̃ ∈ [1, ∞) such that for all d ∈ N it holds that

    c_d = c̃ d^{(c+p)(q+1)}.    (172)
Corollary 2.5 (applied for every d, M ∈ N, n ∈ N_0, x ∈ R^d, δ ∈ (0, 1] with δ ↶ δBd^p, c ↶ 2c, b ↶ b + BT, p ↶ 2, q ↶ q, ϕ ↶ ϕ_d, f_1 ↶ ([0, T] × R^d × R ∋ (t, x, v) ↦ f(v) ∈ R), f_2 ↶ ([0, T] × R^d × R ∋ (t, x, v) ↦ (R(f_δ))(v) ∈ R), g_1 ↶ g_d, g_2 ↶ R(g_{d,δ}), µ_1 ↶ µ_d, µ_2 ↶ R(µ̃_{d,δ}), σ_1 ↶ σ_d, σ_2 ↶ σ̂_{d,δ}, u_1 ↶ u_d in the notation of Corollary 2.5) shows that for all d, M ∈ N, n ∈ N_0, x ∈ R^d, δ ∈ (0, 1] it holds that

    (E[|U^0_{n,M,d,δ}(0, x) − u_d(0, x)|²])^{1/2} ≤ 4^{q+1} b^q c² (T + 1) e^{q(32qc⁴+24c³)T + (4cT)^q + (2c+1)²}
    · (d^{2c} + ‖x‖²)^{(q+1)/2} [δBd^p + e^{4ncT + M/2}/M^{n/2} + 1/M^{M/2}].    (173)
This, the definitions of (N_{d,ε})_{d∈N,ε∈(0,1]} and (δ_{d,ε})_{d∈N,ε∈(0,1]}, and Fubini's theorem show for all d ∈ N, ε ∈ (0, 1] that

    ∫_{R^d} E[|U^0_{N_{d,ε},N_{d,ε},d,δ_{d,ε}}(0, x) − u_d(0, x)|²] ν_d(dx)
    = E[∫_{R^d} |U^0_{N_{d,ε},N_{d,ε},d,δ_{d,ε}}(0, x) − u_d(0, x)|² ν_d(dx)] ≤ ε²/4 + ε²/2 ≤ ε².    (175)
This shows that for all d ∈ N, ε ∈ (0, 1] there exists an ω_{d,ε} ∈ Ω such that

    ∫_{R^d} |U^0_{N_{d,ε},N_{d,ε},d,δ_{d,ε}}(0, x, ω_{d,ε}) − u_d(0, x)|² ν_d(dx) ≤ ε².    (176)
This and Lemma 3.10 (applied for all d ∈ N, v ∈ R^d, ε ∈ (0, 1] with c ↶ k_{d,δ_{d,ε}}, d ↶ d, K ↶ (N_{d,ε})^{N_{d,ε}}, M ↶ N_{d,ε}, T ↶ T, Φ^µ ↶ µ̃_{d,δ_{d,ε}}, Φ^{σ,v} ↶ σ̃_{d,δ_{d,ε},v}, Φ^f ↶ f_{δ_{d,ε}}, Φ^g ↶ g_{d,δ_{d,ε}} in the notation of Lemma 3.10) imply that for all d ∈ N, ε ∈ (0, 1] there exists a deep neural network Ψ_{d,ε} ∈ N such that for all x ∈ R^d it holds that R(Ψ_{d,ε}) ∈ C(R^d, R), (R(Ψ_{d,ε}))(x) = U^0_{N_{d,ε},N_{d,ε},d,δ_{d,ε}}(0, x, ω_{d,ε}), |||D(Ψ_{d,ε})||| ≤ k_{d,δ_{d,ε}} (3N_{d,ε})^{N_{d,ε}}, and
    dim(D(Ψ_{d,ε})) ≤ (N_{d,ε} + 1)(N_{d,ε})^{N_{d,ε}} [max{dim(D(µ̃_{d,δ_{d,ε}})), dim(D(σ̃_{d,δ_{d,ε},0}))} − 1] + N_{d,ε} + dim(D(g_{d,δ_{d,ε}})) − 1
    ≤ N_{d,ε} + Bd^p (δ_{d,ε})^{−β} + (N_{d,ε} + 1)(N_{d,ε})^{N_{d,ε}} (Bd^p (δ_{d,ε})^{−β} − 1) − 1    (177)
    ≤ N_{d,ε} Bd^p (δ_{d,ε})^{−β} + (N_{d,ε} + 1)^{N_{d,ε}+1} Bd^p (δ_{d,ε})^{−β}
    ≤ 2(N_{d,ε} + 1)^{N_{d,ε}+1} Bd^p (δ_{d,ε})^{−β}.
Hence we get for all d ∈ N, ε ∈ (0, 1] that

    P(Ψ_{d,ε}) ≤ Σ_{j=1}^{dim(D(Ψ_{d,ε}))} k_{d,δ_{d,ε}} (3N_{d,ε})^{N_{d,ε}} (k_{d,δ_{d,ε}} (3N_{d,ε})^{N_{d,ε}} + 1)
    = dim(D(Ψ_{d,ε})) [(k_{d,δ_{d,ε}})² (3N_{d,ε})^{2N_{d,ε}} + k_{d,δ_{d,ε}} (3N_{d,ε})^{N_{d,ε}}]    (178)
    ≤ 2(N_{d,ε} + 1)^{N_{d,ε}+1} Bd^p (δ_{d,ε})^{−β} · 2(k_{d,δ_{d,ε}})² (3N_{d,ε})^{2N_{d,ε}}
    ≤ 4Bd^p (δ_{d,ε})^{−β} (k_{d,δ_{d,ε}})² (3N_{d,ε})^{3N_{d,ε}+1}.
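The first line of (178) is the usual layerwise parameter count: for an architecture D = (D_0, ..., D_L) one has P(Φ) = Σ_{j=1}^{L} D_j(D_{j−1} + 1) (a standard convention assumed here), and each summand is at most |||D|||(|||D||| + 1). A minimal sketch with a hypothetical tuple:

```python
def params(D):
    """Weights D_j × D_{j-1} plus biases D_j for each affine layer j."""
    return sum(D[j] * (D[j - 1] + 1) for j in range(1, len(D)))

D = (3, 10, 7, 1)          # hypothetical architecture
w = max(D)                  # plays the role of |||D(Ψ)|||
assert params(D) == 125
assert params(D) <= len(D) * w * (w + 1)   # the coarse bound mirrored in (178)
```

Summing the crude per-layer bound over the depth is exactly how the product dim(D(Ψ_{d,ε})) · k²(3N)^{2N} arises above.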
Note that the definition of (k_{d,ε})_{d∈N,ε∈(0,1]} implies for all d ∈ N, ε ∈ (0, 1] that

    k_{d,ε} = max{|||D(f_ε)|||, |||D(g_{d,ε})|||, |||D(µ̃_{d,ε})|||, |||D(σ̃_{d,ε,0})|||, 2}
    ≤ max{Bε^{−2}, Bd^p ε^{−α}, 2}    (179)
    ≤ max{Bε^{−α}, Bd^p ε^{−α}, 2} ≤ Bd^p ε^{−α}.
This shows that for all d ∈ N, ε ∈ (0, 1] it holds that

    P(Ψ_{d,ε}) ≤ 4Bd^p (δ_{d,ε})^{−β} (k_{d,δ_{d,ε}})² (3N_{d,ε})^{3N_{d,ε}+1} ≤ 4B³ d^{3p} (δ_{d,ε})^{−2α−β} (3N_{d,ε})^{3N_{d,ε}+1}.    (180)
Next, note that the definition of (N_{d,ε})_{d∈N,ε∈(0,1]} implies that for all d ∈ N, ε ∈ (0, 1] it holds that

    ε ≤ 2c_d [exp((4cT + 1/2)(N_{d,ε} − 1)) + 1]/(N_{d,ε} − 1)^{(N_{d,ε}−1)/2}.    (181)
This and the definition of (C_γ)_{γ∈(0,1]} imply for all d ∈ N, γ, ε ∈ (0, 1] that

    P(Ψ_{d,ε}) ≤ 4B³ d^{3p} (δ_{d,ε})^{−2α−β} (3N_{d,ε})^{3N_{d,ε}+1} ε^{6+γ} ε^{−6−γ}
    ≤ 4B³ d^{3p} (δ_{d,ε})^{−2α−β} ε^{−6−γ} (3N_{d,ε})^{3N_{d,ε}+1} (2c_d [exp((4cT + 1/2)(N_{d,ε} − 1)) + 1]/(N_{d,ε} − 1)^{(N_{d,ε}−1)/2})^{6+γ}    (182)
    ≤ 2^{8+γ} B³ d^{3p} (δ_{d,ε})^{−2α−β} ε^{−6−γ} (c_d)^{6+γ} sup_{n∈N∩[2,∞)} [(3n)^{3n+1} ((e^{(4cT+1/2)(n−1)} + 1)/(n−1)^{(n−1)/2})^{6+γ}]
4.2 Deep neural network approximations with positive polynomial convergence rates

This corollary is an application of Theorem 4.1, where for all d ∈ N we set the probability measure ν_d to be the d-dimensional Lebesgue measure restricted to the d-dimensional unit cube [0, 1]^d.
Corollary 4.2. Assume Setting 3.1, let c, T ∈ (0, ∞), let ‖·‖_F : (∪_{d∈N} R^{d×d}) → [0, ∞) satisfy for all d ∈ N, A = (a_{i,j})_{i,j∈{1,...,d}} ∈ R^{d×d} that ‖A‖_F = (Σ_{i,j=1}^{d} (a_{i,j})²)^{1/2}, let ⟨·, ·⟩ : (∪_{d∈N} R^d × R^d) → R satisfy for all d ∈ N, x = (x_1, ..., x_d), y = (y_1, ..., y_d) ∈ R^d that ⟨x, y⟩ = Σ_{i=1}^{d} x_i y_i, let f ∈ C(R, R), for all d ∈ N let g_d ∈ C(R^d, R), u_d ∈ C^{1,2}([0, T] × R^d, R), µ_d = (µ_{d,i})_{i∈{1,...,d}} ∈ C(R^d, R^d), σ_d = (σ_{d,i,j})_{i,j∈{1,...,d}} ∈ C(R^d, R^{d×d}) satisfy for all t ∈ [0, T], x = (x_1, ..., x_d), y = (y_1, ..., y_d) ∈ R^d that u_d(T, x) = g_d(x),

    |f(x_1) − f(y_1)| ≤ c|x_1 − y_1|,    |u_d(t, x)|² ≤ c[d^c + ‖x‖²],    (183)

    (∂/∂t u_d)(t, x) + ⟨(∇_x u_d)(t, x), µ_d(x)⟩ + ½ Tr(σ_d(x)[σ_d(x)]* (Hess_x u_d)(t, x)) + f(u_d(t, x)) = 0,    (184)

for all d ∈ N, v ∈ R^d, ε ∈ (0, 1] let σ̂_{d,ε} ∈ C(R^d, R^{d×d}), g_{d,ε}, µ̃_{d,ε}, σ̃_{d,ε,v} ∈ N satisfying for all d ∈ N, x, y, v ∈ R^d, ε ∈ (0, 1] that R(g_{d,ε}) ∈ C(R^d, R), R(µ̃_{d,ε}) ∈ C(R^d, R^d), R(σ̃_{d,ε,v}) ∈ C(R^d, R^d), R(σ̃_{d,ε,v}) = σ̂_{d,ε}(·)v, D(σ̃_{d,ε,v}) = D(σ̃_{d,ε,0}),

    |(R(g_{d,ε}))(x)|² + max_{i,j∈{1,...,d}} (|[(R(µ̃_{d,ε}))(0)]_i| + |(σ̂_{d,ε}(0))_{i,j}|) ≤ c[d^c + ‖x‖²],    (185)

    max{|g_d(x) − (R(g_{d,ε}))(x)|, ‖µ_d(x) − (R(µ̃_{d,ε}))(x)‖, ‖σ_d(x) − σ̂_{d,ε}(x)‖_F} ≤ εcd^c (1 + ‖x‖^c),    (186)

    max{|(R(g_{d,ε}))(x) − (R(g_{d,ε}))(y)|, ‖(R(µ̃_{d,ε}))(x) − (R(µ̃_{d,ε}))(y)‖, ‖σ̂_{d,ε}(x) − σ̂_{d,ε}(y)‖_F} ≤ c‖x − y‖,    (187)

    max{P(g_{d,ε}), P(µ̃_{d,ε}), P(σ̃_{d,ε,v})} ≤ cd^c ε^{−c}.    (188)
Then there exist (Ψ_{d,ε})_{d∈N,ε∈(0,1]} ⊆ N and η ∈ (0, ∞) such that for all d ∈ N, ε ∈ (0, 1] it holds that R(Ψ_{d,ε}) ∈ C(R^d, R), P(Ψ_{d,ε}) ≤ ηd^η ε^{−η}, and

    [∫_{[0,1]^d} |u_d(0, x) − (R(Ψ_{d,ε}))(x)|² dx]^{1/2} ≤ ε.    (189)
Proof. Throughout the proof assume without loss of generality that c ∈ [2, ∞) (otherwise adapt the application of Theorem 4.1 below accordingly) and let B = 2c.
Next, (185) ensures for all d ∈ N, x ∈ R^d, ε ∈ (0, 1] that

    |(R(g_{d,ε}))(x)| ≤ √c (d^c + ‖x‖²)^{1/2} ≤ (√c + √c T)(d^{2c} + ‖x‖²)^{1/2},    (193)

    ‖(R(µ̃_{d,ε}))(0)‖ ≤ [d² max_{i∈{1,...,d}} |[(R(µ̃_{d,ε}))(0)]_i|²]^{1/2} ≤ cd^{c+1},    (194)

and that

    ‖σ̂_{d,ε}(0)‖_F ≤ [d² max_{i,j∈{1,...,d}} |(σ̂_{d,ε}(0))_{i,j}|²]^{1/2} ≤ cd^{c+1}.    (195)
References
[1] Ankudinova, J., and Ehrhardt, M. On the numerical solution of nonlinear Black–Scholes
equations. Computers & Mathematics with Applications 56, 3 (2008), 799–812. Mathematical Models
in Life Sciences & Engineering.
[2] Beck, C., Becker, S., Cheridito, P., Jentzen, A., and Neufeld, A. Deep splitting method
for parabolic PDEs, 2021, arXiv: 1907.03452.
[3] Beck, C., E, W., and Jentzen, A. Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. Journal of Nonlinear Science 29, 4 (Jan 2019), 1563–1619.
[4] Beck, C., Hutzenthaler, M., and Jentzen, A. On nonlinear Feynman–Kac formulas for
viscosity solutions of semilinear parabolic partial differential equations. Stochastics and Dynamics
(Jul 2021), 2150048.
[5] Beck, C., Hutzenthaler, M., Jentzen, A., and Kuckuck, B. An overview on deep learning-based approximation methods for partial differential equations, 2021, arXiv: 2012.12348.
[6] Beck, C., Hutzenthaler, M., Jentzen, A., and Magnani, E. Full history recursive multilevel
Picard approximations for ordinary differential equations with expectations, 2021, arXiv: 2103.02350.
[7] Bergman, Y. Option pricing with differential interest rates. Review of Financial Studies 8 (04
1995), 475–500.
[8] Berner, J., Grohs, P., and Jentzen, A. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM Journal on Mathematics of Data Science 2, 3 (Jan 2020), 631–657.
[9] Cox, S., Hutzenthaler, M., and Jentzen, A. Local Lipschitz continuity in the initial value and strong completeness for nonlinear stochastic differential equations, 2021, arXiv: 1309.5595.
[10] Crandall, M., Ishii, H., and Lions, P.-L. User’s guide to viscosity solutions of second order
partial differential equations. Bulletin of the American Mathematical Society 27 (07 1992).
[11] E, W., Han, J., and Jentzen, A. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics 5, 4 (Nov 2017), 349–380.
[12] E, W., Hutzenthaler, M., Jentzen, A., and Kruse, T. Multilevel Picard iterations for solving
smooth semilinear parabolic heat equations, 2019, arXiv: 1607.03295.
[13] E, W., Hutzenthaler, M., Jentzen, A., and Kruse, T. On multilevel Picard numerical approximations for high-dimensional nonlinear parabolic partial differential equations and high-dimensional nonlinear backward stochastic differential equations. Journal of Scientific Computing 79, 3 (Mar 2019), 1534–1571.
[14] Fehrman, B., Gess, B., and Jentzen, A. Convergence rates for the stochastic gradient descent
method for non-convex objective functions, 2019, arXiv: 1904.01517.
[15] Grohs, P., and Herrmann, L. Deep neural network approximation for high-dimensional elliptic
PDEs with boundary conditions, 2020, arXiv: 2007.05384.
[16] Grohs, P., and Herrmann, L. Deep neural network approximation for high-dimensional parabolic
Hamilton-Jacobi-Bellman equations, 2021, arXiv: 2103.05744.
[17] Grohs, P., Hornung, F., Jentzen, A., and von Wurstemberger, P. A proof that artificial
neural networks overcome the curse of dimensionality in the numerical approximation of Black-
Scholes partial differential equations, 2018, arXiv: 1809.02362.
[18] Grohs, P., Hornung, F., Jentzen, A., and Zimmermann, P. Space-time error estimates for
deep neural network approximations for differential equations, 2019, arXiv: 1908.03833.
[19] Han, J., Jentzen, A., and E, W. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115, 34 (2018), 8505–8510, https://www.pnas.org/content/115/34/8505.full.pdf.
[20] Hutzenthaler, M., and Jentzen, A. On a perturbation theory and on strong convergence rates
for stochastic ordinary and partial differential equations with nonglobally monotone coefficients. The
Annals of Probability 48, 1 (Jan 2020).
[21] Hutzenthaler, M., Jentzen, A., Kruse, T., Anh Nguyen, T., and von Wurstemberger, P. Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 476, 2244 (Dec 2020), 20190630.
[22] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. Multilevel Picard approximations for high-dimensional semilinear second-order PDEs with Lipschitz nonlinearities, 2020, arXiv: 2009.02484.
[23] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. A proof that rectified deep
neural networks overcome the curse of dimensionality in the numerical approximation of semilinear
heat equations. SN Partial Differential Equations and Applications 1, 2 (Apr 2020).
[24] Jentzen, A., Salimova, D., and Welti, T. A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients, 2019, arXiv: 1809.07321.
[25] Reisinger, C., and Zhang, Y. Rectified deep neural networks overcome the curse of dimensionality
for nonsmooth value functions in zero-sum games of nonlinear stiff systems. Analysis and Applications
18 (05 2020).
[26] Rogers, L. C. G., and Williams, D. Diffusions, Markov Processes and Martingales, Volume 2: Itô Calculus, 2nd ed. Cambridge Mathematical Library. Cambridge University Press, Cambridge, 2000.
[27] Takahashi, A., and Yamada, T. Asymptotic Expansion and Deep Neural Networks Overcome the Curse of Dimensionality in the Numerical Approximation of Kolmogorov Partial Differential Equations with Nonlinear Coefficients. CIRJE F-Series CIRJE-F-1167, CIRJE, Faculty of Economics, University of Tokyo, May 2021.
[28] Walter, W. Ordinary Differential Equations. Graduate Texts in Mathematics. Springer, 1998.