
Commun. Comput. Phys., Vol. 31, No. 4, pp. 1272-1295, April 2022
doi: 10.4208/cicp.OA-2021-0186

A Rate of Convergence of Physics Informed Neural Networks
for the Linear Second Order Elliptic PDEs

Yuling Jiao1,2, Yanming Lai1, Dingwei Li1, Xiliang Lu1,2,
Fengru Wang1,2, Yang Wang3 and Jerry Zhijian Yang1,2,*

1 School of Mathematics and Statistics, Wuhan University, P.R. China.
2 Hubei Key Laboratory of Computational Science, Wuhan University, P.R. China.
3 Department of Mathematics, The Hong Kong University of Science and Technology,
Clear Water Bay, Kowloon, Hong Kong.

Received 16 September 2021; Accepted (in revised version) 16 March 2022

Abstract. In recent years, physics-informed neural networks (PINNs) have been shown empirically to be a powerful tool for solving PDEs. However, numerical analysis of PINNs is still largely missing. In this paper, we prove a convergence rate of PINNs for second order elliptic equations with Dirichlet boundary condition, by establishing upper bounds on the number of training samples and on the depth and width of the deep neural networks needed to achieve a desired accuracy. The error of PINNs is decomposed into an approximation error and a statistical error, where the approximation error is given in the C2 norm with ReLU3 networks (deep networks with activation function max{0,x3}) and the statistical error is estimated via Rademacher complexity. We derive a bound on the Rademacher complexity of the non-Lipschitz composition of the gradient norm with a ReLU3 network, which is of independent interest.

AMS subject classifications: 62G05, 65N12, 65N15, 68T07


Key words: PINNs, ReLU3 neural network, B-splines, Rademacher complexity.

1 Introduction
Classical numerical methods such as the finite element method are successful in solving low-dimensional PDEs, see e.g., [6, 7, 13, 24, 30]. However, these methods may encounter difficulties, in both theoretical analysis and numerical implementation, for

∗ Corresponding author. Email addresses: yulingjiaomath@whu.edu.cn (Y. Jiao), laiyanming@whu.edu.cn


(Y. Lai), lidingv@whu.edu.cn (D. Li), xllv.math@whu.edu.cn (X. Lu), wangfr@whu.edu.cn (F. Wang),
yangwang@ust.hk (Y. Wang), zjyang.math@whu.edu.cn (J. Z. Yang)


high-dimensional PDEs. Motivated by the great success of deep learning for high-dimensional data analysis in discriminative, generative and reinforcement learning [9, 11, 28], solving high-dimensional PDEs with deep neural networks has become a promising approach and has attracted much attention [2, 5, 10, 17, 25, 29, 32, 33]. Owing to the excellent approximation ability of deep neural networks, several numerical schemes have been proposed to solve PDEs with deep neural networks, including the deep Ritz method (DRM) [32], physics-informed neural networks (PINNs) [25], and the deep Galerkin method (DGM) [33]. The DRM is based on the variational form of the PDE, while PINNs and the DGM are based on residual minimization for the differential equation, see [2, 17, 25, 29], and can be extended to general PDEs [14, 16, 22, 23].
Although the above-mentioned deep PDE solvers work well empirically, rigorous numerical analysis of these methods is far from complete. The convergence rate of DRM with two-layer networks and with deep networks is studied in [8, 12, 19, 20], and the convergence of PINNs is given in [21, 26, 27]. In this work, we provide a nonasymptotic convergence rate for PINNs with ReLU3 networks, i.e., a quantitative error estimate with respect to the topological structure of the neural network (its depth and width) and the number of samples. Hence it gives a rule for determining these hyper-parameters so as to achieve a desired accuracy. Our contributions are summarized as follows.

Our contributions and main results


• We obtain an approximation result for ReLU3 networks in $C^2(\bar\Omega)$, see Theorem 3.1: for any $\bar u\in C^3(\bar\Omega)$ and any $\epsilon>0$, there exists a ReLU3 network $u_\phi$ with depth $\lceil\log_2 d\rceil+2$ and width $C(d,\|\bar u\|_{C^3(\bar\Omega)})\left(\frac{1}{\epsilon}\right)^d$ such that
$$\|\bar u-u_\phi\|_{C^2(\bar\Omega)}\le\epsilon,$$
where $d$, $\bar\Omega$ and $C(d,\|\bar u\|_{C^3(\bar\Omega)})$ stand for the dimension of $x$, the closure of the domain $\Omega$ and a constant depending only on $d$ and $\|\bar u\|_{C^3(\bar\Omega)}$, respectively.

• We establish an upper bound on the statistical error of PINNs by applying tools based on the pseudo-dimension. In particular, we give an upper bound on the Rademacher complexity of the derivatives of a ReLU3 network, which are non-Lipschitz compositions with the ReLU3 network, by calculating the pseudo-dimension of networks with ReLU, ReLU2 and ReLU3 activation functions, see Theorem 4.1. We prove that for any $\mathcal D,\mathcal W\in\mathbb N$ and $\epsilon>0$, if the number of training samples in PINNs is of order $O\!\left(\mathcal D^6\mathcal W^2(\mathcal D+\log\mathcal W)\left(\frac1\epsilon\right)^{2+\delta}\right)$, where $\delta$ is an arbitrarily small positive number, then the statistical error satisfies
$$\mathbb E_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal P}\big|\mathcal L(u)-\widehat{\mathcal L}(u)\big|\le\epsilon,$$
where $\mathcal L$ and $\widehat{\mathcal L}$ are the loss functions defined in (2.2) and (2.3), respectively.

• Based on the above bounds on the approximation error and the statistical error, we establish a nonasymptotic convergence rate of PINNs for second order elliptic equations with Dirichlet boundary condition, see Theorem 5.1: for any $\epsilon>0$, if we set the depth and width of the ReLU3 network as
$$\mathcal D=O(\lceil\log_2 d\rceil+2),\qquad \mathcal W=O\!\left(\left(\frac1\epsilon\right)^d\right),$$
and set the number of training samples used in PINNs as $O\!\left(\left(\frac1\epsilon\right)^{2d+4+\delta}\right)$, where $\delta$ is an arbitrarily small positive number, then
$$\mathbb E_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\big\|\widehat u_\phi-u^*\big\|_{H^{\frac12}(\Omega)}\le\epsilon,$$
where $\widehat u_\phi$ is the minimizer of (2.4).

The paper is organized as follows. In Section 2 we describe the problem setting and introduce the error decomposition for PINNs. In Section 3 the approximation error of ReLU3 networks in Sobolev norms is bounded, based on approximation results for B-splines. In Section 4 we provide the statistical error estimate by using bounds on the Rademacher complexity. In Section 5 the convergence rate of PINNs is shown. A conclusion and a short discussion are given in Section 6.

2 The PINNs method and error decomposition


2.1 Preliminaries and the PINNs method
We first give the notation for neural networks and function spaces that will be used later. Let $\mathcal D\in\mathbb N$. A function $f$ is called a neural network if it is implemented by
$$f_0(x)=x,$$
$$f_\ell(x)=\varrho_\ell(A_\ell f_{\ell-1}+b_\ell)=\big(\varrho_i^{(\ell)}((A_\ell f_{\ell-1}+b_\ell)_i)\big)\quad\text{for }\ell=1,\cdots,\mathcal D-1,$$
$$f:=f_{\mathcal D}(x)=A_{\mathcal D}f_{\mathcal D-1}+b_{\mathcal D},$$
where $A_\ell=\big(a_{ij}^{(\ell)}\big)\in\mathbb R^{n_\ell\times n_{\ell-1}}$, $b_\ell=\big(b_i^{(\ell)}\big)\in\mathbb R^{n_\ell}$ and $\varrho_\ell:\mathbb R^{n_\ell}\to\mathbb R^{n_\ell}$ is the activation function. The hyper-parameters $\mathcal D$ and $\mathcal W:=\max\{n_\ell,\ \ell=0,\cdots,\mathcal D\}$ are called the depth and the width of the network, respectively. Also, $\sum_{\ell=1}^{\mathcal D}n_\ell$ is called the number of units of $f$, and $\phi=\{A_\ell,b_\ell\}_\ell$ are called the weight parameters. Let $\Phi$ be a set of activation functions and $X$ be a Banach space; we define the neural network function class by
$$\mathcal N(\mathcal D,\mathcal W,\{\|\cdot\|_X,\mathcal B\},\Phi):=\big\{f:\ f\text{ is implemented by a neural network with depth }\mathcal D\text{ and width }\mathcal W,\ \|f\|_X\le\mathcal B,\ \text{and}\ \varrho_i^{(\ell)}\in\Phi\ \text{for each }i\text{ and }\ell\big\}.$$
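For illustration, the construction above can be written out in code. The following is a minimal PyTorch sketch (our own illustration, not the authors' implementation); the class name ReLU3Net and all hyper-parameter values are ours.

```python
# A minimal sketch of a fully connected network of depth D and width W with the
# ReLU^3 activation rho(x) = max{0, x}^3, following the definition of
# N(D, W, {||.||, B}, {ReLU^3}) above.
import torch
import torch.nn as nn


def relu3(x: torch.Tensor) -> torch.Tensor:
    """ReLU^3 activation: rho(x) = max(0, x)^3, which is twice differentiable."""
    return torch.relu(x) ** 3


class ReLU3Net(nn.Module):
    def __init__(self, dim_in: int, width: int, depth: int, dim_out: int = 1):
        super().__init__()
        # depth counts the affine maps A_l f_{l-1} + b_l; the activation is
        # applied after the first depth-1 of them, the last layer is affine.
        sizes = [dim_in] + [width] * (depth - 1) + [dim_out]
        self.layers = nn.ModuleList(
            nn.Linear(sizes[l], sizes[l + 1]) for l in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers[:-1]:
            x = relu3(layer(x))          # f_l = rho(A_l f_{l-1} + b_l)
        return self.layers[-1](x)        # f_D = A_D f_{D-1} + b_D


# Example: d = 2, depth ceil(log2 d) + 2 = 3, some width W
u_phi = ReLU3Net(dim_in=2, width=64, depth=3)
```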

Then we introduce the function spaces and the partial differential equation that we are interested in. The governing equation is posed on an open bounded domain $\Omega\subset\mathbb R^d$ with smooth boundary $\partial\Omega$, and without loss of generality we may assume that $\Omega\subset[0,1]^d$. Let $\alpha=(\alpha_1,\cdots,\alpha_d)$ be a multi-index with $|\alpha|:=\sum_{i=1}^d\alpha_i$ and let $s$ be a nonnegative integer. The standard function spaces (spaces of continuous functions, $L^p$ spaces and Sobolev spaces) are defined below.

$$C(\Omega):=\{\text{all continuous functions defined on }\Omega\},$$
$$C^s(\Omega):=\{f:\Omega\to\mathbb R\mid D^\alpha f\in C(\Omega),\ |\alpha|\le s\},$$
$$C(\bar\Omega):=\{\text{all continuous functions defined on }\bar\Omega\},\qquad \|f\|_{C(\bar\Omega)}:=\max_{x\in\bar\Omega}|f(x)|,$$
$$C^s(\bar\Omega):=\{f:\bar\Omega\to\mathbb R\mid D^\alpha f\in C(\bar\Omega),\ |\alpha|\le s\},\qquad \|f\|_{C^s(\bar\Omega)}:=\max_{x\in\bar\Omega,\,|\alpha|\le s}|D^\alpha f(x)|,$$
$$L^p(\Omega):=\Big\{f:\Omega\to\mathbb R\ \Big|\ \int_\Omega|f|^p\,dx<\infty\Big\},\qquad \|f\|_{L^p(\Omega)}:=\Big(\int_\Omega|f|^p\,dx\Big)^{1/p},\quad \forall p\in[1,\infty),$$
$$L^\infty(\Omega):=\{f:\Omega\to\mathbb R\mid \exists C>0\ \text{s.t.}\ |f|\le C\ \text{a.e.}\},\qquad \|f\|_{L^\infty(\Omega)}:=\inf\{C\mid |f|\le C\ \text{a.e.}\},$$
$$W^{s,p}(\Omega):=\{f:\Omega\to\mathbb R\mid D^\alpha f\in L^p(\Omega),\ |\alpha|\le s\},\qquad \|f\|_{W^{s,p}(\Omega)}:=\Big(\sum_{|\alpha|\le s}\|D^\alpha f\|_{L^p(\Omega)}^p\Big)^{1/p}.$$

If $s$ is a nonnegative real number, the fractional Sobolev space $W^{s,p}(\Omega)$ can be defined as follows: setting $\theta=s-\lfloor s\rfloor$,
$$W^{s,p}(\Omega):=\Big\{f:\Omega\to\mathbb R\ \Big|\ \int_\Omega\int_\Omega\frac{|D^\alpha f(x)-D^\alpha f(y)|^p}{|x-y|^{\theta p+d}}\,dx\,dy<\infty,\ \forall|\alpha|=\lfloor s\rfloor\Big\},$$
$$\|f\|_{W^{s,p}(\Omega)}:=\Big(\|f\|_{W^{\lfloor s\rfloor,p}(\Omega)}^p+\sum_{|\alpha|=\lfloor s\rfloor}\int_\Omega\int_\Omega\frac{|D^\alpha f(x)-D^\alpha f(y)|^p}{|x-y|^{\theta p+d}}\,dx\,dy\Big)^{1/p}.$$

Let $C_0^\infty(\Omega)$ be the set of smooth functions with compact support in $\Omega$, and let $W_0^{s,p}(\Omega)$ be the completion of $C_0^\infty(\Omega)$ in $W^{s,p}(\Omega)$. For $s<0$, $W^{s,p}(\Omega)$ is the dual space of $W_0^{-s,q}(\Omega)$ with $q$ satisfying $\frac1p+\frac1q=1$. When $p=2$, $W^{s,p}(\Omega)$ is a Hilbert space, also denoted by $H^s(\Omega)$. Throughout, the constant $C$ may vary from place to place but is always independent of the hyper-parameters of the neural networks.
We will consider the following linear second order elliptic equation with Dirichlet boundary condition, where $\frac{\partial u}{\partial x_i}(x)$ and $\frac{\partial^2 u}{\partial x_i\partial x_j}(x)$ are denoted by $u_{x_i}$ and $u_{x_ix_j}$, respectively:
$$
\begin{cases}
-\sum_{i,j=1}^{d}a_{ij}u_{x_ix_j}+\sum_{i=1}^{d}b_iu_{x_i}+cu=f, & \text{in }\Omega,\\[2pt]
eu=g, & \text{on }\partial\Omega,
\end{cases}\tag{2.1}
$$
where $a_{ij}\in C(\bar\Omega)$, $b_i,c\in L^\infty(\Omega)$, $f\in L^2(\Omega)$, $e\in C(\partial\Omega)$, $g\in L^2(\partial\Omega)$, and the strict ellipticity condition holds, i.e., there exists a constant $\lambda>0$ such that $\sum_{i,j}a_{ij}(x)\xi_i\xi_j\ge\lambda|\xi|^2$ for all $x\in\Omega$ and $\xi\in\mathbb R^d$.

Define the constants $A=\max_{i,j}\|a_{ij}\|_{C(\bar\Omega)}$, $B=\max_i\|b_i\|_{L^\infty(\Omega)}$, $C=\|c\|_{L^\infty(\Omega)}$, $E=\|e\|_{C(\partial\Omega)}$, $F=\|f\|_{L^2(\Omega)}$, $G=\|g\|_{L^2(\partial\Omega)}$, and set $Z=\{A,B,C,E,F,G\}$.
Instead of solving problem (2.1) directly, we consider a minimization problem with the loss functional $\mathcal L$ on $C^2(\Omega)\cap C(\bar\Omega)$:
$$\mathcal L(u):=\int_\Omega\Big(-\sum_{i,j=1}^{d}a_{ij}u_{x_ix_j}+\sum_{i=1}^{d}b_iu_{x_i}+cu-f\Big)^2dx+\int_{\partial\Omega}(eu-g)^2\,ds.$$

Assumption 2.1. Assume that (2.1) has a unique strong solution $u^*\in C^2(\Omega)\cap C(\bar\Omega)$.

If Assumption 2.1 holds, then $u^*$ is also the unique minimizer of the loss functional $\mathcal L$. Let $|\Omega|$ and $|\partial\Omega|$ denote the measures of $\Omega$ and of its boundary, respectively, i.e., $|\Omega|:=\int_\Omega 1\,dx$ and $|\partial\Omega|:=\int_{\partial\Omega}1\,ds$. Then $\mathcal L$ can be equivalently written as
$$\mathcal L(u)=|\Omega|\,\mathbb E_{X\sim U(\Omega)}\Big(-\sum_{i,j=1}^{d}a_{ij}(X)u_{x_ix_j}(X)+\sum_{i=1}^{d}b_i(X)u_{x_i}(X)+c(X)u(X)-f(X)\Big)^2+|\partial\Omega|\,\mathbb E_{Y\sim U(\partial\Omega)}\big(e(Y)u(Y)-g(Y)\big)^2,\tag{2.2}$$
where $U(\Omega)$ and $U(\partial\Omega)$ are the uniform distributions on $\Omega$ and $\partial\Omega$, respectively.
To minimize $\mathcal L(u)$ numerically, a standard discrete version of $\mathcal L$ is given by
$$\widehat{\mathcal L}(u)=\frac{|\Omega|}{N}\sum_{k=1}^{N}\Big(-\sum_{i,j=1}^{d}a_{ij}(X_k)u_{x_ix_j}(X_k)+\sum_{i=1}^{d}b_i(X_k)u_{x_i}(X_k)+c(X_k)u(X_k)-f(X_k)\Big)^2+\frac{|\partial\Omega|}{M}\sum_{k=1}^{M}\big(e(Y_k)u(Y_k)-g(Y_k)\big)^2,\tag{2.3}$$
where $\{X_k\}_{k=1}^{N}$ and $\{Y_k\}_{k=1}^{M}$ are i.i.d. random samples from the uniform distributions $U(\Omega)$ on $\Omega$ and $U(\partial\Omega)$ on $\partial\Omega$, respectively.
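As an illustration of (2.3), the following PyTorch sketch (ours, not the authors' code) assembles the empirical loss for the special case $a_{ij}=\delta_{ij}$, $b_i=0$, $c=1$, $e=1$ on the unit cube $\Omega=(0,1)^d$; the callables u_net, f and g are assumed, with u_net mapping a batch of points to a column of values.

```python
# Monte Carlo discretization of the PINN loss (2.3) for -Laplace(u) + u = f in
# Omega = (0,1)^d, u = g on the boundary. Here |Omega| = 1 and |bdry| = 2d.
import torch


def empirical_pinn_loss(u_net, f, g, N=1000, M=200, d=2):
    # Interior samples X_k ~ U(Omega)
    X = torch.rand(N, d, requires_grad=True)
    u = u_net(X)                                              # shape (N, 1)
    grad_u = torch.autograd.grad(u.sum(), X, create_graph=True)[0]
    # Laplacian via a second round of autograd, coordinate by coordinate
    lap_u = sum(
        torch.autograd.grad(grad_u[:, i].sum(), X, create_graph=True)[0][:, i]
        for i in range(d)
    )
    residual = -lap_u + u.squeeze(-1) - f(X)
    interior = residual.pow(2).mean()                         # |Omega| = 1

    # Boundary samples Y_k ~ U(boundary of unit cube): pick a face uniformly
    Y = torch.rand(M, d)
    side = torch.randint(0, d, (M,))
    Y[torch.arange(M), side] = torch.randint(0, 2, (M,)).float()
    boundary = 2.0 * d * (u_net(Y).squeeze(-1) - g(Y)).pow(2).mean()
    return interior + boundary
```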
The PINNs method considers the minimization problem with respect to $\widehat{\mathcal L}$:
$$\min_{u_\phi\in\mathcal P}\widehat{\mathcal L}(u_\phi),\tag{2.4}$$
where the admissible set $\mathcal P$ is a deep neural network function class parameterized by $\phi$. Assume that there exists at least one minimizer of (2.4), which we denote by $\widehat u_\phi$; we will estimate the error between $u^*$ and $\widehat u_\phi$ for a carefully chosen admissible set $\mathcal P$. The definition of $\mathcal P$ will be given at the beginning of Section 3.
Remark 2.1. In practice, it is difficult to compute $\widehat u_\phi$ exactly due to the nonlinear structure of $\mathcal P$. One usually applies a (random) solver $\mathcal A$, such as SGD, to (2.4) and takes its output $u_{\phi_{\mathcal A}}$ as the final solution. In this work we ignore the optimization error and only consider the error between $u^*$ and $\widehat u_\phi$.

Remark 2.2. The existence of a minimizer of problem (2.4) depends on the choice of the admissible set $\mathcal P$. In case a minimizer does not exist for some admissible set $\mathcal P$, we may instead consider a $\gamma$-optimal solution $\widehat u_\phi$, i.e., one satisfying $\widehat{\mathcal L}(\widehat u_\phi)\le\inf_{u_\phi\in\mathcal P}\widehat{\mathcal L}(u_\phi)+\gamma$ for an arbitrarily small positive $\gamma$. The main results in this work remain true for such a $\gamma$-optimal solution.

2.2 Error decomposition

The a priori estimate for Eq. (2.1) can be found in [1]; we list it below for completeness.

Lemma 2.1 ([1]). For $u\in H^{\frac12}(\Omega)\cap L^2(\partial\Omega)$,
$$\|u\|_{H^{\frac12}(\Omega)}\le C\Big\|-\sum_{i,j=1}^{d}a_{ij}u_{x_ix_j}+\sum_{i=1}^{d}b_iu_{x_i}+cu\Big\|_{H^{-\frac32}(\Omega)}+C\|eu\|_{L^2(\partial\Omega)}.$$

Next we decompose the error into the approximation error and the statistical error.

Proposition 2.1. Assume that $\mathcal P\subset H^2(\Omega)\cap C(\bar\Omega)$. Then
$$\|\widehat u_\phi-u^*\|^2_{H^{\frac12}(\Omega)}\le\underbrace{C\inf_{\bar u\in\mathcal P}\Big[3\max\{2d^2A^2,dB^2,C^2\}\|\bar u-u^*\|^2_{H^2(\Omega)}+|\partial\Omega|E^2\|\bar u-u^*\|^2_{C(\partial\Omega)}\Big]}_{\mathcal E_{app}}+\underbrace{C\sup_{u\in\mathcal P}\big|\mathcal L(u)-\widehat{\mathcal L}(u)\big|}_{\mathcal E_{sta}}.$$

Proof. For any $\bar u\in\mathcal P$, we have
$$\mathcal L(\widehat u_\phi)-\mathcal L(u^*)=\big[\mathcal L(\widehat u_\phi)-\widehat{\mathcal L}(\widehat u_\phi)\big]+\big[\widehat{\mathcal L}(\widehat u_\phi)-\widehat{\mathcal L}(\bar u)\big]+\big[\widehat{\mathcal L}(\bar u)-\mathcal L(\bar u)\big]+\big[\mathcal L(\bar u)-\mathcal L(u^*)\big]$$
$$\le\big[\mathcal L(\bar u)-\mathcal L(u^*)\big]+2\sup_{u\in\mathcal P}\big|\mathcal L(u)-\widehat{\mathcal L}(u)\big|,$$
where the last step is due to the fact that $\widehat{\mathcal L}(\widehat u_\phi)-\widehat{\mathcal L}(\bar u)\le0$. Since $\bar u$ can be any element of $\mathcal P$, taking the infimum over $\bar u$ on both sides gives
$$\mathcal L(\widehat u_\phi)-\mathcal L(u^*)\le\inf_{\bar u\in\mathcal P}\big[\mathcal L(\bar u)-\mathcal L(u^*)\big]+2\sup_{u\in\mathcal P}\big|\mathcal L(u)-\widehat{\mathcal L}(u)\big|.\tag{2.5}$$

Since $u^*$ is the strong solution of Eq. (2.1), $\mathcal L(u^*)=0$, and for all $u\in\mathcal P$ we have
$$\mathcal L(u)-\mathcal L(u^*)=\int_\Omega\Big(-\sum_{i,j=1}^{d}a_{ij}u_{x_ix_j}+\sum_{i=1}^{d}b_iu_{x_i}+cu-f\Big)^2dx+\int_{\partial\Omega}(eu-g)^2\,ds$$
$$=\int_\Omega\Big(-\sum_{i,j=1}^{d}a_{ij}(u-u^*)_{x_ix_j}+\sum_{i=1}^{d}b_i(u-u^*)_{x_i}+c(u-u^*)\Big)^2dx+\int_{\partial\Omega}e^2(u-u^*)^2\,ds$$
$$\le3\int_\Omega\Big[\Big(\sum_{i,j=1}^{d}a_{ij}(u-u^*)_{x_ix_j}\Big)^2+\Big(\sum_{i=1}^{d}b_i(u-u^*)_{x_i}\Big)^2+\big(c(u-u^*)\big)^2\Big]dx+\int_{\partial\Omega}e^2(u-u^*)^2\,ds$$
$$\le3\max\{2d^2A^2,dB^2,C^2\}\|u-u^*\|^2_{H^2(\Omega)}+|\partial\Omega|E^2\|u-u^*\|^2_{C(\partial\Omega)}.\tag{2.6}$$
On the other hand, by Lemma 2.1 and the inequality $\|\cdot\|_{H^{-\frac32}(\Omega)}\le C\|\cdot\|_{L^2(\Omega)}$, we have for all $u\in H^{\frac12}(\Omega)\cap L^2(\partial\Omega)$
$$\mathcal L(u)-\mathcal L(u^*)=\int_\Omega\Big(-\sum_{i,j=1}^{d}a_{ij}(u-u^*)_{x_ix_j}+\sum_{i=1}^{d}b_i(u-u^*)_{x_i}+c(u-u^*)\Big)^2dx+\int_{\partial\Omega}e^2(u-u^*)^2\,ds\ge C\|u-u^*\|^2_{H^{\frac12}(\Omega)}.\tag{2.7}$$
Combining (2.5)-(2.7) yields the result.

The approximation error $\mathcal E_{app}$ describes the expressive power of the parameterized function class $\mathcal P$ in the $H^2(\Omega)$ and $C(\bar\Omega)$ norms; it corresponds to the approximation error in FEM controlled by Céa's lemma [7]. The statistical error $\mathcal E_{sta}$ is caused by the Monte Carlo discretization of $\mathcal L(\cdot)$ defined in (2.2) by $\widehat{\mathcal L}(\cdot)$ in (2.3).

3 Approximation error

We choose $\mathcal P$ to be a class of ReLU3 networks, so that $\mathcal P\subset H^2(\Omega)\cap C(\bar\Omega)$. More precisely,
$$\mathcal P=\mathcal N(\mathcal D,\mathcal W,\{\|\cdot\|_{C^2(\bar\Omega)},\mathcal B\},\{\mathrm{ReLU}^3\}),$$
where the hyper-parameter $\mathcal B=2\|u^*\|_{C^2(\bar\Omega)}$ and $\{\mathcal D,\mathcal W\}$ will be specified in later discussions (i.e., Theorem 5.1) to ensure the desired accuracy. A ReLU3 network refers to a neural network whose activation function $\rho(x)$ is given by
$$\rho(x)=\begin{cases}x^3, & x\ge0,\\ 0, & \text{otherwise}.\end{cases}\tag{3.1}$$
It can be seen that $\rho(x)$ is twice differentiable.
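For concreteness (a short added remark), the first two derivatives can be written out explicitly:
$$\rho'(x)=3\max\{0,x\}^2=3\,\mathrm{ReLU}^2(x),\qquad \rho''(x)=6\max\{0,x\}=6\,\mathrm{ReLU}(x),$$
both of which are continuous on $\mathbb R$; the relation $\rho'=3\,\mathrm{ReLU}^2$ is exactly the one used later in the proof of Proposition 4.2.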


Theorem 3.1. For any $\bar u\in C^3(\bar\Omega)$ and any $\epsilon>0$, there exists a ReLU3 network $u_\phi$ with depth $\lceil\log_2 d\rceil+2$ and width $C(d,\|\bar u\|_{C^3(\bar\Omega)})\left(\frac1\epsilon\right)^d$ such that $\|\bar u-u_\phi\|_{C^2(\bar\Omega)}\le\epsilon$.

Proof. This is a special case of the results in [15].

4 Statistical error

In this section we bound the statistical error for the parameterized function class $\mathcal P$ by estimating the Rademacher complexity of the non-Lipschitz compositions of a ReLU3 network with its partial derivatives. The technique used here may also be helpful for analyzing the statistical errors of other deep PDE solvers, where the main difficulty is likewise to estimate the Rademacher complexity of the non-Lipschitz compositions induced by the derivative operators.
First, by a careful computation and the triangle inequality, we have the following lemma.
Lemma 4.1. There holds
$$\mathbb E_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal P}\big|\mathcal L(u)-\widehat{\mathcal L}(u)\big|\le\sum_{j=1}^{13}\mathbb E_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal P}\big|\mathcal L_j(u)-\widehat{\mathcal L}_j(u)\big|,$$
where
$$\mathcal L_1(u)=|\Omega|\,\mathbb E_{X\sim U(\Omega)}\Big(\sum_{i,j=1}^{d}a_{ij}(X)u_{x_ix_j}(X)\Big)^2,\qquad \mathcal L_2(u)=|\Omega|\,\mathbb E_{X\sim U(\Omega)}\Big(\sum_{i=1}^{d}b_i(X)u_{x_i}(X)\Big)^2,$$
$$\mathcal L_3(u)=|\Omega|\,\mathbb E_{X\sim U(\Omega)}\big(c(X)u(X)\big)^2,\qquad \mathcal L_4(u)=|\Omega|\,\mathbb E_{X\sim U(\Omega)}f(X)^2,$$
$$\mathcal L_5(u)=-2|\Omega|\,\mathbb E_{X\sim U(\Omega)}\Big(\sum_{i,j=1}^{d}a_{ij}(X)u_{x_ix_j}(X)\Big)\Big(\sum_{i=1}^{d}b_i(X)u_{x_i}(X)\Big),$$
$$\mathcal L_6(u)=-2|\Omega|\,\mathbb E_{X\sim U(\Omega)}\Big(\sum_{i,j=1}^{d}a_{ij}(X)u_{x_ix_j}(X)\Big)\cdot c(X)u(X),$$
$$\mathcal L_7(u)=2|\Omega|\,\mathbb E_{X\sim U(\Omega)}\Big(\sum_{i,j=1}^{d}a_{ij}(X)u_{x_ix_j}(X)\Big)\cdot f(X),$$
$$\mathcal L_8(u)=2|\Omega|\,\mathbb E_{X\sim U(\Omega)}\Big(\sum_{i=1}^{d}b_i(X)u_{x_i}(X)\Big)\cdot c(X)u(X),$$
$$\mathcal L_9(u)=-2|\Omega|\,\mathbb E_{X\sim U(\Omega)}\Big(\sum_{i=1}^{d}b_i(X)u_{x_i}(X)\Big)\cdot f(X),$$
$$\mathcal L_{10}(u)=-2|\Omega|\,\mathbb E_{X\sim U(\Omega)}c(X)u(X)f(X),\qquad \mathcal L_{11}(u)=|\partial\Omega|\,\mathbb E_{Y\sim U(\partial\Omega)}\big(e(Y)u(Y)\big)^2,$$
$$\mathcal L_{12}(u)=|\partial\Omega|\,\mathbb E_{Y\sim U(\partial\Omega)}\big(g(Y)\big)^2,\qquad \mathcal L_{13}(u)=-2|\partial\Omega|\,\mathbb E_{Y\sim U(\partial\Omega)}e(Y)u(Y)g(Y),$$
and $\widehat{\mathcal L}_j(u)$ is the discrete sample version of $\mathcal L_j(u)$ obtained by replacing the expectation with the corresponding sample average, $j=1,\cdots,13$.

4.1 Rademacher complexity, covering number and pseudo-dimension

By the technique of symmetrization, we can bound the difference between the continuous loss $\mathcal L_i$ and the empirical loss $\widehat{\mathcal L}_i$ via the Rademacher complexity.

Definition 4.1 ([31]). The Rademacher complexity of a set $A\subseteq\mathbb R^N$ is defined by
$$\mathfrak R(A)=\mathbb E_{\{\sigma_k\}_{k=1}^N}\Big[\sup_{a\in A}\frac1N\sum_{k=1}^{N}\sigma_k a_k\Big],$$
where $\{\sigma_k\}_{k=1}^N$ are $N$ i.i.d. Rademacher variables with $\mathbb P(\sigma_k=1)=\mathbb P(\sigma_k=-1)=\frac12$. Let $\Omega$ be a set and $\mathcal F$ be a class of functions mapping $\Omega$ to $\mathbb R$. Let $P$ be a probability distribution over $\Omega$ and $\{X_k\}_{k=1}^N$ be i.i.d. samples from $P$. The Rademacher complexity of $\mathcal F$ associated with the distribution $P$ and sample size $N$ is defined by
$$\mathfrak R_{P,N}(\mathcal F)=\mathbb E_{\{X_k,\sigma_k\}_{k=1}^N}\Big[\sup_{u\in\mathcal F}\frac1N\sum_{k=1}^{N}\sigma_k u(X_k)\Big].$$
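As a toy numerical illustration of Definition 4.1 (ours, not from the paper), the Rademacher complexity of the simple class $\mathcal F=\{x\mapsto wx:|w|\le1\}$ on $\Omega=[0,1]$ with $P$ uniform can be estimated by Monte Carlo over both the samples and the Rademacher variables:

```python
# Monte Carlo estimate of R_{P,N}(F) for F = {x -> w*x : |w| <= 1}, P = U(0,1).
import numpy as np

rng = np.random.default_rng(0)


def rademacher_complexity(N=100, trials=2000):
    vals = []
    for _ in range(trials):
        X = rng.uniform(0.0, 1.0, size=N)           # X_k ~ U(0,1)
        sigma = rng.choice([-1.0, 1.0], size=N)     # Rademacher variables
        # sup_{|w|<=1} (1/N) sum_k sigma_k * w * X_k = |(1/N) sum_k sigma_k X_k|
        vals.append(abs(np.mean(sigma * X)))
    return np.mean(vals)


print(rademacher_complexity())   # decays roughly like O(1/sqrt(N))
```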

For the Rademacher complexity, we have the following structural result.

Lemma 4.2. Let $\Omega$ be a set and $P$ be a probability distribution over $\Omega$, and let $N\in\mathbb N$. Assume that $w:\Omega\to\mathbb R$ satisfies $|w(x)|\le\mathcal B$ for all $x\in\Omega$. Then for any function class $\mathcal F$ mapping $\Omega$ to $\mathbb R$, there holds
$$\mathfrak R_{P,N}(w(x)\mathcal F)\le\mathcal B\,\mathfrak R_{P,N}(\mathcal F),$$
where $w(x)\mathcal F:=\{\bar u:\ \bar u(x)=w(x)u(x),\ u\in\mathcal F\}$.



Proof.
$$\mathfrak R_{P,N}(w(x)\mathcal F)=\mathbb E_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal F}\frac1N\sum_{k=1}^{N}\sigma_kw(X_k)u(X_k)$$
$$=\frac1{2N}\mathbb E_{\{X_k\}_{k=1}^N}\mathbb E_{\{\sigma_k\}_{k=2}^N}\Big[\sup_{u\in\mathcal F}\Big(w(X_1)u(X_1)+\sum_{k=2}^{N}\sigma_kw(X_k)u(X_k)\Big)+\sup_{u\in\mathcal F}\Big(-w(X_1)u(X_1)+\sum_{k=2}^{N}\sigma_kw(X_k)u(X_k)\Big)\Big]$$
$$=\frac1{2N}\mathbb E_{\{X_k\}_{k=1}^N}\mathbb E_{\{\sigma_k\}_{k=2}^N}\sup_{u,u'\in\mathcal F}\Big(w(X_1)\big[u(X_1)-u'(X_1)\big]+\sum_{k=2}^{N}\sigma_kw(X_k)u(X_k)+\sum_{k=2}^{N}\sigma_kw(X_k)u'(X_k)\Big)$$
$$\le\frac1{2N}\mathbb E_{\{X_k\}_{k=1}^N}\mathbb E_{\{\sigma_k\}_{k=2}^N}\sup_{u,u'\in\mathcal F}\Big(\mathcal B\big|u(X_1)-u'(X_1)\big|+\sum_{k=2}^{N}\sigma_kw(X_k)u(X_k)+\sum_{k=2}^{N}\sigma_kw(X_k)u'(X_k)\Big)$$
$$=\frac1{2N}\mathbb E_{\{X_k\}_{k=1}^N}\mathbb E_{\{\sigma_k\}_{k=2}^N}\sup_{u,u'\in\mathcal F}\Big(\mathcal B\big[u(X_1)-u'(X_1)\big]+\sum_{k=2}^{N}\sigma_kw(X_k)u(X_k)+\sum_{k=2}^{N}\sigma_kw(X_k)u'(X_k)\Big)$$
$$=\mathbb E_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal F}\frac1N\Big(\sigma_1\mathcal B\,u(X_1)+\sum_{k=2}^{N}\sigma_kw(X_k)u(X_k)\Big)$$
$$\le\cdots\le\frac{\mathcal B}N\,\mathbb E_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal F}\sum_{k=1}^{N}\sigma_ku(X_k)=\mathcal B\,\mathfrak R_{P,N}(\mathcal F).$$
This completes the proof.

Lemma 4.3. Let $\{X_k\}_{k=1}^N$ be i.i.d. samples from $U(\Omega)$. Then we have
$$\mathbb E_{\{X_k\}_{k=1}^N}\sup_{u\in\mathcal P}\big|\mathcal L_j(u)-\widehat{\mathcal L}_j(u)\big|\le C_{d,Z}\,\mathfrak R_{U(\Omega),N}(\mathcal F_j),\qquad j=1,\cdots,13,$$
where $C_{d,Z}$ is a constant depending only on $d$, $Z$, $|\Omega|$ and $|\partial\Omega|$, and
$$\mathcal F_1=\{\pm f:\Omega\to\mathbb R\mid\exists u\in\mathcal P\ \text{and}\ 1\le i,j,i',j'\le d\ \text{s.t.}\ f(x)=u_{x_ix_j}(x)u_{x_{i'}x_{j'}}(x)\},$$
$$\mathcal F_2=\{\pm f:\Omega\to\mathbb R\mid\exists u\in\mathcal P\ \text{and}\ 1\le i,i'\le d\ \text{s.t.}\ f(x)=u_{x_i}(x)u_{x_{i'}}(x)\},$$
$$\mathcal F_3=\{\pm f:\Omega\to\mathbb R\mid\exists u\in\mathcal P\ \text{s.t.}\ f(x)=u(x)^2\},\qquad \mathcal F_4=\{f:\Omega\to\mathbb R\mid f\equiv-1,\,0\ \text{or}\ 1\},$$
$$\mathcal F_5=\{\pm f:\Omega\to\mathbb R\mid\exists u\in\mathcal P\ \text{and}\ 1\le i,j,i'\le d\ \text{s.t.}\ f(x)=u_{x_ix_j}(x)u_{x_{i'}}(x)\},$$
$$\mathcal F_6=\{\pm f:\Omega\to\mathbb R\mid\exists u\in\mathcal P\ \text{and}\ 1\le i,j\le d\ \text{s.t.}\ f(x)=u_{x_ix_j}(x)u(x)\},$$
$$\mathcal F_7=\{\pm f:\Omega\to\mathbb R\mid\exists u\in\mathcal P\ \text{and}\ 1\le i,j\le d\ \text{s.t.}\ f(x)=u_{x_ix_j}(x)\},$$
$$\mathcal F_8=\{\pm f:\Omega\to\mathbb R\mid\exists u\in\mathcal P\ \text{and}\ 1\le i\le d\ \text{s.t.}\ f(x)=u_{x_i}(x)u(x)\},$$
$$\mathcal F_9=\{\pm f:\Omega\to\mathbb R\mid\exists u\in\mathcal P\ \text{and}\ 1\le i\le d\ \text{s.t.}\ f(x)=u_{x_i}(x)\},$$
$$\mathcal F_{10}=\{\pm f:\Omega\to\mathbb R\mid\exists u\in\mathcal P\ \text{s.t.}\ f(x)=u(x)\},$$
$$\mathcal F_{11}=\{\pm f:\partial\Omega\to\mathbb R\mid\exists u\in\mathcal P\ \text{s.t.}\ f=u^2|_{\partial\Omega}\},\qquad \mathcal F_{12}=\{f:\partial\Omega\to\mathbb R\mid f\equiv-1,\,0\ \text{or}\ 1\},$$
$$\mathcal F_{13}=\{\pm f:\partial\Omega\to\mathbb R\mid\exists u\in\mathcal P\ \text{s.t.}\ f=u|_{\partial\Omega}\}.$$

Proof. We only give the proof of $\mathbb E_{\{X_k\}_{k=1}^N}\sup_{u\in\mathcal P}\big|\mathcal L_1(u)-\widehat{\mathcal L}_1(u)\big|\le2|\Omega|A^2d^4\,\mathfrak R_{U(\Omega),N}(\mathcal F_1)$, since the other inequalities can be shown similarly. Take $\{\widetilde X_k\}_{k=1}^N$ to be an independent copy of $\{X_k\}_{k=1}^N$. Then
$$\big|\mathcal L_1(u)-\widehat{\mathcal L}_1(u)\big|=|\Omega|\,\bigg|\mathbb E_{X\sim U(\Omega)}\Big(\sum_{i,j=1}^{d}a_{ij}(X)u_{x_ix_j}(X)\Big)^2-\frac1N\sum_{k=1}^{N}\Big(\sum_{i,j=1}^{d}a_{ij}(X_k)u_{x_ix_j}(X_k)\Big)^2\bigg|$$
$$=|\Omega|\,\bigg|\mathbb E_{\{\widetilde X_k\}_{k=1}^N}\frac1N\sum_{k=1}^{N}\Big(\sum_{i,j=1}^{d}a_{ij}(\widetilde X_k)u_{x_ix_j}(\widetilde X_k)\Big)^2-\frac1N\sum_{k=1}^{N}\Big(\sum_{i,j=1}^{d}a_{ij}(X_k)u_{x_ix_j}(X_k)\Big)^2\bigg|$$
$$\le\frac{|\Omega|}N\,\mathbb E_{\{\widetilde X_k\}_{k=1}^N}\sum_{i,i',j,j'=1}^{d}\bigg|\sum_{k=1}^{N}\Big[a_{ij}(\widetilde X_k)a_{i'j'}(\widetilde X_k)u_{x_ix_j}(\widetilde X_k)u_{x_{i'}x_{j'}}(\widetilde X_k)-a_{ij}(X_k)a_{i'j'}(X_k)u_{x_ix_j}(X_k)u_{x_{i'}x_{j'}}(X_k)\Big]\bigg|.$$
Hence
$$\mathbb E_{\{X_k\}_{k=1}^N}\sup_{u\in\mathcal P}\big|\mathcal L_1(u)-\widehat{\mathcal L}_1(u)\big|$$
$$\le\frac{|\Omega|}N\,\mathbb E_{\{X_k\}_{k=1}^N}\sup_{u\in\mathcal P}\mathbb E_{\{\widetilde X_k\}_{k=1}^N}\sum_{i,i',j,j'=1}^{d}\bigg|\sum_{k=1}^{N}\Big[a_{ij}(\widetilde X_k)a_{i'j'}(\widetilde X_k)u_{x_ix_j}(\widetilde X_k)u_{x_{i'}x_{j'}}(\widetilde X_k)-a_{ij}(X_k)a_{i'j'}(X_k)u_{x_ix_j}(X_k)u_{x_{i'}x_{j'}}(X_k)\Big]\bigg|$$
$$\le\frac{|\Omega|}N\,\mathbb E_{\{X_k\}_{k=1}^N}\mathbb E_{\{\widetilde X_k\}_{k=1}^N}\sup_{u\in\mathcal P}\sum_{i,i',j,j'=1}^{d}\bigg|\sum_{k=1}^{N}\Big[a_{ij}(\widetilde X_k)a_{i'j'}(\widetilde X_k)u_{x_ix_j}(\widetilde X_k)u_{x_{i'}x_{j'}}(\widetilde X_k)-a_{ij}(X_k)a_{i'j'}(X_k)u_{x_ix_j}(X_k)u_{x_{i'}x_{j'}}(X_k)\Big]\bigg|$$
$$=\frac{|\Omega|}N\,\mathbb E_{\{X_k,\widetilde X_k\}_{k=1}^N}\sup_{u\in\mathcal P}\sum_{i,i',j,j'=1}^{d}\mathbb E_{\{\sigma_k\}_{k=1}^N}\bigg|\sum_{k=1}^{N}\sigma_k\Big[a_{ij}(\widetilde X_k)a_{i'j'}(\widetilde X_k)u_{x_ix_j}(\widetilde X_k)u_{x_{i'}x_{j'}}(\widetilde X_k)-a_{ij}(X_k)a_{i'j'}(X_k)u_{x_ix_j}(X_k)u_{x_{i'}x_{j'}}(X_k)\Big]\bigg|$$
$$\le\frac{|\Omega|}N\,\mathbb E_{\{X_k,\widetilde X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal P}\sum_{i,i',j,j'=1}^{d}\bigg|\sum_{k=1}^{N}\sigma_ka_{ij}(\widetilde X_k)a_{i'j'}(\widetilde X_k)u_{x_ix_j}(\widetilde X_k)u_{x_{i'}x_{j'}}(\widetilde X_k)\bigg|+\frac{|\Omega|}N\,\mathbb E_{\{X_k,\widetilde X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal P}\sum_{i,i',j,j'=1}^{d}\bigg|\sum_{k=1}^{N}\sigma_ka_{ij}(X_k)a_{i'j'}(X_k)u_{x_ix_j}(X_k)u_{x_{i'}x_{j'}}(X_k)\bigg|$$
$$=\frac{2|\Omega|}N\,\mathbb E_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal P}\sum_{i,i',j,j'=1}^{d}\bigg|\sum_{k=1}^{N}\sigma_ka_{ij}(X_k)a_{i'j'}(X_k)u_{x_ix_j}(X_k)u_{x_{i'}x_{j'}}(X_k)\bigg|$$
$$\le\frac{2|\Omega|}N\sum_{i,i',j,j'=1}^{d}\mathbb E_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal P}\bigg|\sum_{k=1}^{N}\sigma_ka_{ij}(X_k)a_{i'j'}(X_k)u_{x_ix_j}(X_k)u_{x_{i'}x_{j'}}(X_k)\bigg|$$
$$=\frac{2|\Omega|}N\sum_{i,i',j,j'=1}^{d}\mathbb E_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{f_{i,i',j,j'}\in\mathcal F_1}\sum_{k=1}^{N}\sigma_ka_{ij}(X_k)a_{i'j'}(X_k)f_{i,i',j,j'}(X_k)$$
$$\le\frac{2|\Omega|A^2}N\sum_{i,i',j,j'=1}^{d}\mathbb E_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{f_{i,i',j,j'}\in\mathcal F_1}\sum_{k=1}^{N}\sigma_kf_{i,i',j,j'}(X_k)$$
$$\le2|\Omega|A^2d^4\,\mathfrak R_{U(\Omega),N}(\mathcal F_1),$$
where the second and eighth steps follow from Jensen's inequality and Lemma 4.2, respectively, and the third and seventh steps use the facts that inserting the Rademacher variables does not change the distribution and that $\mathcal F_1$ is symmetric (i.e., $f\in\mathcal F_1$ implies $-f\in\mathcal F_1$), respectively.

Next we give an upper bound on the Rademacher complexity in terms of the covering number of the corresponding function class.

Definition 4.2 ([3]). Suppose that $W\subset\mathbb R^n$. For any $\epsilon>0$, a set $V\subset\mathbb R^n$ is an $\epsilon$-cover of $W$ with respect to the distance $d_\infty$ if for any $w\in W$ there exists $v\in V$ such that $d_\infty(w,v)<\epsilon$, where $d_\infty(u,v):=\max_{1\le i\le n}|u_i-v_i|$. The covering number $\mathcal C(\epsilon,W,d_\infty)$ is defined to be the minimum cardinality among all $\epsilon$-covers of $W$ with respect to $d_\infty$.

Definition 4.3 ([3]). Suppose that $\mathcal F$ is a class of functions from $\Omega$ to $\mathbb R$. Given $n$ samples $Z_n=(Z_1,Z_2,\cdots,Z_n)\in\Omega^n$, the set $\mathcal F|_{Z_n}\subset\mathbb R^n$ is defined by
$$\mathcal F|_{Z_n}=\{(u(Z_1),u(Z_2),\cdots,u(Z_n)):u\in\mathcal F\}.$$
The uniform covering number $\mathcal C_\infty(\epsilon,\mathcal F,n)$ is defined by
$$\mathcal C_\infty(\epsilon,\mathcal F,n)=\max_{Z_n\in\Omega^n}\mathcal C(\epsilon,\mathcal F|_{Z_n},d_\infty).$$

Lemma 4.4. Let $\Omega$ be a set and $P$ be a probability distribution over $\Omega$, and let $N\in\mathbb N_{\ge1}$. Let $\mathcal F$ be a class of functions from $\Omega$ to $\mathbb R$ such that $0\in\mathcal F$ and the diameter of $\mathcal F$ is less than $\mathcal B$, i.e., $\|u\|_{L^\infty(\Omega)}\le\mathcal B$ for all $u\in\mathcal F$. Then
$$\mathfrak R_{P,N}(\mathcal F)\le\inf_{0<\delta<\mathcal B}\Big(4\delta+\frac{12}{\sqrt N}\int_\delta^{\mathcal B}\sqrt{\log\big(2\mathcal C_\infty(\epsilon,\mathcal F,N)\big)}\,d\epsilon\Big).$$

Proof. The proof is based on the chaining method; see [31].

By Lemma 4.4, it remains to bound the covering number, which can be controlled via the pseudo-dimension [3].

Definition 4.4. Let $\mathcal F$ be a class of functions from $X$ to $\mathbb R$. Suppose that $S=\{x_1,x_2,\cdots,x_n\}\subset X$. We say that $S$ is pseudo-shattered by $\mathcal F$ if there exist $y_1,y_2,\cdots,y_n$ such that for any $b\in\{0,1\}^n$ there exists $u\in\mathcal F$ satisfying
$$\mathrm{sign}(u(x_i)-y_i)=b_i,\qquad i=1,2,\cdots,n,$$
and we say that $\{y_i\}_{i=1}^n$ witnesses the shattering. The pseudo-dimension of $\mathcal F$, denoted $\mathrm{Pdim}(\mathcal F)$, is defined to be the maximum cardinality among all sets pseudo-shattered by $\mathcal F$.
The following proposition gives a relation between the uniform covering number and the pseudo-dimension.

Proposition 4.1 (Theorem 12.2 in [3]). Let $\mathcal F$ be a class of real functions from a domain $X$ to the bounded interval $[0,\mathcal B]$ and let $\epsilon>0$. Then
$$\mathcal C_\infty(\epsilon,\mathcal F,n)\le\sum_{i=1}^{\mathrm{Pdim}(\mathcal F)}\binom{n}{i}\Big(\frac{\mathcal B}{\epsilon}\Big)^i,$$
which is less than $\Big(\frac{en\mathcal B}{\epsilon\cdot\mathrm{Pdim}(\mathcal F)}\Big)^{\mathrm{Pdim}(\mathcal F)}$ for $n\ge\mathrm{Pdim}(\mathcal F)$.

4.2 Bound on the statistical error

By Lemmas 4.1, 4.3 and 4.4 and Proposition 4.1, we can bound the statistical error by bounding the pseudo-dimension of $\mathcal F_i$, $i=1,\cdots,13$. To this end, we show that the $\mathcal F_i$ are subsets of certain neural network classes and then bound the pseudo-dimension of these network classes.
In the following, a "ReLU2-ReLU3 network" is a network whose activation functions are either ReLU2 or ReLU3, and other terminology such as "ReLU-ReLU2-ReLU3 network" is defined similarly.

Proposition 4.2. Let $u$ be a function implemented by a ReLU2-ReLU3 (resp. ReLU3) network with depth $\mathcal D$ and width $\mathcal W$. Then for $i=1,\cdots,d$, $D_iu:=\frac{\partial u}{\partial x_i}$ can be implemented by a ReLU-ReLU2-ReLU3 (resp. ReLU2-ReLU3) network with depth $\mathcal D+2$ and width $(\mathcal D+2)\mathcal W$. Moreover, the neural networks implementing $\{D_iu\}_{i=1}^d$ have the same architecture.

Proof. For the activation function $\rho$ in each unit, we denote by $\widetilde\rho$ its derivative, i.e., $\widetilde\rho(x)=\rho'(x)$. We then have
$$\widetilde\rho=\begin{cases}2\,\mathrm{ReLU}, & \rho=\mathrm{ReLU}^2,\\ 3\,\mathrm{ReLU}^2, & \rho=\mathrm{ReLU}^3.\end{cases}$$

Let $1\le i\le d$. We treat the first two layers in detail and apply induction for layers $k\ge3$, since the first two layers are slightly different. For the first layer, we have for any $q=1,2,\cdots,n_1$
$$D_iu_q^{(1)}=D_i\rho_q^{(1)}\Big(\sum_{j=1}^{d}a_{qj}^{(1)}x_j+b_q^{(1)}\Big)=\widetilde\rho_q^{(1)}\Big(\sum_{j=1}^{d}a_{qj}^{(1)}x_j+b_q^{(1)}\Big)\cdot a_{qi}^{(1)}.$$
Hence $D_iu_q^{(1)}$ can be implemented by a ReLU-ReLU2-ReLU3 network with depth 2 and width 1. For the second layer,
$$D_iu_q^{(2)}=D_i\rho_q^{(2)}\Big(\sum_{j=1}^{n_1}a_{qj}^{(2)}u_j^{(1)}+b_q^{(2)}\Big)=\widetilde\rho_q^{(2)}\Big(\sum_{j=1}^{n_1}a_{qj}^{(2)}u_j^{(1)}+b_q^{(2)}\Big)\cdot\sum_{j=1}^{n_1}a_{qj}^{(2)}D_iu_j^{(1)}.$$
Since $\widetilde\rho_q^{(2)}\big(\sum_{j=1}^{n_1}a_{qj}^{(2)}u_j^{(1)}+b_q^{(2)}\big)$ and $\sum_{j=1}^{n_1}a_{qj}^{(2)}D_iu_j^{(1)}$ can be implemented by two ReLU-ReLU2-ReLU3 subnetworks, respectively, and the multiplication can be implemented by
$$x\cdot y=\frac14\big[(x+y)^2-(x-y)^2\big]=\frac14\big[\mathrm{ReLU}^2(x+y)+\mathrm{ReLU}^2(-x-y)-\mathrm{ReLU}^2(x-y)-\mathrm{ReLU}^2(-x+y)\big],\tag{4.1}$$
we conclude that $D_iu_q^{(2)}$ can be implemented by a ReLU-ReLU2-ReLU3 network. We have
$$\mathcal D\Big(\widetilde\rho_q^{(2)}\Big(\sum_{j=1}^{n_1}a_{qj}^{(2)}u_j^{(1)}+b_q^{(2)}\Big)\Big)=3,\qquad \mathcal W\Big(\widetilde\rho_q^{(2)}\Big(\sum_{j=1}^{n_1}a_{qj}^{(2)}u_j^{(1)}+b_q^{(2)}\Big)\Big)\le\mathcal W$$
and
$$\mathcal D\Big(\sum_{j=1}^{n_1}a_{qj}^{(2)}D_iu_j^{(1)}\Big)=2,\qquad \mathcal W\Big(\sum_{j=1}^{n_1}a_{qj}^{(2)}D_iu_j^{(1)}\Big)\le\mathcal W.$$
Thus $\mathcal D\big(D_iu_q^{(2)}\big)=4$ and $\mathcal W\big(D_iu_q^{(2)}\big)\le\max\{2\mathcal W,4\}$. Now we apply induction for layers $k\ge3$. For the third layer,
$$D_iu_q^{(3)}=D_i\rho_q^{(3)}\Big(\sum_{j=1}^{n_2}a_{qj}^{(3)}u_j^{(2)}+b_q^{(3)}\Big)=\widetilde\rho_q^{(3)}\Big(\sum_{j=1}^{n_2}a_{qj}^{(3)}u_j^{(2)}+b_q^{(3)}\Big)\cdot\sum_{j=1}^{n_2}a_{qj}^{(3)}D_iu_j^{(2)}.$$
Since
$$\mathcal D\Big(\widetilde\rho_q^{(3)}\Big(\sum_{j=1}^{n_2}a_{qj}^{(3)}u_j^{(2)}+b_q^{(3)}\Big)\Big)=4,\qquad \mathcal W\Big(\widetilde\rho_q^{(3)}\Big(\sum_{j=1}^{n_2}a_{qj}^{(3)}u_j^{(2)}+b_q^{(3)}\Big)\Big)\le\mathcal W$$
and
$$\mathcal D\Big(\sum_{j=1}^{n_2}a_{qj}^{(3)}D_iu_j^{(2)}\Big)=4,\qquad \mathcal W\Big(\sum_{j=1}^{n_2}a_{qj}^{(3)}D_iu_j^{(2)}\Big)\le\max\{2\mathcal W,4\mathcal W\}=4\mathcal W,$$
we conclude that $D_iu_q^{(3)}$ can be implemented by a ReLU-ReLU2-ReLU3 network with $\mathcal D\big(D_iu_q^{(3)}\big)=5$ and $\mathcal W\big(D_iu_q^{(3)}\big)\le\max\{5\mathcal W,4\}=5\mathcal W$.
Assume now that $D_iu_q^{(k)}$ ($q=1,2,\cdots,n_k$) can be implemented by a ReLU-ReLU2-ReLU3 network with $\mathcal D\big(D_iu_q^{(k)}\big)=k+2$ and $\mathcal W\big(D_iu_q^{(k)}\big)\le(k+2)\mathcal W$. For the $(k+1)$-th layer,
$$D_iu_q^{(k+1)}=D_i\rho_q^{(k+1)}\Big(\sum_{j=1}^{n_k}a_{qj}^{(k+1)}u_j^{(k)}+b_q^{(k+1)}\Big)=\widetilde\rho_q^{(k+1)}\Big(\sum_{j=1}^{n_k}a_{qj}^{(k+1)}u_j^{(k)}+b_q^{(k+1)}\Big)\cdot\sum_{j=1}^{n_k}a_{qj}^{(k+1)}D_iu_j^{(k)}.$$
Since
$$\mathcal D\Big(\widetilde\rho_q^{(k+1)}\Big(\sum_{j=1}^{n_k}a_{qj}^{(k+1)}u_j^{(k)}+b_q^{(k+1)}\Big)\Big)=k+2,\qquad \mathcal W\Big(\widetilde\rho_q^{(k+1)}\Big(\sum_{j=1}^{n_k}a_{qj}^{(k+1)}u_j^{(k)}+b_q^{(k+1)}\Big)\Big)\le\mathcal W$$
and
$$\mathcal D\Big(\sum_{j=1}^{n_k}a_{qj}^{(k+1)}D_iu_j^{(k)}\Big)=k+2,\qquad \mathcal W\Big(\sum_{j=1}^{n_k}a_{qj}^{(k+1)}D_iu_j^{(k)}\Big)\le\max\{(k+2)\mathcal W,4\mathcal W\}=(k+2)\mathcal W,$$
we conclude that $D_iu_q^{(k+1)}$ can be implemented by a ReLU-ReLU2-ReLU3 network with
$$\mathcal D\big(D_iu_q^{(k+1)}\big)=k+3,\qquad \mathcal W\big(D_iu_q^{(k+1)}\big)\le\max\{(k+3)\mathcal W,4\}=(k+3)\mathcal W.$$
Hence $D_iu=D_iu_1^{(\mathcal D)}$ can be implemented by a ReLU-ReLU2-ReLU3 network with $\mathcal D(D_iu)=\mathcal D+2$ and $\mathcal W(D_iu)\le(\mathcal D+2)\mathcal W$. Moreover, from the argument above, the neural networks implementing $\{D_iu\}_{i=1}^d$ have the same architecture.
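The multiplication identity (4.1) used in the proof above can be checked numerically; the following is a minimal sketch of ours, not part of the original proof.

```python
# Verify x*y = 0.25*(ReLU^2(x+y) + ReLU^2(-x-y) - ReLU^2(x-y) - ReLU^2(-x+y))
import numpy as np


def relu2(t):
    return np.maximum(t, 0.0) ** 2


rng = np.random.default_rng(1)
x, y = rng.standard_normal(10000), rng.standard_normal(10000)
lhs = x * y
rhs = 0.25 * (relu2(x + y) + relu2(-x - y) - relu2(x - y) - relu2(-x + y))
assert np.allclose(lhs, rhs)   # the identity holds for all real x, y
```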

We now present the bound on the pseudo-dimension of $\mathcal N(\mathcal D,\mathcal W,\{\mathrm{ReLU},\mathrm{ReLU}^2,\mathrm{ReLU}^3\})$ for $\mathcal D,\mathcal W\in\mathbb N$. We need the following lemma.

Lemma 4.5 (Theorem 8.3 in [3]). Let $p_1,\cdots,p_m$ be polynomials in $n$ variables of degree at most $d$. If $n\le m$, then
$$\big|\{(\mathrm{sign}(p_1(x)),\cdots,\mathrm{sign}(p_m(x))):x\in\mathbb R^n\}\big|\le2\Big(\frac{2emd}{n}\Big)^n.$$

Proposition 4.3. For any $\mathcal D,\mathcal W\in\mathbb N$,
$$\mathrm{Pdim}\big(\mathcal N(\mathcal D,\mathcal W,\{\mathrm{ReLU},\mathrm{ReLU}^2,\mathrm{ReLU}^3\})\big)=O\big(\mathcal D^2\mathcal W^2(\mathcal D+\log\mathcal W)\big).$$

Proof. The argument follows the proof of Theorem 6 in [4]. The result stated here is somewhat stronger than Theorem 6 in [4], since $\mathrm{VCdim}(\mathrm{sign}(\mathcal F))\le\mathrm{Pdim}(\mathcal F)$ for any function class $\mathcal F$. We consider a new class of functions
$$\widetilde{\mathcal N}=\big\{\widetilde u(x,y)=\mathrm{sign}(u(x)-y):u\in\mathcal N(\mathcal D,\mathcal W,\{\mathrm{ReLU},\mathrm{ReLU}^2,\mathrm{ReLU}^3\})\big\}.$$
It is clear that $\mathrm{Pdim}(\mathcal N(\mathcal D,\mathcal W,\{\mathrm{ReLU},\mathrm{ReLU}^2,\mathrm{ReLU}^3\}))\le\mathrm{VCdim}(\widetilde{\mathcal N})$, so we bound the VC-dimension of $\widetilde{\mathcal N}$. Denote by $M$ the total number of parameters (weights and biases) of the networks implementing functions in $\mathcal N$. We want a uniform bound for
$$K_{\{x_i\},\{y_i\}}(m):=\big|\{(\mathrm{sign}(u(x_1,a)-y_1),\cdots,\mathrm{sign}(u(x_m,a)-y_m)):a\in\mathbb R^M\}\big|$$
over all $\{x_i\}_{i=1}^m\subset X$ and $\{y_i\}_{i=1}^m\subset\mathbb R$; the maximum of $K_{\{x_i\},\{y_i\}}(m)$ over all such samples is the growth function $G_{\widetilde{\mathcal N}}(m)$. In order to apply Lemma 4.5, we partition the parameter space $\mathbb R^M$ into several subsets such that, in each subset, $u(x_i,a)-y_i$ is a polynomial in $a$ without any breakpoints. In fact, our partition is exactly the same as the partition in [4]. Denote the partition by $\{P_1,P_2,\cdots,P_N\}$ for some integer $N$ satisfying
$$N\le\prod_{i=1}^{\mathcal D-1}2\Big(\frac{2emk_i(1+(i-1)3^{i-1})}{M_i}\Big)^{M_i},\tag{4.2}$$
where $k_i$ and $M_i$ denote the number of units at the $i$-th layer and the total number of parameters at the inputs to units in all the layers up to layer $i$ of the network implementing functions in $\mathcal N$, respectively. See [4] for the construction of the partition. Clearly
$$K_{\{x_i\},\{y_i\}}(m)\le\sum_{i=1}^{N}\big|\{(\mathrm{sign}(u(x_1,a)-y_1),\cdots,\mathrm{sign}(u(x_m,a)-y_m)):a\in P_i\}\big|.\tag{4.3}$$
Note that $u(x_i,a)-y_i$ is a polynomial in $a$ of the same degree as $u(x_i,a)$, which equals $1+(\mathcal D-1)3^{\mathcal D-1}$ as shown in [4]. Hence by Lemma 4.5,
$$\big|\{(\mathrm{sign}(u(x_1,a)-y_1),\cdots,\mathrm{sign}(u(x_m,a)-y_m)):a\in P_i\}\big|\le2\Big(\frac{2em(1+(\mathcal D-1)3^{\mathcal D-1})}{M_{\mathcal D}}\Big)^{M_{\mathcal D}}.\tag{4.4}$$
Combining (4.2), (4.3) and (4.4) yields
$$K_{\{x_i\},\{y_i\}}(m)\le\prod_{i=1}^{\mathcal D}2\Big(\frac{2emk_i(1+(i-1)3^{i-1})}{M_i}\Big)^{M_i},$$
and hence
$$G_{\widetilde{\mathcal N}}(m)\le\prod_{i=1}^{\mathcal D}2\Big(\frac{2emk_i(1+(i-1)3^{i-1})}{M_i}\Big)^{M_i},$$
since the maximum of $K_{\{x_i\},\{y_i\}}(m)$ over all $\{x_i\}_{i=1}^m\subset X$ and $\{y_i\}_{i=1}^m\subset\mathbb R$ is the growth function. Following the same algebra as in the proof of Theorem 6 in [4], we obtain
$$\mathrm{Pdim}\big(\mathcal N(\mathcal D,\mathcal W,\{\mathrm{ReLU},\mathrm{ReLU}^2,\mathrm{ReLU}^3\})\big)\le O\big(\mathcal D^2\mathcal W^2(\mathcal D+\log\mathcal W)\big).$$
This completes the proof.

With the above preparations, we are able to derive our result on the statistical error.

Theorem 4.1. Let $\mathcal D,\mathcal W\in\mathbb N$ and $\mathcal B\in\mathbb R^+$. For any $\epsilon>0$, if the numbers of samples satisfy
$$N,\,M=C(d,\Omega,Z,\mathcal B)\,\mathcal D^6\mathcal W^2(\mathcal D+\log\mathcal W)\Big(\frac1\epsilon\Big)^{2+\delta},$$
where $\delta$ is an arbitrarily small positive number, then we have
$$\mathbb E_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal P}\big|\mathcal L(u)-\widehat{\mathcal L}(u)\big|\le\epsilon,$$
where $\mathcal P=\mathcal N(\mathcal D,\mathcal W,\{\|\cdot\|_{C^2(\bar\Omega)},\mathcal B\},\{\mathrm{ReLU}^3\})$.



Proof. We need the following lemma. Let $\Phi=\{\mathrm{ReLU},\mathrm{ReLU}^2,\mathrm{ReLU}^3\}$.

Lemma 4.6. Let $\{\mathcal F_i\}_{i=1}^{13}$ be defined as in Lemma 4.3. There holds
$$\mathcal F_1\subset\mathcal N_1:=\mathcal N(\mathcal D+5,\,2(\mathcal D+2)(\mathcal D+4)\mathcal W,\,\{\|\cdot\|_{C(\bar\Omega)},\mathcal B^2\},\Phi),$$
$$\mathcal F_2\subset\mathcal N_2:=\mathcal N(\mathcal D+3,\,2(\mathcal D+2)\mathcal W,\,\{\|\cdot\|_{C(\bar\Omega)},\mathcal B^2\},\Phi),$$
$$\mathcal F_3\subset\mathcal N_3:=\mathcal N(\mathcal D+1,\,\mathcal W,\,\{\|\cdot\|_{C(\bar\Omega)},\mathcal B^2\},\Phi),$$
$$\mathcal F_5\subset\mathcal N_5:=\mathcal N(\mathcal D+5,\,(\mathcal D+2)(\mathcal D+5)\mathcal W,\,\{\|\cdot\|_{C(\bar\Omega)},\mathcal B^2\},\Phi),$$
$$\mathcal F_6\subset\mathcal N_6:=\mathcal N(\mathcal D+3,\,(\mathcal D+3)\mathcal W,\,\{\|\cdot\|_{C(\bar\Omega)},\mathcal B^2\},\Phi),$$
$$\mathcal F_7\subset\mathcal N_7:=\mathcal N(\mathcal D+4,\,(\mathcal D+2)(\mathcal D+4)\mathcal W,\,\{\|\cdot\|_{C(\bar\Omega)},\mathcal B\},\Phi),$$
$$\mathcal F_8\subset\mathcal N_8:=\mathcal N(\mathcal D+3,\,(\mathcal D+3)\mathcal W,\,\{\|\cdot\|_{C(\bar\Omega)},\mathcal B^2\},\Phi),$$
$$\mathcal F_9\subset\mathcal N_9:=\mathcal N(\mathcal D+2,\,(\mathcal D+2)\mathcal W,\,\{\|\cdot\|_{C(\bar\Omega)},\mathcal B\},\Phi),$$
$$\mathcal F_{10}\subset\mathcal N_{10}:=\mathcal N(\mathcal D,\,\mathcal W,\,\{\|\cdot\|_{C(\bar\Omega)},\mathcal B\},\Phi),$$
$$\mathcal F_{11}\subset\mathcal N_{11}:=\mathcal N(\mathcal D+1,\,\mathcal W,\,\{\|\cdot\|_{C(\bar\Omega)},\mathcal B^2\},\Phi),$$
$$\mathcal F_{13}\subset\mathcal N_{13}:=\mathcal N(\mathcal D,\,\mathcal W,\,\{\|\cdot\|_{C(\bar\Omega)},\mathcal B\},\Phi).$$

Proof. By Proposition 4.2, if $u$ is a ReLU3 network with depth $\mathcal D$ and width $\mathcal W$, then $u_{x_i}$ can be implemented by a ReLU2-ReLU3 network with depth $\mathcal D+2$ and width $(\mathcal D+2)\mathcal W$. Applying Proposition 4.2 again, $u_{x_ix_j}$ can be implemented by a ReLU-ReLU2-ReLU3 network with depth $\mathcal D+4$ and width $(\mathcal D+2)(\mathcal D+4)\mathcal W$. Combining these facts with (4.1) yields the results.

By Lemma 4.4 and Proposition 4.1, we have for $i=1,2,\cdots,10$
$$\mathfrak R_{U(\Omega),N}(\mathcal F_i)\le\inf_{0<\delta<\mathcal B_i}\Big(4\delta+\frac{12}{\sqrt N}\int_\delta^{\mathcal B_i}\sqrt{\log\big(2\mathcal C_\infty(\epsilon,\mathcal F_i,N)\big)}\,d\epsilon\Big)$$
$$\le\inf_{0<\delta<\mathcal B_i}\bigg(4\delta+\frac{12}{\sqrt N}\int_\delta^{\mathcal B_i}\sqrt{\log\Big(2\Big(\frac{eN\mathcal B_i}{\epsilon\cdot\mathrm{Pdim}(\mathcal F_i)}\Big)^{\mathrm{Pdim}(\mathcal F_i)}\Big)}\,d\epsilon\bigg)$$
$$\le\inf_{0<\delta<\mathcal B_i}\bigg(4\delta+\frac{12\mathcal B_i}{\sqrt N}+\frac{12}{\sqrt N}\int_\delta^{\mathcal B_i}\sqrt{\mathrm{Pdim}(\mathcal F_i)\cdot\log\Big(\frac{eN\mathcal B_i}{\epsilon\cdot\mathrm{Pdim}(\mathcal F_i)}\Big)}\,d\epsilon\bigg).\tag{4.5}$$
Now we compute the integral. Let $t=\sqrt{\log\big(\frac{eN\mathcal B_i}{\epsilon\cdot\mathrm{Pdim}(\mathcal F_i)}\big)}$, so that $\epsilon=\frac{eN\mathcal B_i}{\mathrm{Pdim}(\mathcal F_i)}e^{-t^2}$. Denoting
$$t_1=\sqrt{\log\Big(\frac{eN\mathcal B_i}{\mathcal B_i\cdot\mathrm{Pdim}(\mathcal F_i)}\Big)},\qquad t_2=\sqrt{\log\Big(\frac{eN\mathcal B_i}{\delta\cdot\mathrm{Pdim}(\mathcal F_i)}\Big)},$$
we have
$$\int_\delta^{\mathcal B_i}\sqrt{\log\Big(\frac{eN\mathcal B_i}{\epsilon\cdot\mathrm{Pdim}(\mathcal F_i)}\Big)}\,d\epsilon=\frac{2eN\mathcal B_i}{\mathrm{Pdim}(\mathcal F_i)}\int_{t_1}^{t_2}t^2e^{-t^2}\,dt=\frac{2eN\mathcal B_i}{\mathrm{Pdim}(\mathcal F_i)}\int_{t_1}^{t_2}t\Big(\frac{-e^{-t^2}}{2}\Big)'dt$$
$$=\frac{eN\mathcal B_i}{\mathrm{Pdim}(\mathcal F_i)}\Big(t_1e^{-t_1^2}-t_2e^{-t_2^2}+\int_{t_1}^{t_2}e^{-t^2}dt\Big)\le\frac{eN\mathcal B_i}{\mathrm{Pdim}(\mathcal F_i)}\Big(t_1e^{-t_1^2}-t_2e^{-t_2^2}+(t_2-t_1)e^{-t_1^2}\Big)$$
$$\le\frac{eN\mathcal B_i}{\mathrm{Pdim}(\mathcal F_i)}\,t_2e^{-t_1^2}=\mathcal B_i\sqrt{\log\Big(\frac{eN\mathcal B_i}{\delta\cdot\mathrm{Pdim}(\mathcal F_i)}\Big)}.\tag{4.6}$$
Combining (4.5) and (4.6) and choosing $\delta=\mathcal B_i\big(\frac{\mathrm{Pdim}(\mathcal F_i)}{N}\big)^{1/2}\le\mathcal B_i$, we get for $i=1,2,3,5,6,7,8,9,10$
$$\mathfrak R_{U(\Omega),N}(\mathcal F_i)\le\inf_{0<\delta<\mathcal B_i}\bigg(4\delta+\frac{12\mathcal B_i}{\sqrt N}+\frac{12\mathcal B_i\sqrt{\mathrm{Pdim}(\mathcal F_i)}}{\sqrt N}\sqrt{\log\Big(\frac{eN\mathcal B_i}{\delta\cdot\mathrm{Pdim}(\mathcal F_i)}\Big)}\bigg)$$
$$\le28\sqrt{\frac32}\,\mathcal B_i\Big(\frac{\mathrm{Pdim}(\mathcal F_i)}{N}\Big)^{1/2}\sqrt{\log\Big(\frac{eN}{\mathrm{Pdim}(\mathcal F_i)}\Big)}\le28\sqrt{\frac32}\,\mathcal B_i\Big(\frac{\mathrm{Pdim}(\mathcal N_i)}{N}\Big)^{1/2}\sqrt{\log\Big(\frac{eN}{\mathrm{Pdim}(\mathcal N_i)}\Big)}$$
$$\le28\sqrt{\frac32}\max\{\mathcal B,\mathcal B^2\}\Big(\frac HN\Big)^{1/2}\sqrt{\log\Big(\frac{eN}H\Big)},\tag{4.7}$$
with
$$H=4C(\mathcal D+2)^2(\mathcal D+4)^2(\mathcal D+5)^2\mathcal W^2\big(\mathcal D+5+\log((\mathcal D+2)(\mathcal D+4)\mathcal W)\big),$$
where in the fourth step we apply Lemma 4.6 and in the last step we use Proposition 4.3. Similarly, for $i=11,13$,
$$\mathfrak R_{U(\partial\Omega),M}(\mathcal F_i)\le28\sqrt{\frac32}\max\{\mathcal B,\mathcal B^2\}\Big(\frac HM\Big)^{1/2}\sqrt{\log\Big(\frac{eM}H\Big)}.\tag{4.8}$$
Obviously, the complexities of $\mathcal F_4$ and $\mathcal F_{12}$ can be bounded by the right-hand sides of (4.7) and (4.8), respectively. Combining Lemmas 4.1 and 4.3 with (4.7) and (4.8), we have
$$\mathbb E_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal P}\big|\mathcal L(u)-\widehat{\mathcal L}(u)\big|\le\sum_{j=1}^{13}\mathbb E_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal P}\big|\mathcal L_j(u)-\widehat{\mathcal L}_j(u)\big|$$
$$\le28\sqrt{\frac32}\max\{\mathcal B,\mathcal B^2\}\bigg(40|\Omega|C_1\Big(\frac HN\Big)^{1/2}\sqrt{\log\Big(\frac{eN}H\Big)}+12|\partial\Omega|C_2\Big(\frac HM\Big)^{1/2}\sqrt{\log\Big(\frac{eM}H\Big)}\bigg),$$
where
$$C_1=\max\{A^2d^4,B^2d^2,C^2d^2,F^2,ABd^3,ACd^2,AFd^2,BCd,BFd,CF\},\qquad C_2=\max\{E^2,G^2,EG\}.$$
Hence for any $\epsilon>0$, if the numbers of samples satisfy
$$N,\,M=C(d,\Omega,Z,\mathcal B)\,\mathcal D^6\mathcal W^2(\mathcal D+\log\mathcal W)\Big(\frac1\epsilon\Big)^{2+\delta}$$
with $\delta$ an arbitrarily small positive number, then we have
$$\mathbb E_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal P}\big|\mathcal L(u)-\widehat{\mathcal L}(u)\big|\le\epsilon.$$
This completes the proof.

5 Convergence rate for the PINNs

With the preparations of the last two sections on the bounds of the approximation and statistical errors, we now give the main result.

Theorem 5.1. Let Assumption 2.1 hold and further assume $u^*\in C^3(\bar\Omega)$. For any $\epsilon>0$, choose the parameterized neural network class
$$\mathcal P=\mathcal N\Big(\lceil\log_2 d\rceil+2,\ C(d,\Omega,Z,\|u^*\|_{C^3(\bar\Omega)})\Big(\frac1\epsilon\Big)^d,\ \{\|\cdot\|_{C^2(\bar\Omega)},\,2\|u^*\|_{C^2(\bar\Omega)}\},\ \{\mathrm{ReLU}^3\}\Big)$$
and let the numbers of samples be
$$N,\,M=C(d,\Omega,Z,\|u^*\|_{C^3(\bar\Omega)})\Big(\frac1\epsilon\Big)^{2d+4+\delta},$$
where $\delta$ is an arbitrarily small positive number. Then we have
$$\mathbb E_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\big\|\widehat u_\phi-u^*\big\|_{H^{\frac12}(\Omega)}\le\epsilon.$$

Proof. For any $\epsilon>0$, by Theorem 3.1 there exists a neural network function $u_\phi$ with depth $\lceil\log_2 d\rceil+2$ and width $C(d,\Omega,Z,\|u^*\|_{C^3(\bar\Omega)})\big(\frac1\epsilon\big)^d$ such that
$$\|u^*-u_\phi\|_{C^2(\bar\Omega)}\le\Big(\frac{\epsilon^2}{3C(d^2+3d+4)|\Omega|\max\{2d^2A^2,dB^2,C^2\}+2C|\partial\Omega|E^2}\Big)^{1/2}.$$
Without loss of generality we assume that $\epsilon$ is small enough so that
$$\|u_\phi\|_{C^2(\bar\Omega)}\le\|u^*-u_\phi\|_{C^2(\bar\Omega)}+\|u^*\|_{C^2(\bar\Omega)}\le2\|u^*\|_{C^2(\bar\Omega)}.$$
Hence $u_\phi$ belongs to the function class
$$\mathcal P=\mathcal N\Big(\lceil\log_2 d\rceil+2,\ C(d,\Omega,Z,\|u^*\|_{C^3(\bar\Omega)})\Big(\frac1\epsilon\Big)^d,\ \{\|\cdot\|_{C^2(\bar\Omega)},\,2\|u^*\|_{C^2(\bar\Omega)}\},\ \{\mathrm{ReLU}^3\}\Big),$$
and
$$\mathcal E_{app}\le\frac32(d^2+3d+4)|\Omega|\max\{2d^2A^2,dB^2,C^2\}\|u_\phi-u^*\|^2_{C^2(\bar\Omega)}+|\partial\Omega|E^2\|u_\phi-u^*\|^2_{C^2(\bar\Omega)}\le\frac{\epsilon^2}{2C}.\tag{5.1}$$
By Theorem 4.1, when the number of samples is
$$N,\,M=C(d,\Omega,Z,\mathcal B)\,\mathcal D^6\mathcal W^2(\mathcal D+\log\mathcal W)\Big(\frac1\epsilon\Big)^{4+\delta}=C(d,\Omega,Z,\|u^*\|_{C^3(\bar\Omega)})\Big(\frac1\epsilon\Big)^{2d+4+\delta}$$
with $\delta$ an arbitrarily small positive number, we have
$$\mathcal E_{sta}=\mathbb E_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal P}\big|\mathcal L(u)-\widehat{\mathcal L}(u)\big|\le\frac{\epsilon^2}{2C}.\tag{5.2}$$
Combining Proposition 2.1, (5.1) and (5.2) yields the result.

Remark 5.1. Asymptotic convergence results for PINNs have been studied in [21, 26, 27]: when the number of parameters in the neural networks and the number of training samples go to infinity, the solution of PINNs converges to the solution of the PDE. Our work establishes a nonasymptotic convergence rate of PINNs. According to Theorem 5.1, the influence of the depth and width of the neural networks and of the number of training samples is characterized quantitatively. This gives an answer to how to choose the hyper-parameters to achieve a desired accuracy, which is missing in [21, 26, 27].
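As a concrete reading of Theorem 5.1 (our illustration; the constants hidden in the $O$-notation and in $C(d,\Omega,Z,\|u^*\|_{C^3(\bar\Omega)})$ are not tracked here), consider $d=2$ and a target accuracy $\epsilon$:
$$\mathcal D=\lceil\log_2 2\rceil+2=3,\qquad \mathcal W=O\big(\epsilon^{-2}\big),\qquad N,\,M=O\big(\epsilon^{-(8+\delta)}\big),$$
so halving the target error roughly quadruples the required width and multiplies the required sample sizes by about $2^{8}=256$.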

Remark 5.2. As shown in Theorem 5.1, PINNs suffer from the curse of dimensionality. However, if no additional smoothness or low-dimensional compositional structure of the underlying solutions of the PDEs is imposed, the curse is unavoidable. Recently, a minimax lower bound for PINNs was proved in [18], where it was shown that to achieve an error of $\epsilon$ for PINNs solving an elliptic equation defined on $[0,1]^d$ whose solution lives in $H^1([0,1]^d)$, the number of samples $n$ is lower bounded by $O((1/\epsilon)^d)$. In [19], the authors reduce the curse by assuming that the underlying solutions live in Barron spaces, which are much smaller than $H^1$. One can also exploit the structure of the solution to reduce the curse, for example by considering PDEs whose solutions are compositions of functions with a number of variables much smaller than $d$.

6 Conclusion

This paper provided an analysis of the convergence rate of PINNs. Our results give a guideline for setting the depth and width of the networks to achieve the desired convergence rate in terms of the number of training samples. The estimate of the approximation error of deep ReLU3 networks is established in the C2 norm. The statistical error is derived via the Rademacher complexity of the non-Lipschitz compositions with ReLU3 networks. It would be interesting to extend the current analysis of PINNs to PDEs with other boundary conditions or to optimal control (inverse) problems.

Acknowledgments

This work is supported by the National Key Research and Development Program of China (No. 2020YFA0714200), by the National Science Foundation of China (No. 12125103, No. 12071362, No. 11971468, No. 11871474, No. 11871385), by the Natural Science Foundation of Hubei Province (No. 2021AAA010, No. 2019CFA007), and by the Fundamental Research Funds for the Central Universities. The numerical calculations have been done at the Supercomputing Center of Wuhan University.

References

[1] S. Agmon, A. Douglis, and L. Nirenberg, Estimates near the boundary for solutions of elliptic partial differential equations satisfying general boundary conditions. I, Communications on Pure and Applied Mathematics, 12 (1959), pp. 623-727.
[2] C. Anitescu, E. Atroshchenko, N. Alajlan, and T. Rabczuk, Artificial neural network methods for the solution of second order boundary value problems, CMC-Computers Materials & Continua, 59 (2019), pp. 345-359.
[3] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, 2009.
[4] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian, Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks, J. Mach. Learn. Res., 20 (2019), pp. 1-17.
[5] J. Berner, M. Dablander, and P. Grohs, Numerically solving parametric families of high-dimensional Kolmogorov partial differential equations via deep learning, in Advances in Neural Information Processing Systems, vol. 33, Curran Associates, Inc., 2020, pp. 16615-16627.
[6] S. Brenner and R. Scott, The Mathematical Theory of Finite Element Methods, vol. 15, Springer Science & Business Media, 2007.
[7] P. G. Ciarlet, The Finite Element Method for Elliptic Problems, SIAM, 2002.
[8] C. Duan, Y. Jiao, Y. Lai, X. Lu, and Z. Yang, Convergence rate analysis for deep Ritz method, Communications in Computational Physics, (2022).
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial networks, Advances in Neural Information Processing Systems, 3 (2014).
[10] J. Han, A. Jentzen, and E. Weinan, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505-8510.
[11] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026-1034.
[12] Q. Hong, J. W. Siegel, and J. Xu, Rademacher complexity and numerical quadrature analysis of stable neural networks with applications to numerical PDEs, arXiv preprint arXiv:2104.02903, (2021).
[13] T. J. Hughes, The Finite Element Method: Linear Static and Dynamic Finite Element Analysis, Courier Corporation, 2012.
[14] A. D. Jagtap, E. Kharazmi, and G. E. Karniadakis, Conservative physics-informed neural networks on discrete domains for conservation laws: Applications to forward and inverse problems, Computer Methods in Applied Mechanics and Engineering, 365 (2020), p. 113028.
[15] Y. Jiao, L. Kang, Y. Lai, and Y. Xu, Approximation of deep ReLU^k neural network in Sobolev spaces, (2022).
[16] I. E. Lagaris, A. Likas, and D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Transactions on Neural Networks, 9 (1998), pp. 987-1000.
[17] L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis, DeepXDE: A deep learning library for solving differential equations, SIAM Review, 63 (2021), pp. 208-228.
[18] Y. Lu, H. Chen, J. Lu, L. Ying, and J. Blanchet, Machine learning for elliptic PDEs: Fast rate generalization bound, neural scaling law and minimax optimality, in ICLR, 2021.
[19] Y. Lu, J. Lu, and M. Wang, A priori generalization analysis of the deep Ritz method for solving high dimensional elliptic partial differential equations, in Conference on Learning Theory, PMLR, 2021, pp. 3196-3241.
[20] T. Luo and H. Yang, Two-layer neural networks for partial differential equations: Optimization and generalization theory, arXiv preprint arXiv:2006.15733, (2020).
[21] S. Mishra and R. Molinaro, Estimates on the generalization error of physics-informed neural networks for approximating a class of inverse problems for PDEs, IMA Journal of Numerical Analysis, (2021).
[22] G. Pang, M. D'Elia, M. Parks, and G. Karniadakis, nPINNs: Nonlocal physics-informed neural networks for a parametrized nonlocal universal Laplacian operator. Algorithms and applications, Journal of Computational Physics, 422 (2020), p. 109760.
[23] G. Pang, L. Lu, and G. E. Karniadakis, fPINNs: Fractional physics-informed neural networks, SIAM Journal on Scientific Computing, 41 (2019), pp. A2603-A2626.
[24] A. Quarteroni and A. Valli, Numerical Approximation of Partial Differential Equations, vol. 23, Springer Science & Business Media, 2008.
[25] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 378 (2019), pp. 686-707.
[26] Y. Shin, J. Darbon, and G. E. Karniadakis, On the convergence of physics informed neural networks for linear second-order elliptic and parabolic type PDEs, arXiv preprint arXiv:2004.01806, (2020).
[27] Y. Shin, Z. Zhang, and G. E. Karniadakis, Error estimates of residual minimization using neural networks for linear PDEs, arXiv preprint arXiv:2010.08019, (2020).
[28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529 (2016), pp. 484-489.
[29] J. A. Sirignano and K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, Journal of Computational Physics, 375 (2018), pp. 1339-1364.
[30] J. Thomas, Numerical Partial Differential Equations: Finite Difference Methods, vol. 22, Springer Science & Business Media, 2013.
[31] A. W. van der Vaart and J. A. Wellner, Weak convergence, in Weak Convergence and Empirical Processes, Springer, 1996.
[32] E. Weinan and B. Yu, The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics, 6 (2017), pp. 1-12.
[33] Y. Zang, G. Bao, X. Ye, and H. Zhou, Weak adversarial networks for high-dimensional partial differential equations, Journal of Computational Physics, 411 (2020), p. 109409.
