A Rate of Convergence of Physics Informed Neural Networks For The Linear Second Order Elliptic PDEs
1272-1295
doi: 10.4208/cicp.OA-2021-0186 April 2022
Abstract. In recent years, physics-informed neural networks (PINNs) have been shown empirically to be a powerful tool for solving PDEs. However, a numerical analysis of PINNs is still missing. In this paper, we prove a convergence rate of PINNs for second order elliptic equations with Dirichlet boundary condition, by establishing upper bounds on the number of training samples and on the depth and width of the deep neural networks needed to achieve a desired accuracy. The error of PINNs is decomposed into approximation error and statistical error, where the approximation error is given in the $C^2$ norm with ReLU$^3$ networks (deep networks with activation function $\max\{0,x^3\}$) and the statistical error is estimated by Rademacher complexity. We derive a bound on the Rademacher complexity of the non-Lipschitz composition of the gradient norm with a ReLU$^3$ network, which is of independent interest.
1 Introduction
Classical numerical methods such as the finite element method are successful in solving low-dimensional PDEs, see e.g. [6, 7, 13, 24, 30]. However, these methods may encounter difficulties in both theoretical analysis and numerical implementation for
http://www.global-sci.com/cicp 1272 ©2022 Global-Science Press
Y. Jiao et al. / Commun. Comput. Phys., 31 (2022), pp. 1272-1295 1273
high-dimensional PDEs. Motivated by the great success of deep learning for high-dimensional data analysis in discriminative, generative and reinforcement learning [9, 11, 28], solving high-dimensional PDEs with deep neural networks has become a promising approach and has attracted much attention [2, 5, 10, 17, 25, 29, 32, 33]. Owing to the excellent approximation ability of deep neural networks, several numerical schemes have been proposed to solve PDEs with deep neural networks, including the deep Ritz method (DRM) [32], physics-informed neural networks (PINNs) [25], and the deep Galerkin method (DGM) [33]. Both DRM and DGM are applied to variational forms of PDEs, while PINNs are based on residual minimization of the differential equation, see [2, 17, 25, 29], and can be extended to general PDEs [14, 16, 22, 23].
Although the deep PDE solvers mentioned above work well empirically, rigorous numerical analysis of these methods is far from complete. The convergence rates of DRM with two-layer networks and deep networks are studied in [8, 12, 19, 20], and the convergence of PINNs is given in [21, 26, 27]. In this work, we provide a nonasymptotic convergence rate of PINNs with ReLU$^3$ networks, i.e., a quantitative error estimate with respect to the architecture of the neural networks (the depth and width) and the number of samples. Hence it gives a rule for determining the hyper-parameters needed to achieve a desired accuracy. Our contributions are summarized as follows.
where $d$, $\bar\Omega$ and $C(d, \|\bar u\|_{C^3(\bar\Omega)})$ stand for the dimension of $x$, the closure of the domain $\Omega$ and a numerical constant that depends only on $(d, \|\bar u\|_{C^3(\bar\Omega)})$, respectively.
• We establish an upper bound on the statistical error for PINNs by applying the tools of pseudo-dimension. In particular, we give an upper bound on the Rademacher complexity of the derivative of a ReLU$^3$ network, which is a non-Lipschitz composition with the ReLU$^3$ network, by calculating the pseudo-dimension of networks with ReLU, ReLU$^2$ and ReLU$^3$ activation functions, see Theorem 4.1. We prove that $\forall \mathcal{D}, \mathcal{W} \in \mathbb{N}$ and $\epsilon > 0$, if the number of training samples in PINNs is of the order $O\left(\mathcal{D}^6\mathcal{W}^2(\mathcal{D}+\log\mathcal{W})\left(\frac{1}{\epsilon}\right)^{2+\delta}\right)$, where $\delta$ is an arbitrarily small positive number, then the statistical error satisfies
$$
\mathbb{E}_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal{P}}\left|\mathcal{L}(u)-\widehat{\mathcal{L}}(u)\right| \le \epsilon,
$$
where $\mathcal{L}$ and $\widehat{\mathcal{L}}$ are the loss functions defined in (2.2) and (2.3), respectively.
• Based on the above bounds on the approximation error and the statistical error, we establish a nonasymptotic convergence rate of PINNs for second order elliptic equations with Dirichlet boundary condition, see Theorem 5.1. That is, $\forall \epsilon > 0$, if we set the depth and width of the ReLU$^3$ network by
$$
\mathcal{D} = O(\lceil \log_2 d\rceil + 2), \qquad \mathcal{W} = O\left(\left(\frac{1}{\epsilon}\right)^{d}\right),
$$
and set the number of training samples used in PINNs as $O\left(\left(\frac{1}{\epsilon}\right)^{2d+4+\delta}\right)$, where $\delta$ is an arbitrarily small positive number, then
$$
\mathbb{E}_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\left\|\hat u_\phi - u^*\right\|_{H^{1/2}(\Omega)} \le \epsilon.
$$
The paper is organized as follows. In Section 2 we describe the problem setting and introduce the error decomposition for PINNs. In Section 3 we prove the approximation error of ReLU$^3$ networks in Sobolev spaces, based on approximation results for B-splines. In Section 4 we provide the statistical error estimate by bounding the Rademacher complexity. In Section 5 the convergence rate of PINNs is shown. A conclusion and a short discussion are given in Section 6.
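$$
\begin{aligned}
f_0(x) &= x,\\
f_\ell(x) &= \varrho_\ell(A_\ell f_{\ell-1} + b_\ell) = \left(\varrho_i^{(\ell)}\big((A_\ell f_{\ell-1} + b_\ell)_i\big)\right) \quad \text{for } \ell = 1,\dots,\mathcal{D}-1,\\
f := f_{\mathcal{D}}(x) &= A_{\mathcal{D}} f_{\mathcal{D}-1} + b_{\mathcal{D}},
\end{aligned}
$$
where $A_\ell = \big(a_{ij}^{(\ell)}\big) \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$, $b_\ell = \big(b_i^{(\ell)}\big) \in \mathbb{R}^{n_\ell}$ and $\varrho_\ell : \mathbb{R}^{n_\ell} \to \mathbb{R}^{n_\ell}$ is the activation function. The hyper-parameters $\mathcal{D}$ and $\mathcal{W} := \max\{n_\ell,\ \ell = 0,\dots,\mathcal{D}\}$ are called the depth and the width of the network, respectively. Also, $\sum_{\ell=1}^{\mathcal{D}} n_\ell$ is called the number of units of $f$, and $\phi = \{A_\ell, b_\ell\}_\ell$ are called the weight parameters. Let $\Phi$ be a set of activation functions and $X$ be a Banach space; we define the neural network function class by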
Next we introduce the function spaces and the partial differential equations that we are interested in. The governing equation is defined on an open bounded domain $\Omega \subset \mathbb{R}^d$ with smooth boundary $\partial\Omega$, and without loss of generality we may assume that $\Omega \subset [0,1]^d$. Let $\alpha = (\alpha_1,\dots,\alpha_n)$ be an $n$-dimensional index with $|\alpha| := \sum_{i=1}^n \alpha_i$ and let $s$ be a nonnegative integer. The standard function spaces, including the spaces of continuous functions, the $L^p$ spaces and the Sobolev spaces, are given below.
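If $s$ is a nonnegative real number, the fractional Sobolev space $W^{s,p}(\Omega)$ can be defined as follows: setting $\theta = s - \lfloor s\rfloor$ and
$$
W^{s,p}(\Omega) := \left\{ f : \Omega \to \mathbb{R} \;:\; \int_\Omega\int_\Omega \frac{|D^\alpha f(x) - D^\alpha f(y)|^p}{|x-y|^{\theta p + d}}\,dx\,dy < \infty,\ \forall |\alpha| = \lfloor s\rfloor \right\},
$$
$$
\|f\|_{W^{s,p}(\Omega)} := \left( \|f\|^p_{W^{\lfloor s\rfloor,p}(\Omega)} + \sum_{|\alpha| = \lfloor s\rfloor} \int_\Omega\int_\Omega \frac{|D^\alpha f(x) - D^\alpha f(y)|^p}{|x-y|^{\theta p + d}}\,dx\,dy \right)^{1/p}.
$$
Let $C_0^\infty(\Omega)$ be the set of smooth functions with compact support in $\Omega$, and let $W_0^{s,p}(\Omega)$ be the completion of $C_0^\infty(\Omega)$ in $W^{s,p}(\Omega)$. For $s < 0$, $W^{s,p}(\Omega)$ is the dual space of $W_0^{-s,q}(\Omega)$ with $q$ satisfying $\frac{1}{p} + \frac{1}{q} = 1$. When $p = 2$, $W^{s,p}(\Omega)$ is a Hilbert space and is also denoted by $H^s(\Omega)$. The constant $C$ may vary from place to place, but it is independent of the hyper-parameters of the neural networks.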
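We will consider the following linear second order elliptic equation with Dirichlet boundary condition, where $\frac{\partial u}{\partial x_i}(x)$ and $\frac{\partial^2 u}{\partial x_i \partial x_j}(x)$ are denoted by $u_{x_i}$ and $u_{x_i x_j}$, respectively:
$$
\begin{cases}
-\sum\limits_{i,j=1}^d a_{ij} u_{x_i x_j} + \sum\limits_{i=1}^d b_i u_{x_i} + cu = f, & \text{in } \Omega,\\[4pt]
eu = g, & \text{on } \partial\Omega,
\end{cases}
\tag{2.1}
$$
where $a_{ij} \in C(\bar\Omega)$, $b_i, c \in L^\infty(\Omega)$, $f \in L^2(\Omega)$, $e \in C(\partial\Omega)$, $g \in L^2(\partial\Omega)$, with the strictly elliptic condition, i.e., there exists a constant $\lambda > 0$ such that $\sum_{i,j} a_{ij}\xi_i\xi_j \ge \lambda|\xi|^2$, $\forall x \in \Omega$, $\xi \in \mathbb{R}^d$.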
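Define the constants by $\mathfrak{A} = \max_{i,j}\{\|a_{ij}\|_{C(\bar\Omega)}\}$, $\mathfrak{B} = \max_i\{\|b_i\|_{L^\infty(\Omega)}\}$, $\mathfrak{C} = \|c\|_{L^\infty(\Omega)}$, $\mathfrak{E} = \|e\|_{C(\partial\Omega)}$, $\mathfrak{F} = \|f\|_{L^2(\Omega)}$, $\mathfrak{G} = \|g\|_{L^2(\partial\Omega)}$, and $\mathfrak{Z} = \{\mathfrak{A},\mathfrak{B},\mathfrak{C},\mathfrak{E},\mathfrak{F},\mathfrak{G}\}$.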
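Instead of solving problem (2.1), we consider a minimization problem with the loss functional $\mathcal{L}$ on $C^2(\Omega)\cap C(\bar\Omega)$:
$$
\mathcal{L}(u) := \int_\Omega \left( -\sum_{i,j=1}^d a_{ij}u_{x_i x_j} + \sum_{i=1}^d b_i u_{x_i} + cu - f \right)^2 dx + \int_{\partial\Omega} (eu - g)^2\, ds.
\tag{2.2}
$$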
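Assumption 2.1. Assume that (2.1) has a unique strong solution $u^* \in C^2(\Omega)\cap C(\bar\Omega)$.
If Assumption 2.1 holds, then $u^*$ is also the unique minimizer of the loss functional $\mathcal{L}$. Let $|\Omega|$ and $|\partial\Omega|$ be the measures of $\Omega$ and of its boundary, respectively, i.e., $|\Omega| := \int_\Omega 1\,dx$ and $|\partial\Omega| := \int_{\partial\Omega} 1\,ds$. Then $\mathcal{L}$ can be equivalently written as
$$
\begin{aligned}
\mathcal{L}(u) ={}& |\Omega|\,\mathbb{E}_{X\sim U(\Omega)}\left( -\sum_{i,j=1}^d a_{ij}(X)u_{x_i x_j}(X) + \sum_{i=1}^d b_i(X)u_{x_i}(X) + c(X)u(X) - f(X) \right)^2\\
&+ |\partial\Omega|\,\mathbb{E}_{Y\sim U(\partial\Omega)}\big(e(Y)u(Y) - g(Y)\big)^2,
\end{aligned}
$$
where $U(\Omega)$ and $U(\partial\Omega)$ are the uniform distributions on $\Omega$ and $\partial\Omega$, respectively.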
To minimize $\mathcal{L}(u)$ numerically, a standard discrete version of $\mathcal{L}$ is given by
$$
\begin{aligned}
\widehat{\mathcal{L}}(u) ={}& \frac{|\Omega|}{N}\sum_{k=1}^N \left( -\sum_{i,j=1}^d a_{ij}(X_k)u_{x_i x_j}(X_k) + \sum_{i=1}^d b_i(X_k)u_{x_i}(X_k) + c(X_k)u(X_k) - f(X_k) \right)^2\\
&+ \frac{|\partial\Omega|}{M}\sum_{k=1}^M \big(e(Y_k)u(Y_k) - g(Y_k)\big)^2,
\end{aligned}
\tag{2.3}
$$
where $\{X_k\}_{k=1}^N$ and $\{Y_k\}_{k=1}^M$ are i.i.d. random samples drawn from the uniform distributions $U(\Omega)$ on $\Omega$ and $U(\partial\Omega)$ on $\partial\Omega$, respectively.
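As an illustration of the discrete loss (2.3), the following minimal sketch evaluates it for a hypothetical test problem that is not taken from the paper: $-\Delta u = f$ on $\Omega = (0,1)^2$ with $u = g$ on $\partial\Omega$, i.e. $a_{ij} = \delta_{ij}$, $b_i = 0$, $c = 0$, $e = 1$ in (2.1), with exact solution $u^*(x,y) = \sin(\pi x)\sin(\pi y)$. All function names are assumptions made for the example, and second derivatives are supplied analytically rather than by automatic differentiation.

```python
import math
import random

# Hypothetical test problem (an assumption for illustration, not from the paper):
#   -Laplace(u) = f in (0,1)^2,  u = g on the boundary,
# with exact solution u*(x, y) = sin(pi x) sin(pi y), f = 2 pi^2 u*, g = 0.

def u_star(x, y):
    return math.sin(math.pi * x) * math.sin(math.pi * y)

def lap_u_star(x, y):
    return -2.0 * math.pi ** 2 * u_star(x, y)

def f(x, y):
    return 2.0 * math.pi ** 2 * u_star(x, y)

def g(x, y):
    return 0.0

# Perturbed candidate u = u* + 0.1 x(1-x)y(1-y); the bump vanishes on the boundary.
def u_pert(x, y):
    return u_star(x, y) + 0.1 * x * (1 - x) * y * (1 - y)

def lap_u_pert(x, y):
    return lap_u_star(x, y) + 0.1 * (-2.0 * y * (1 - y) - 2.0 * x * (1 - x))

def sample_interior(n, rng):
    """Draw n i.i.d. points X_k from the uniform distribution on (0,1)^2."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def sample_boundary(m, rng):
    """Draw m i.i.d. points Y_k uniformly on the perimeter of the unit square."""
    points = []
    for _ in range(m):
        t = rng.random()
        side = rng.randrange(4)  # the four sides have equal length
        points.append([(t, 0.0), (t, 1.0), (0.0, t), (1.0, t)][side])
    return points

def empirical_loss(u, lap_u, interior, boundary, area=1.0, perimeter=4.0):
    """Discrete loss (2.3) for the test problem: |Omega| and |boundary| times sample means."""
    residual = sum((-lap_u(x, y) - f(x, y)) ** 2 for x, y in interior) / len(interior)
    mismatch = sum((u(x, y) - g(x, y)) ** 2 for x, y in boundary) / len(boundary)
    return area * residual + perimeter * mismatch

rng = random.Random(0)
X = sample_interior(2000, rng)
Y = sample_boundary(500, rng)
loss_exact = empirical_loss(u_star, lap_u_star, X, Y)
loss_perturbed = empirical_loss(u_pert, lap_u_pert, X, Y)
print(loss_exact, loss_perturbed)
```

The exact solution gives a loss that is zero up to rounding, while the perturbed candidate gives a strictly positive value, which is the behaviour the minimization of the discrete loss relies on.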
The PINNs method considers a minimization problem with respect to $\widehat{\mathcal{L}}$:
$$
\min_{u_\phi \in \mathcal{P}} \widehat{\mathcal{L}}(u_\phi),
\tag{2.4}
$$
where the admissible set $\mathcal{P}$ refers to the deep neural network function class parameterized by $\phi$. Assume that there exists at least one minimizer of (2.4), which is denoted by $\hat u_\phi$. We will give an estimate of the error between $u^*$ and $\hat u_\phi$ for a carefully chosen admissible set $\mathcal{P}$. The definition of $\mathcal{P}$ will be given at the beginning of Section 3.
Remark 2.1. In practice, it is difficult to compute $\hat u_\phi$ precisely due to the nonlinear structure of $\mathcal{P}$. One often calls a (random) solver $\mathcal{A}$, say SGD, to solve (2.4) and takes the output $u_{\phi_{\mathcal{A}}}$ as the final solution. In this work we ignore the optimization error and only consider the error between $u^*$ and $\hat u_\phi$.
Remark 2.2. The existence of a minimizer of problem (2.4) depends on the choice of the admissible set $\mathcal{P}$. In case a minimizer does not exist for some admissible set $\mathcal{P}$, we may consider a $\gamma$-optimal solution $\hat u_\phi$, i.e., $\widehat{\mathcal{L}}(\hat u_\phi) \le \inf_{u_\phi\in\mathcal{P}} \widehat{\mathcal{L}}(u_\phi) + \gamma$ for an arbitrarily small positive $\gamma$. The main results in this work also hold for the $\gamma$-optimal solution.
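Next we decompose the error into the approximation error and the statistical error.
Proposition 2.1. Assume that $\mathcal{P} \subset H^2(\Omega)\cap C(\bar\Omega)$, then
$$
\begin{aligned}
\|\hat u_\phi - u^*\|^2_{H^{1/2}(\Omega)}
\le{}& C\underbrace{\inf_{\bar u\in\mathcal{P}}\left[ 3\max\{2d^2\mathfrak{A}^2, d\mathfrak{B}^2, \mathfrak{C}^2\}\|\bar u - u^*\|^2_{H^2(\Omega)} + |\partial\Omega|\mathfrak{E}^2\|\bar u - u^*\|^2_{C(\partial\Omega)} \right]}_{\mathcal{E}_{app}}\\
&+ C\underbrace{\sup_{u\in\mathcal{P}}\left|\mathcal{L}(u) - \widehat{\mathcal{L}}(u)\right|}_{\mathcal{E}_{sta}}.
\end{aligned}
$$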
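where the last step is due to the fact that $\widehat{\mathcal{L}}(\hat u_\phi) - \widehat{\mathcal{L}}(\bar u) \le 0$. Since $\bar u$ can be any element of $\mathcal{P}$, we take the infimum on both sides and obtain
$$
\mathcal{L}(\hat u_\phi) - \mathcal{L}(u^*) \le \inf_{\bar u\in\mathcal{P}}\left[\mathcal{L}(\bar u) - \mathcal{L}(u^*)\right] + 2\sup_{u\in\mathcal{P}}\left|\mathcal{L}(u) - \widehat{\mathcal{L}}(u)\right|.
\tag{2.5}
$$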
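The step leading to (2.5) can be spelled out as follows. This is a sketch of the standard telescoping argument, written for an arbitrary $\bar u \in \mathcal{P}$; the two sampling-gap terms are each bounded by the supremum, which produces a factor 2 that can be absorbed into the generic constant $C$ of Proposition 2.1.

```latex
\begin{aligned}
\mathcal{L}(\hat u_\phi) - \mathcal{L}(u^*)
&= \bigl[\mathcal{L}(\hat u_\phi) - \widehat{\mathcal{L}}(\hat u_\phi)\bigr]
 + \bigl[\widehat{\mathcal{L}}(\hat u_\phi) - \widehat{\mathcal{L}}(\bar u)\bigr]
 + \bigl[\widehat{\mathcal{L}}(\bar u) - \mathcal{L}(\bar u)\bigr]
 + \bigl[\mathcal{L}(\bar u) - \mathcal{L}(u^*)\bigr] \\
&\le 2\sup_{u\in\mathcal{P}}\bigl|\mathcal{L}(u) - \widehat{\mathcal{L}}(u)\bigr|
 + \bigl[\mathcal{L}(\bar u) - \mathcal{L}(u^*)\bigr],
\end{aligned}
```

where the second bracket is dropped because $\hat u_\phi$ minimizes $\widehat{\mathcal{L}}$ over $\mathcal{P}$.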
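Since $u^*$ is the strong solution of Eq. (2.1) and $\mathcal{L}(u^*) = 0$, for all $u \in \mathcal{P}$ we have
$$
\begin{aligned}
\mathcal{L}(u) - \mathcal{L}(u^*)
&= \int_\Omega \left( -\sum_{i,j=1}^d a_{ij}u_{x_i x_j} + \sum_{i=1}^d b_i u_{x_i} + cu - f \right)^2 dx + \int_{\partial\Omega}(eu - g)^2\,ds\\
&= \int_\Omega \left( -\sum_{i,j=1}^d a_{ij}(u - u^*)_{x_i x_j} + \sum_{i=1}^d b_i (u - u^*)_{x_i} + c(u - u^*) \right)^2 dx + \int_{\partial\Omega} e^2(u - u^*)^2\,ds\\
&\le 3\int_\Omega \left[ \left( \sum_{i,j=1}^d a_{ij}(u - u^*)_{x_i x_j} \right)^2 + \left( \sum_{i=1}^d b_i (u - u^*)_{x_i} \right)^2 + \big(c(u - u^*)\big)^2 \right] dx\\
&\quad + \int_{\partial\Omega} e^2(u - u^*)^2\,ds\\
&\le 3\max\{2d^2\mathfrak{A}^2, d\mathfrak{B}^2, \mathfrak{C}^2\}\|u - u^*\|^2_{H^2(\Omega)} + |\partial\Omega|\mathfrak{E}^2\|u - u^*\|^2_{C(\partial\Omega)}.
\end{aligned}
\tag{2.6}
$$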
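On the other hand, by Lemma 2.1 and the inequality $\|\cdot\|_{H^{-3/2}(\Omega)} \le C\|\cdot\|_{L^2(\Omega)}$, we have for all $u \in H^{1/2}(\Omega)\cap L^2(\partial\Omega)$
$$
\begin{aligned}
\mathcal{L}(u) - \mathcal{L}(u^*)
&= \int_\Omega \left( -\sum_{i,j=1}^d a_{ij}(u - u^*)_{x_i x_j} + \sum_{i=1}^d b_i (u - u^*)_{x_i} + c(u - u^*) \right)^2 dx + \int_{\partial\Omega} e^2(u - u^*)^2\,ds\\
&\ge C\|u - u^*\|^2_{H^{1/2}(\Omega)}.
\end{aligned}
\tag{2.7}
$$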
The approximation error $\mathcal{E}_{app}$ describes the expressive power of the parameterized function class $\mathcal{P}$ in the $H^2(\Omega)$ and $C(\bar\Omega)$ norms, and corresponds to the approximation error in FEM known from Céa's lemma [7]. The statistical error $\mathcal{E}_{sta}$ is caused by the Monte Carlo discretization of $\mathcal{L}(\cdot)$ defined in (2.2) by $\widehat{\mathcal{L}}(\cdot)$ in (2.3).
3 Approximation error
We will choose $\mathcal{P}$ as a class of ReLU$^3$ networks, to ensure $\mathcal{P} \subset H^2(\Omega)\cap C(\bar\Omega)$. More precisely,
where the hyper-parameters $\mathcal{B} = 2\|u^*\|_{C^2(\bar\Omega)}$ and $\{\mathcal{D}, \mathcal{W}\}$ will be given in later discussions (i.e., Theorem 5.1) to ensure the desired accuracy. The ReLU$^3$ network refers to a neural
4 Statistical error
In this section we bound the statistical error for the parameterized function class $\mathcal{P}$, by estimating the Rademacher complexity of the non-Lipschitz composition of a ReLU$^3$ network and its partial derivatives. The technique used here may be helpful for analyzing the statistical errors of other deep PDE solvers, where the main difficulty is also to estimate the Rademacher complexity of a non-Lipschitz composition induced by the derivative operator.
First, by careful computation and the triangle inequality, we have the following lemma.
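Lemma 4.1.
$$
\mathbb{E}_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal{P}}\left|\mathcal{L}(u) - \widehat{\mathcal{L}}(u)\right| \le \sum_{j=1}^{13}\mathbb{E}_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal{P}}\left|\mathcal{L}_j(u) - \widehat{\mathcal{L}}_j(u)\right|,
$$
where
$$
\begin{aligned}
\mathcal{L}_1(u) &= |\Omega|\,\mathbb{E}_{X\sim U(\Omega)}\left( \sum_{i,j=1}^d a_{ij}(X)u_{x_i x_j}(X) \right)^2,\\
\mathcal{L}_2(u) &= |\Omega|\,\mathbb{E}_{X\sim U(\Omega)}\left( \sum_{i=1}^d b_i(X)u_{x_i}(X) \right)^2,\\
\mathcal{L}_8(u) &= 2|\Omega|\,\mathbb{E}_{X\sim U(\Omega)}\left( \sum_{i=1}^d b_i(X)u_{x_i}(X) \right)\cdot c(X)u(X),\\
\mathcal{L}_9(u) &= -2|\Omega|\,\mathbb{E}_{X\sim U(\Omega)}\left( \sum_{i=1}^d b_i(X)u_{x_i}(X) \right)\cdot f(X),\\
\mathcal{L}_{10}(u) &= -2|\Omega|\,\mathbb{E}_{X\sim U(\Omega)}\,c(X)u(X)f(X), \qquad \mathcal{L}_{11}(u) = |\partial\Omega|\,\mathbb{E}_{Y\sim U(\partial\Omega)}\big(e(Y)u(Y)\big)^2,\\
\mathcal{L}_{12}(u) &= |\partial\Omega|\,\mathbb{E}_{Y\sim U(\partial\Omega)}\big(g(Y)\big)^2, \qquad \mathcal{L}_{13}(u) = -2|\partial\Omega|\,\mathbb{E}_{Y\sim U(\partial\Omega)}\,e(Y)u(Y)g(Y),
\end{aligned}
$$
and $\widehat{\mathcal{L}}_j(u)$ is the discrete sample version of $\mathcal{L}_j(u)$, obtained by replacing the expectation with the sample average, $j = 1,\dots,13$.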
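where $\{\sigma_k\}_{k=1}^N$ are $N$ i.i.d. Rademacher variables with $\mathbb{P}(\sigma_k = 1) = \mathbb{P}(\sigma_k = -1) = \frac{1}{2}$.
Let $\Omega$ be a set and $\mathcal{F}$ be a function class which maps $\Omega$ to $\mathbb{R}$. Let $P$ be a probability distribution over $\Omega$ and $\{X_k\}_{k=1}^N$ be i.i.d. samples from $P$. The Rademacher complexity of $\mathcal{F}$ associated with the distribution $P$ and the sample size $N$ is defined by
$$
\mathfrak{R}_{P,N}(\mathcal{F}) = \mathbb{E}_{\{X_k,\sigma_k\}_{k=1}^N}\left[ \sup_{u\in\mathcal{F}} \frac{1}{N}\sum_{k=1}^N \sigma_k u(X_k) \right].
$$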
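Lemma 4.2. Let $\Omega$ be a set and $P$ be a probability distribution over $\Omega$. Let $N \in \mathbb{N}$. Assume that $w : \Omega \to \mathbb{R}$ and $|w(x)| \le \mathcal{B}$ for all $x \in \Omega$. Then for any function class $\mathcal{F}$ mapping $\Omega$ to $\mathbb{R}$, there holds
$$
\mathfrak{R}_{P,N}(w(x)\mathcal{F}) \le \mathcal{B}\,\mathfrak{R}_{P,N}(\mathcal{F}).
$$
Proof.
$$
\begin{aligned}
\mathfrak{R}_{P,N}(w(x)\cdot\mathcal{F})
&= \frac{1}{N}\mathbb{E}_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal{F}}\sum_{k=1}^N \sigma_k w(X_k)u(X_k)\\
&= \frac{1}{2N}\mathbb{E}_{\{X_k\}_{k=1}^N}\mathbb{E}_{\{\sigma_k\}_{k=2}^N}\sup_{u\in\mathcal{F}}\left[ w(X_1)u(X_1) + \sum_{k=2}^N \sigma_k w(X_k)u(X_k) \right]\\
&\quad + \frac{1}{2N}\mathbb{E}_{\{X_k\}_{k=1}^N}\mathbb{E}_{\{\sigma_k\}_{k=2}^N}\sup_{u\in\mathcal{F}}\left[ -w(X_1)u(X_1) + \sum_{k=2}^N \sigma_k w(X_k)u(X_k) \right]\\
&= \frac{1}{2N}\mathbb{E}_{\{X_k\}_{k=1}^N}\mathbb{E}_{\{\sigma_k\}_{k=2}^N}\sup_{u,u'\in\mathcal{F}}\left[ w(X_1)[u(X_1) - u'(X_1)] + \sum_{k=2}^N \sigma_k w(X_k)u(X_k) + \sum_{k=2}^N \sigma_k w(X_k)u'(X_k) \right]\\
&\le \frac{1}{2N}\mathbb{E}_{\{X_k\}_{k=1}^N}\mathbb{E}_{\{\sigma_k\}_{k=2}^N}\sup_{u,u'\in\mathcal{F}}\left[ \mathcal{B}\,|u(X_1) - u'(X_1)| + \sum_{k=2}^N \sigma_k w(X_k)u(X_k) + \sum_{k=2}^N \sigma_k w(X_k)u'(X_k) \right]\\
&= \frac{1}{2N}\mathbb{E}_{\{X_k\}_{k=1}^N}\mathbb{E}_{\{\sigma_k\}_{k=2}^N}\sup_{u,u'\in\mathcal{F}}\left[ \mathcal{B}\,[u(X_1) - u'(X_1)] + \sum_{k=2}^N \sigma_k w(X_k)u(X_k) + \sum_{k=2}^N \sigma_k w(X_k)u'(X_k) \right]\\
&= \frac{1}{N}\mathbb{E}_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal{F}}\left[ \sigma_1\mathcal{B}\,u(X_1) + \sum_{k=2}^N \sigma_k w(X_k)u(X_k) \right]\\
&\le \cdots \le \frac{\mathcal{B}}{N}\mathbb{E}_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal{F}}\sum_{k=1}^N \sigma_k u(X_k) = \mathcal{B}\,\mathfrak{R}_{P,N}(\mathcal{F}).
\end{aligned}
$$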
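To make the definition of the Rademacher complexity and the scaling property of Lemma 4.2 concrete, the following minimal sketch computes the empirical Rademacher complexity of a small finite function class on a fixed sample by enumerating all $2^N$ sign vectors, and compares the class with its multiple by a bounded weight. The sample points, the function class and the weight are hypothetical choices made only for illustration; the comparison is done conditionally on the sample, which is the form in which the peeling argument of the proof applies.

```python
import itertools
import math

def empirical_rademacher(values):
    """Exact E_sigma sup_u (1/N) sum_k sigma_k u(X_k) for a finite class.

    values[f][k] holds u_f(X_k); the expectation over sigma is computed by
    enumerating all 2^N sign vectors, so N must be small.
    """
    n = len(values[0])
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        total += max(sum(s * v for s, v in zip(signs, row)) for row in values) / n
    return total / 2 ** n

# Hypothetical fixed sample X_1, ..., X_8 in [0, 1] and a small function class.
xs = [0.05, 0.2, 0.35, 0.5, 0.65, 0.8, 0.95, 0.4]
functions = [
    lambda x: x,
    lambda x: -x,
    lambda x: x * x,
    lambda x: 1.0 - 2.0 * x,
    lambda x: 0.0,
]

# Hypothetical weight with |w(x)| <= bound for all x.
bound = 0.7
def weight(x):
    return bound * math.cos(3.0 * x)

values = [[u(x) for x in xs] for u in functions]
weighted = [[weight(x) * u(x) for x in xs] for u in functions]

rad_F = empirical_rademacher(values)
rad_wF = empirical_rademacher(weighted)
print(rad_F, rad_wF, bound * rad_F)
```

Replacing the exhaustive enumeration by random draws of the sign vector gives the usual Monte Carlo estimate when $N$ is too large to enumerate.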
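Lemma 4.3. Let $\{X_k\}_{k=1}^N$ be i.i.d. samples from $U(\Omega)$, then we have
$$
\mathbb{E}_{\{X_k\}_{k=1}^N}\sup_{u\in\mathcal{P}}\left|\mathcal{L}_j(u) - \widehat{\mathcal{L}}_j(u)\right| \le C_{d,\mathfrak{Z}}\,\mathfrak{R}_{U(\Omega),N}(\mathcal{F}_j), \qquad j = 1,\dots,13,
$$
where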
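$$
\begin{aligned}
&= \frac{|\Omega|}{N}\mathbb{E}_{\{X_k\}_{k=1}^N}\mathbb{E}_{\{\widetilde X_k\}_{k=1}^N}\mathbb{E}_{\{\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal{P}}\sum_{i,i',j,j'=1}^d\sum_{k=1}^N \sigma_k\Big[ a_{ij}(\widetilde X_k)a_{i'j'}(\widetilde X_k)u_{x_i x_j}(\widetilde X_k)u_{x_{i'} x_{j'}}(\widetilde X_k)\\
&\qquad\qquad - a_{ij}(X_k)a_{i'j'}(X_k)u_{x_i x_j}(X_k)u_{x_{i'} x_{j'}}(X_k) \Big]\\
&\le \frac{|\Omega|}{N}\mathbb{E}_{\{X_k,\widetilde X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal{P}}\sum_{i,i',j,j'=1}^d\sum_{k=1}^N \sigma_k\Big[ a_{ij}(\widetilde X_k)a_{i'j'}(\widetilde X_k)u_{x_i x_j}(\widetilde X_k)u_{x_{i'} x_{j'}}(\widetilde X_k) \Big]\\
&\quad + \frac{|\Omega|}{N}\mathbb{E}_{\{X_k,\widetilde X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal{P}}\sum_{i,i',j,j'=1}^d\sum_{k=1}^N \sigma_k\Big[ a_{ij}(X_k)a_{i'j'}(X_k)u_{x_i x_j}(X_k)u_{x_{i'} x_{j'}}(X_k) \Big]\\
&= \frac{2|\Omega|}{N}\mathbb{E}_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal{P}}\sum_{i,i',j,j'=1}^d\sum_{k=1}^N \sigma_k\Big[ a_{ij}(X_k)a_{i'j'}(X_k)u_{x_i x_j}(X_k)u_{x_{i'} x_{j'}}(X_k) \Big]\\
&\le \frac{2|\Omega|}{N}\sum_{i,i',j,j'=1}^d\mathbb{E}_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{u\in\mathcal{P}}\sum_{k=1}^N \sigma_k\Big[ a_{ij}(X_k)a_{i'j'}(X_k)u_{x_i x_j}(X_k)u_{x_{i'} x_{j'}}(X_k) \Big]\\
&= \frac{2|\Omega|}{N}\sum_{i,i',j,j'=1}^d\mathbb{E}_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{f_{i,i',j,j'}\in\mathcal{F}_1}\sum_{k=1}^N \sigma_k a_{ij}(X_k)a_{i'j'}(X_k)f_{i,i',j,j'}(X_k)\\
&\le \frac{2|\Omega|\mathfrak{A}^2}{N}\sum_{i,i',j,j'=1}^d\mathbb{E}_{\{X_k,\sigma_k\}_{k=1}^N}\sup_{f_{i,i',j,j'}\in\mathcal{F}_1}\sum_{k=1}^N \sigma_k f_{i,i',j,j'}(X_k)\\
&\le 2|\Omega|\mathfrak{A}^2 d^4\,\mathfrak{R}_{U(\Omega),N}(\mathcal{F}_1),
\end{aligned}
$$
where the second and eighth steps follow from Jensen's inequality and Lemma 4.2, and the third and seventh steps are due to the facts that the insertion of Rademacher variables does not change the distribution and that $\mathcal{F}_1$ is symmetric (i.e., if $f \in \mathcal{F}_1$, then $-f \in \mathcal{F}_1$), respectively.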
Definition 4.2. [3] Suppose that $W \subset \mathbb{R}^n$. For any $\epsilon > 0$, let $V \subset \mathbb{R}^n$ be an $\epsilon$-cover of $W$ with respect to the distance $d_\infty$, that is, for any $w \in W$, there exists a $v \in V$ such that $d_\infty(w,v) < \epsilon$, where $d_\infty$ is defined by $d_\infty(u,v) := \max_{1\le i\le n}|u_i - v_i|$. The covering number $\mathcal{C}(\epsilon, W, d_\infty)$ is defined to be the minimum cardinality among all $\epsilon$-covers of $W$ with respect to the distance $d_\infty$.
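The notion of an $\epsilon$-cover in the $d_\infty$ distance can be illustrated with a short sketch. The greedy procedure below is an assumption made for illustration, not a method from the paper: it builds some $\epsilon$-cover of a finite point set using centers from the set itself, so its size is only an upper bound on the covering number, which is defined as a minimum over all covers.

```python
def d_inf(u, v):
    """Sup-distance d_inf(u, v) = max_i |u_i - v_i| between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

def greedy_cover(points, eps):
    """Return centers such that every point is within d_inf < eps of a center.

    A point becomes a new center only if no existing center is already
    strictly closer than eps, so the result is an eps-cover of `points`.
    """
    centers = []
    for p in points:
        if all(d_inf(p, c) >= eps for c in centers):
            centers.append(p)
    return centers

# Hypothetical finite set W: an 11 x 11 grid in [0, 1]^2.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
eps = 0.25
cover = greedy_cover(grid, eps)
print(len(grid), len(cover))
```

Shrinking `eps` increases the number of centers, which is the growth that the entropy integral in the chaining bound accounts for.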
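Definition 4.3. [3] Suppose that $\mathcal{F}$ is a class of functions from $\Omega$ to $\mathbb{R}$. Given $n$ samples $Z_n = (Z_1, Z_2, \dots, Z_n) \in \Omega^n$, $\mathcal{F}|_{Z_n} \subset \mathbb{R}^n$ is defined by
$$
\mathcal{F}|_{Z_n} = \{(u(Z_1), u(Z_2), \dots, u(Z_n)) : u \in \mathcal{F}\}.
$$
The uniform covering number $\mathcal{C}_\infty(\epsilon, \mathcal{F}, n)$ is defined by
$$
\mathcal{C}_\infty(\epsilon, \mathcal{F}, n) = \max_{Z_n\in\Omega^n}\mathcal{C}(\epsilon, \mathcal{F}|_{Z_n}, d_\infty).
$$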
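Lemma 4.4. Let $\Omega$ be a set and $P$ be a probability distribution over $\Omega$. Let $N \in \mathbb{N}_{\ge 1}$. Let $\mathcal{F}$ be a class of functions from $\Omega$ to $\mathbb{R}$ such that $0 \in \mathcal{F}$ and the diameter of $\mathcal{F}$ is less than $\mathcal{B}$, i.e., $\|u\|_{L^\infty(\Omega)} \le \mathcal{B}$, $\forall u \in \mathcal{F}$. Then
$$
\mathfrak{R}_{P,N}(\mathcal{F}) \le \inf_{0<\delta<\mathcal{B}}\left( 4\delta + \frac{12}{\sqrt{N}}\int_\delta^{\mathcal{B}}\sqrt{\log\big(2\,\mathcal{C}_\infty(\epsilon, \mathcal{F}, N)\big)}\,d\epsilon \right).
$$
Proof. The proof is based on the chaining method, see [31].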
By Lemma 4.4, we have to bound the covering number, which can be upper bounded
via Pseudo-dimension [3].
Definition 4.4. Let $\mathcal{F}$ be a class of functions from $X$ to $\mathbb{R}$. Suppose that $S = \{x_1, x_2, \dots, x_n\} \subset X$. We say that $S$ is pseudo-shattered by $\mathcal{F}$ if there exist $y_1, y_2, \dots, y_n$ such that for any $b \in \{0,1\}^n$, there exists a $u \in \mathcal{F}$ satisfying
$$
\mathrm{sign}(u(x_i) - y_i) = b_i, \qquad i = 1, 2, \dots, n,
$$
and we say that $\{y_i\}_{i=1}^n$ witnesses the shattering. The pseudo-dimension of $\mathcal{F}$, denoted by $\mathrm{Pdim}(\mathcal{F})$, is defined to be the maximum cardinality among all sets pseudo-shattered by $\mathcal{F}$.
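Pseudo-shattering can be checked by brute force for a finite function class. The sketch below assumes the usual convention from [3] that the sign takes the value 1 when $u(x_i) \ge y_i$ and 0 otherwise; the class of affine functions with integer coefficients, the point sets and the witnesses are hypothetical choices for illustration.

```python
def is_pseudo_shattered(functions, points, witnesses):
    """Check whether `witnesses` witness the pseudo-shattering of `points`.

    Assumed convention: the bit for (x_i, y_i) is 1 if u(x_i) >= y_i and 0
    otherwise. The points are shattered if all 2^n bit patterns are realized.
    """
    patterns = {
        tuple(1 if u(x) >= y else 0 for x, y in zip(points, witnesses))
        for u in functions
    }
    return len(patterns) == 2 ** len(points)

# Hypothetical finite class: affine functions u(x) = a x + b with small integer coefficients.
coefficients = range(-2, 3)
affine = [lambda x, a=a, b=b: a * x + b for a in coefficients for b in coefficients]

two_points = is_pseudo_shattered(affine, [0.0, 1.0], [0.0, 0.0])
three_points = is_pseudo_shattered(affine, [0.0, 1.0, 2.0], [0.0, 0.0, 0.0])
print(two_points, three_points)
```

With zero witnesses, three collinear inputs cannot realize the pattern (1, 0, 1) with an affine function, since the middle value is the average of the outer two; this is why the second check fails for these witnesses.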
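The following proposition shows a relation between the uniform covering number and the pseudo-dimension.
Proposition 4.1 (Theorem 12.2 [3]). Let $\mathcal{F}$ be a class of real functions from a domain $X$ to the bounded interval $[0, \mathcal{B}]$. Let $\epsilon > 0$. Then
$$
\mathcal{C}_\infty(\epsilon, \mathcal{F}, n) \le \sum_{i=1}^{\mathrm{Pdim}(\mathcal{F})}\binom{n}{i}\left(\frac{\mathcal{B}}{\epsilon}\right)^i,
$$
which is less than $\left(\frac{en\mathcal{B}}{\epsilon\cdot\mathrm{Pdim}(\mathcal{F})}\right)^{\mathrm{Pdim}(\mathcal{F})}$ for $n \ge \mathrm{Pdim}(\mathcal{F})$.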
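Proof. For the activation function $\rho$ in each unit, we denote by $\widetilde\rho$ its derivative, i.e., $\widetilde\rho(x) = \rho'(x)$. We then have
$$
\widetilde\rho(x) = \begin{cases} 2\,\mathrm{ReLU}, & \rho = \mathrm{ReLU}^2,\\ 3\,\mathrm{ReLU}^2, & \rho = \mathrm{ReLU}^3. \end{cases}
$$
Let $1 \le i \le d$. We deal with the first two layers in detail and apply induction for layers $k \ge 3$, since the first two layers differ slightly from the others. For the first layer, we have for any $q = 1, 2, \dots, n_1$
$$
D_i u_q^{(1)} = D_i\rho_q^{(1)}\left( \sum_{j=1}^d a_{qj}^{(1)}x_j + b_q^{(1)} \right) = \widetilde{\rho_q^{(1)}}\left( \sum_{j=1}^d a_{qj}^{(1)}x_j + b_q^{(1)} \right)\cdot a_{qi}^{(1)}.
$$
Hence $D_i u_q^{(1)}$ can be implemented by a ReLU-ReLU$^2$-ReLU$^3$ network with depth 2 and width 1. For the second layer,
$$
D_i u_q^{(2)} = D_i\rho_q^{(2)}\left( \sum_{j=1}^{n_1} a_{qj}^{(2)}u_j^{(1)} + b_q^{(2)} \right) = \widetilde{\rho_q^{(2)}}\left( \sum_{j=1}^{n_1} a_{qj}^{(2)}u_j^{(1)} + b_q^{(2)} \right)\cdot\sum_{j=1}^{n_1} a_{qj}^{(2)}D_i u_j^{(1)}.
$$
Since $\widetilde{\rho_q^{(2)}}\left( \sum_{j=1}^{n_1} a_{qj}^{(2)}u_j^{(1)} + b_q^{(2)} \right)$ and $\sum_{j=1}^{n_1} a_{qj}^{(2)}D_i u_j^{(1)}$ can be implemented by two ReLU-ReLU$^2$-ReLU$^3$ subnetworks, respectively, and the multiplication can also be implemented by
$$
\begin{aligned}
x\cdot y &= \frac{1}{4}\left[ (x+y)^2 - (x-y)^2 \right]\\
&= \frac{1}{4}\left[ \mathrm{ReLU}^2(x+y) + \mathrm{ReLU}^2(-x-y) - \mathrm{ReLU}^2(x-y) - \mathrm{ReLU}^2(-x+y) \right],
\end{aligned}
\tag{4.1}
$$
we conclude that $D_i u_q^{(2)}$ can be implemented by a ReLU-ReLU$^2$-ReLU$^3$ network. We have
$$
\mathcal{D}\left( \widetilde{\rho_q^{(2)}}\left( \sum_{j=1}^{n_1} a_{qj}^{(2)}u_j^{(1)} + b_q^{(2)} \right) \right) = 3, \qquad \mathcal{W}\left( \widetilde{\rho_q^{(2)}}\left( \sum_{j=1}^{n_1} a_{qj}^{(2)}u_j^{(1)} + b_q^{(2)} \right) \right) \le \mathcal{W}
$$
and
$$
\mathcal{D}\left( \sum_{j=1}^{n_1} a_{qj}^{(2)}D_i u_j^{(1)} \right) = 2, \qquad \mathcal{W}\left( \sum_{j=1}^{n_1} a_{qj}^{(2)}D_i u_j^{(1)} \right) \le \mathcal{W}.
$$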
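Thus $\mathcal{D}\big(D_i u_q^{(2)}\big) = 4$, $\mathcal{W}\big(D_i u_q^{(2)}\big) \le \max\{2\mathcal{W}, 4\}$. Now we apply induction for layers $k \ge 3$. For the third layer,
$$
D_i u_q^{(3)} = D_i\rho_q^{(3)}\left( \sum_{j=1}^{n_2} a_{qj}^{(3)}u_j^{(2)} + b_q^{(3)} \right) = \widetilde{\rho_q^{(3)}}\left( \sum_{j=1}^{n_2} a_{qj}^{(3)}u_j^{(2)} + b_q^{(3)} \right)\cdot\sum_{j=1}^{n_2} a_{qj}^{(3)}D_i u_j^{(2)}.
$$
Since
$$
\mathcal{D}\left( \widetilde{\rho_q^{(3)}}\left( \sum_{j=1}^{n_2} a_{qj}^{(3)}u_j^{(2)} + b_q^{(3)} \right) \right) = 4, \qquad \mathcal{W}\left( \widetilde{\rho_q^{(3)}}\left( \sum_{j=1}^{n_2} a_{qj}^{(3)}u_j^{(2)} + b_q^{(3)} \right) \right) \le \mathcal{W}
$$
and
$$
\mathcal{D}\left( \sum_{j=1}^{n_2} a_{qj}^{(3)}D_i u_j^{(2)} \right) = 4, \qquad \mathcal{W}\left( \sum_{j=1}^{n_2} a_{qj}^{(3)}D_i u_j^{(2)} \right) \le \max\{2\mathcal{W}, 4\mathcal{W}\} = 4\mathcal{W},
$$
we conclude that $D_i u_q^{(3)}$ can be implemented by a ReLU-ReLU$^2$-ReLU$^3$ network and
$$
\mathcal{D}\big(D_i u_q^{(3)}\big) = 5, \qquad \mathcal{W}\big(D_i u_q^{(3)}\big) \le \max\{5\mathcal{W}, 4\} = 5\mathcal{W}.
$$
We assume that $D_i u_q^{(k)}$ ($q = 1, 2, \dots, n_k$) can be implemented by a ReLU-ReLU$^2$-ReLU$^3$ network and that $\mathcal{D}\big(D_i u_q^{(k)}\big) = k+2$, $\mathcal{W}\big(D_i u_q^{(k)}\big) \le (k+2)\mathcal{W}$. For the $(k+1)$-th layer,
$$
\begin{aligned}
D_i u_q^{(k+1)} &= D_i\rho_q^{(k+1)}\left( \sum_{j=1}^{n_k} a_{qj}^{(k+1)}u_j^{(k)} + b_q^{(k+1)} \right)\\
&= \widetilde{\rho_q^{(k+1)}}\left( \sum_{j=1}^{n_k} a_{qj}^{(k+1)}u_j^{(k)} + b_q^{(k+1)} \right)\cdot\sum_{j=1}^{n_k} a_{qj}^{(k+1)}D_i u_j^{(k)}.
\end{aligned}
$$
Since
$$
\begin{aligned}
\mathcal{D}\left( \widetilde{\rho_q^{(k+1)}}\left( \sum_{j=1}^{n_k} a_{qj}^{(k+1)}u_j^{(k)} + b_q^{(k+1)} \right) \right) &= k+2,\\
\mathcal{W}\left( \widetilde{\rho_q^{(k+1)}}\left( \sum_{j=1}^{n_k} a_{qj}^{(k+1)}u_j^{(k)} + b_q^{(k+1)} \right) \right) &\le \mathcal{W}
\end{aligned}
$$
and
$$
\begin{aligned}
\mathcal{D}\left( \sum_{j=1}^{n_k} a_{qj}^{(k+1)}D_i u_j^{(k)} \right) &= k+2,\\
\mathcal{W}\left( \sum_{j=1}^{n_k} a_{qj}^{(k+1)}D_i u_j^{(k)} \right) &\le \max\{(k+2)\mathcal{W}, 4\mathcal{W}\} = (k+2)\mathcal{W},
\end{aligned}
$$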
we conclude that $D_i u_q^{(k+1)}$ can be implemented by a ReLU-ReLU$^2$-ReLU$^3$ network and
$$
\mathcal{D}\big(D_i u_q^{(k+1)}\big) = k+3, \qquad \mathcal{W}\big(D_i u_q^{(k+1)}\big) \le \max\{(k+3)\mathcal{W}, 4\} = (k+3)\mathcal{W}.
$$
Hence we derive that $D_i u = D_i u_1^{(\mathcal{D})}$ can be implemented by a ReLU-ReLU$^2$-ReLU$^3$ network and $\mathcal{D}(D_i u) = \mathcal{D}+2$, $\mathcal{W}(D_i u) \le (\mathcal{D}+2)\mathcal{W}$. Moreover, our argument shows that the neural networks implementing $\{D_i u\}_{i=1}^d$ have the same architecture.
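The two elementary facts used in the proof, namely that the derivative of ReLU$^3$ is $3\,\mathrm{ReLU}^2$ and that a product can be written with four ReLU$^2$ units as in (4.1), can be checked numerically. The sketch below is only an illustration with hypothetical helper names; the derivative is compared against a central finite difference.

```python
def relu(t):
    return max(0.0, t)

def relu2(t):
    return relu(t) ** 2

def relu3(t):
    return relu(t) ** 3

def product_via_relu2(x, y):
    """Identity (4.1): x*y written with four ReLU^2 units, using t^2 = ReLU^2(t) + ReLU^2(-t)."""
    return 0.25 * (relu2(x + y) + relu2(-x - y) - relu2(x - y) - relu2(-x + y))

def relu3_derivative(t):
    """Derivative of ReLU^3, which equals 3 * ReLU^2."""
    return 3.0 * relu2(t)

def central_difference(func, t, h=1e-5):
    return (func(t + h) - func(t - h)) / (2.0 * h)

pairs = [(0.3, -1.2), (2.0, 3.5), (-0.7, -0.4), (0.0, 1.1)]
product_errors = [abs(product_via_relu2(x, y) - x * y) for x, y in pairs]

points = [-1.5, 0.0, 0.3, 2.0]
derivative_errors = [abs(central_difference(relu3, t) - relu3_derivative(t)) for t in points]
print(max(product_errors), max(derivative_errors))
```

Because each derivative of a layer is a product of two such subnetworks, this identity is what allows the derivative network to stay within the ReLU-ReLU$^2$-ReLU$^3$ family.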
Lemma 4.5 (Theorem 8.3 in [3]). Let $p_1, \dots, p_m$ be polynomials with $n$ variables of degree at most $d$. If $n \le m$, then
$$
\left|\{(\mathrm{sign}(p_1(x)), \dots, \mathrm{sign}(p_m(x))) : x \in \mathbb{R}^n\}\right| \le 2\left(\frac{2emd}{n}\right)^n.
$$
Proof. The argument follows the proof of Theorem 6 in [4]. The result stated here is somewhat stronger than Theorem 6 in [4], since $\mathrm{VCdim}(\mathrm{sign}(\mathcal{F})) \le \mathrm{Pdim}(\mathcal{F})$ for any function class $\mathcal{F}$. We consider a new set of functions
It is clear that $\mathrm{Pdim}(\mathcal{N}(\mathcal{D}, \mathcal{W}, \{\mathrm{ReLU}, \mathrm{ReLU}^2, \mathrm{ReLU}^3\})) \le \mathrm{VCdim}(\widetilde{\mathcal{N}})$. We now bound the VC-dimension of $\widetilde{\mathcal{N}}$. Denoting by $M$ the total number of parameters (weights and biases) in the neural network implementing functions in $\mathcal{N}$, in our case we want to derive a uniform bound for
over all $\{x_i\}_{i=1}^m \subset X$ and $\{y_i\}_{i=1}^m \subset \mathbb{R}$. Actually, the maximum of $K_{\{x_i\},\{y_i\}}(m)$ over all $\{x_i\}_{i=1}^m \subset X$ and $\{y_i\}_{i=1}^m \subset \mathbb{R}$ is the growth function $G_{\widetilde{\mathcal{N}}}(m)$. In order to apply Lemma 4.5, we partition the parameter space $\mathbb{R}^M$ into several subsets to ensure that in each subset $u(x_i, a) - y_i$ is a polynomial with respect to $a$ without any breakpoints. In fact, our partition is exactly the same as the partition in [4]. Denote the partition by $\{P_1, P_2, \dots, P_N\}$ with some integer $N$ satisfying
$$
N \le \prod_{i=1}^{\mathcal{D}-1} 2\left(\frac{2emk_i\big(1 + (i-1)3^{i-1}\big)}{M_i}\right)^{M_i},
\tag{4.2}
$$
where $k_i$ and $M_i$ denote the number of units at the $i$th layer and the total number of parameters at the inputs to units in all the layers up to layer $i$ of the neural network implementing functions in $\mathcal{N}$, respectively. See [4] for the construction of the partition. Obviously we have
$$
K_{\{x_i\},\{y_i\}}(m) \le \sum_{i=1}^N \left|\{(\mathrm{sign}(u(x_1, a) - y_1), \dots, \mathrm{sign}(u(x_m, a) - y_m)) : a \in P_i\}\right|.
\tag{4.3}
$$
Note that $u(x_i, a) - y_i$ is a polynomial with respect to $a$ whose degree is the same as the degree of $u(x_i, a)$, which is equal to $1 + (\mathcal{D}-1)3^{\mathcal{D}-1}$ as shown in [4]. Hence by Lemma 4.5, we have
$$
K_{\{x_i\},\{y_i\}}(m) \le \prod_{i=1}^{\mathcal{D}} 2\left(\frac{2emk_i\big(1 + (i-1)3^{i-1}\big)}{M_i}\right)^{M_i}.
$$
With the above preparations, we are able to derive our result on the statistical error.
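Theorem 4.1. Let $\mathcal{D}, \mathcal{W} \in \mathbb{N}$, $\mathcal{B} \in \mathbb{R}^+$. For any $\epsilon > 0$, if the number of samples
$$
N, M = C(d, \Omega, \mathfrak{Z}, \mathcal{B})\,\mathcal{D}^6\mathcal{W}^2(\mathcal{D} + \log\mathcal{W})\left(\frac{1}{\epsilon}\right)^{2+\delta},
$$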
Proof. By Proposition 4.2, we know that if $u$ is a ReLU$^3$ network with depth $\mathcal{D}$ and width $\mathcal{W}$, then $u_{x_i}$ can be implemented by a ReLU$^2$-ReLU$^3$ network with depth $\mathcal{D}+2$ and width $(\mathcal{D}+2)\mathcal{W}$. Then, by Proposition 4.2 again, $u_{x_i x_j}$ can be implemented by a ReLU-ReLU$^2$-ReLU$^3$ network with depth $\mathcal{D}+4$ and width $(\mathcal{D}+2)(\mathcal{D}+4)\mathcal{W}$. These facts combined with (4.1) yield the results.
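By Lemma 4.4 and Proposition 4.1, we have for $i = 1, 2, \dots, 10$,
$$
\begin{aligned}
\mathfrak{R}_{U(\Omega),N}(\mathcal{F}_i)
&\le \inf_{0<\delta<\mathcal{B}_i}\left( 4\delta + \frac{12}{\sqrt{N}}\int_\delta^{\mathcal{B}_i}\sqrt{\log\big(2\,\mathcal{C}_\infty(\epsilon, \mathcal{F}_i, N)\big)}\,d\epsilon \right)\\
&\le \inf_{0<\delta<\mathcal{B}_i}\left( 4\delta + \frac{12}{\sqrt{N}}\int_\delta^{\mathcal{B}_i}\sqrt{\log\left( 2\left(\frac{eN\mathcal{B}_i}{\epsilon\cdot\mathrm{Pdim}(\mathcal{F}_i)}\right)^{\mathrm{Pdim}(\mathcal{F}_i)} \right)}\,d\epsilon \right)\\
&\le \inf_{0<\delta<\mathcal{B}_i}\left( 4\delta + \frac{12\mathcal{B}_i}{\sqrt{N}} + \frac{12}{\sqrt{N}}\int_\delta^{\mathcal{B}_i}\sqrt{\mathrm{Pdim}(\mathcal{F}_i)\cdot\log\left(\frac{eN\mathcal{B}_i}{\epsilon\cdot\mathrm{Pdim}(\mathcal{F}_i)}\right)}\,d\epsilon \right).
\end{aligned}
\tag{4.5}
$$
Now we calculate the integral. Let $t = \sqrt{\log\left(\frac{eN\mathcal{B}_i}{\epsilon\cdot\mathrm{Pdim}(\mathcal{F}_i)}\right)}$, then $\epsilon = \frac{eN\mathcal{B}_i}{\mathrm{Pdim}(\mathcal{F}_i)}\cdot e^{-t^2}$. Denoting
$$
t_1 = \sqrt{\log\left(\frac{eN\mathcal{B}_i}{\mathcal{B}_i\cdot\mathrm{Pdim}(\mathcal{F}_i)}\right)}, \qquad t_2 = \sqrt{\log\left(\frac{eN\mathcal{B}_i}{\delta\cdot\mathrm{Pdim}(\mathcal{F}_i)}\right)},
$$
we have
$$
\begin{aligned}
\int_\delta^{\mathcal{B}_i}\sqrt{\log\left(\frac{eN\mathcal{B}_i}{\epsilon\cdot\mathrm{Pdim}(\mathcal{F}_i)}\right)}\,d\epsilon
&= \frac{2eN\mathcal{B}_i}{\mathrm{Pdim}(\mathcal{F}_i)}\int_{t_1}^{t_2} t^2e^{-t^2}\,dt\\
&= \frac{2eN\mathcal{B}_i}{\mathrm{Pdim}(\mathcal{F}_i)}\int_{t_1}^{t_2} t\left(\frac{-e^{-t^2}}{2}\right)'dt\\
&= \frac{eN\mathcal{B}_i}{\mathrm{Pdim}(\mathcal{F}_i)}\left[ t_1e^{-t_1^2} - t_2e^{-t_2^2} + \int_{t_1}^{t_2} e^{-t^2}\,dt \right]\\
&\le \frac{eN\mathcal{B}_i}{\mathrm{Pdim}(\mathcal{F}_i)}\left[ t_1e^{-t_1^2} - t_2e^{-t_2^2} + (t_2 - t_1)e^{-t_1^2} \right]\\
&\le \frac{eN\mathcal{B}_i}{\mathrm{Pdim}(\mathcal{F}_i)}\cdot t_2e^{-t_1^2} = \mathcal{B}_i\sqrt{\log\left(\frac{eN\mathcal{B}_i}{\delta\cdot\mathrm{Pdim}(\mathcal{F}_i)}\right)}.
\end{aligned}
\tag{4.6}
$$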
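The bound (4.6) can be sanity-checked numerically. The sketch below approximates the integral on the left-hand side with a midpoint rule and compares it with the closed-form upper bound $\mathcal{B}_i\sqrt{\log(eN\mathcal{B}_i/(\delta\cdot\mathrm{Pdim}))}$. The values of $N$, the pseudo-dimension and $\mathcal{B}_i$ are hypothetical, and $\delta$ is set to $\mathcal{B}_i(\mathrm{Pdim}/N)^{1/2}$, the choice made in the text right after (4.6).

```python
import math

def entropy_integral(delta, bound, n_samples, pdim, steps=20000):
    """Midpoint-rule approximation of int_delta^bound sqrt(log(e N B / (eps Pdim))) d eps."""
    h = (bound - delta) / steps
    total = 0.0
    for k in range(steps):
        eps = delta + (k + 0.5) * h
        total += math.sqrt(math.log(math.e * n_samples * bound / (eps * pdim)))
    return total * h

def closed_form_bound(delta, bound, n_samples, pdim):
    """Right-hand side of (4.6): B * sqrt(log(e N B / (delta Pdim)))."""
    return bound * math.sqrt(math.log(math.e * n_samples * bound / (delta * pdim)))

# Hypothetical values for illustration (N >= Pdim keeps the logarithm nonnegative).
n_samples, pdim, bound = 10000, 50, 2.0
delta = bound * math.sqrt(pdim / n_samples)

integral = entropy_integral(delta, bound, n_samples, pdim)
upper = closed_form_bound(delta, bound, n_samples, pdim)
print(integral, upper)
```

The inequality is also visible directly: the integrand is decreasing in $\epsilon$, so the integral is at most the interval length times the value of the integrand at $\epsilon = \delta$.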
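Combining (4.5) and (4.6) and choosing $\delta = \mathcal{B}_i\left(\frac{\mathrm{Pdim}(\mathcal{F}_i)}{N}\right)^{1/2} \le \mathcal{B}_i$, we get for $i = 1, 2, 3, 5, 6, 7, 8, 9, 10$,
$$
\begin{aligned}
\mathfrak{R}_{U(\Omega),N}(\mathcal{F}_i)
&\le \inf_{0<\delta<\mathcal{B}_i}\left( 4\delta + \frac{12\mathcal{B}_i}{\sqrt{N}} + \frac{12}{\sqrt{N}}\int_\delta^{\mathcal{B}_i}\sqrt{\mathrm{Pdim}(\mathcal{F}_i)\cdot\log\left(\frac{eN\mathcal{B}_i}{\epsilon\,\mathrm{Pdim}(\mathcal{F}_i)}\right)}\,d\epsilon \right)\\
&\le \inf_{0<\delta<\mathcal{B}_i}\left( 4\delta + \frac{12\mathcal{B}_i}{\sqrt{N}} + \frac{12\mathcal{B}_i\sqrt{\mathrm{Pdim}(\mathcal{F}_i)}}{\sqrt{N}}\sqrt{\log\left(\frac{eN\mathcal{B}_i}{\delta\cdot\mathrm{Pdim}(\mathcal{F}_i)}\right)} \right)\\
&\le 28\sqrt{\frac{3}{2}}\,\mathcal{B}_i\left(\frac{\mathrm{Pdim}(\mathcal{F}_i)}{N}\right)^{1/2}\sqrt{\log\left(\frac{eN}{\mathrm{Pdim}(\mathcal{F}_i)}\right)}\\
&\le 28\sqrt{\frac{3}{2}}\,\mathcal{B}_i\left(\frac{\mathrm{Pdim}(\mathcal{N}_i)}{N}\right)^{1/2}\sqrt{\log\left(\frac{eN}{\mathrm{Pdim}(\mathcal{N}_i)}\right)}\\
&\le 28\sqrt{\frac{3}{2}}\,\max\{\mathcal{B}, \mathcal{B}^2\}\left(\frac{\mathcal{H}}{N}\right)^{1/2}\sqrt{\log\left(\frac{eN}{\mathcal{H}}\right)},
\end{aligned}
\tag{4.7}
$$
with
$$
\mathcal{H} = 4C(\mathcal{D}+2)^2(\mathcal{D}+4)^2(\mathcal{D}+5)^2\mathcal{W}^2\big(\mathcal{D} + 5 + \log((\mathcal{D}+2)(\mathcal{D}+4)\mathcal{W})\big),
$$
where in the fourth step we apply Lemma 4.6 and in the last step we use Proposition 4.3.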
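Similarly, for $i = 11, 13$,
$$
\mathfrak{R}_{U(\partial\Omega),M}(\mathcal{F}_i) \le 28\sqrt{\frac{3}{2}}\,\max\{\mathcal{B}, \mathcal{B}^2\}\left(\frac{\mathcal{H}}{M}\right)^{1/2}\sqrt{\log\left(\frac{eM}{\mathcal{H}}\right)}.
\tag{4.8}
$$
Obviously, $\mathfrak{R}_{U(\Omega),N}(\mathcal{F}_4)$ and $\mathfrak{R}_{U(\partial\Omega),M}(\mathcal{F}_{12})$ can be bounded by the right-hand sides of (4.7) and (4.8), respectively. Combining Lemmas 4.1 and 4.3 with (4.7) and (4.8), we have
$$
\begin{aligned}
&\mathbb{E}_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal{P}}\left|\mathcal{L}(u) - \widehat{\mathcal{L}}(u)\right|\\
&\le \sum_{j=1}^{13}\mathbb{E}_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal{P}}\left|\mathcal{L}_j(u) - \widehat{\mathcal{L}}_j(u)\right|\\
&\le 28\sqrt{\frac{3}{2}}\,\max\{\mathcal{B}, \mathcal{B}^2\}\left( 40|\Omega|C_1\left(\frac{\mathcal{H}}{N}\right)^{1/2}\sqrt{\log\left(\frac{eN}{\mathcal{H}}\right)} + 12|\partial\Omega|C_2\left(\frac{\mathcal{H}}{M}\right)^{1/2}\sqrt{\log\left(\frac{eM}{\mathcal{H}}\right)} \right),
\end{aligned}
$$
where
$$
\begin{aligned}
C_1 &= \max\{\mathfrak{A}^2d^4, \mathfrak{B}^2d^2, \mathfrak{C}^2d^2, \mathfrak{F}^2, \mathfrak{A}\mathfrak{B}d^3, \mathfrak{A}\mathfrak{C}d^2, \mathfrak{A}\mathfrak{F}d^2, \mathfrak{B}\mathfrak{C}d, \mathfrak{B}\mathfrak{F}d, \mathfrak{C}\mathfrak{F}\},\\
C_2 &= \max\{\mathfrak{E}^2, \mathfrak{G}^2, \mathfrak{E}\mathfrak{G}\}.
\end{aligned}
$$
Hence for any $\epsilon > 0$, if the number of samples
$$
N, M = C(d, \Omega, \mathfrak{Z}, \mathcal{B})\,\mathcal{D}^6\mathcal{W}^2(\mathcal{D} + \log\mathcal{W})\left(\frac{1}{\epsilon}\right)^{2+\delta}
$$
with $\delta$ being an arbitrarily small positive number, then we have
$$
\mathbb{E}_{\{X_k\}_{k=1}^N,\{Y_k\}_{k=1}^M}\sup_{u\in\mathcal{P}}\left|\mathcal{L}(u) - \widehat{\mathcal{L}}(u)\right| \le \epsilon.
$$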
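Proof. For any $\epsilon > 0$, by Theorem 3.1, there exists a neural network function $u_\phi$ with depth $\lceil\log_2 d\rceil + 2$ and width $C(d, \Omega, \mathfrak{Z}, \|u^*\|_{C^3(\bar\Omega)})\left(\frac{1}{\epsilon}\right)^d$ such that
$$
\|u^* - u_\phi\|_{C^2(\bar\Omega)} \le \left( \frac{\epsilon^2}{3C(d^2+3d+4)|\Omega|\max\{2d^2\mathfrak{A}^2, d\mathfrak{B}^2, \mathfrak{C}^2\} + 2C|\partial\Omega|\mathfrak{E}^2} \right)^{1/2}.
$$
$$
\|u_\phi\|_{C^2(\bar\Omega)} \le \|u^* - u_\phi\|_{C^2(\bar\Omega)} + \|u^*\|_{C^2(\bar\Omega)} \le 2\|u^*\|_{C^2(\bar\Omega)}.
$$
And
$$
\mathcal{E}_{app} \le \frac{3}{2}(d^2+3d+4)|\Omega|\max\{2d^2\mathfrak{A}^2, d\mathfrak{B}^2, \mathfrak{C}^2\}\|u_\phi - u^*\|^2_{C^2(\bar\Omega)} + |\partial\Omega|\mathfrak{E}^2\|u_\phi - u^*\|^2_{C^2(\bar\Omega)} \le \frac{\epsilon^2}{2C}.
\tag{5.1}
$$
By Theorem 4.1, when the number of samples
$$
N, M = C(d, \Omega, \mathfrak{Z}, \mathcal{B})\,\mathcal{D}^6\mathcal{W}^2(\mathcal{D} + \log\mathcal{W})\left(\frac{1}{\epsilon}\right)^{4+\delta} = C(d, \Omega, \mathfrak{Z}, \|u^*\|_{C^3(\bar\Omega)})\left(\frac{1}{\epsilon}\right)^{2d+4+\delta}
$$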
Remark 5.1. Asymptotic convergence results for PINNs have been studied in [21, 26, 27], i.e., when the number of parameters in the neural networks and the number of training samples go to infinity, the solution of PINNs converges to the solution of the PDE. Our work establishes a nonasymptotic convergence rate of PINNs. According to our results in Theorem 5.1, the influence of the depth and width of the neural networks and of the number of training samples is characterized quantitatively. This answers the question of how to choose the hyper-parameters to achieve a desired accuracy, which is missing in [21, 26, 27].
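To see what these rates mean in practice, the following minimal sketch evaluates the scalings stated in the contributions for Theorem 5.1, namely depth $\lceil\log_2 d\rceil + 2$, width of order $(1/\epsilon)^d$ and sample size of order $(1/\epsilon)^{2d+4+\delta}$. All constants hidden in the $O(\cdot)$ notation are set to 1, which is an assumption made purely for illustration, so the numbers indicate growth rates and not actual requirements; the function name is hypothetical.

```python
import math

def pinn_hyperparameter_scaling(d, eps, delta=0.1):
    """Orders of magnitude suggested by the stated rates, with all constants set to 1.

    d     : dimension of the domain
    eps   : target accuracy in the H^{1/2} norm
    delta : the arbitrarily small positive exponent in the sample bound
    """
    depth = math.ceil(math.log2(d)) + 2
    width = (1.0 / eps) ** d
    samples = (1.0 / eps) ** (2 * d + 4 + delta)
    return depth, width, samples

for dim in (2, 3, 5):
    depth, width, samples = pinn_hyperparameter_scaling(dim, eps=0.1)
    print(dim, depth, f"{width:.1e}", f"{samples:.1e}")
```

The exponential dependence of the width and of the sample size on $d$ is visible already for moderate dimensions.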
Remark 5.2. As shown in Theorem 5.1, PINNs suffer from the curse of dimensionality. However, if no additional smoothness or low-dimensional compositional structure is imposed on the underlying solutions of the PDEs, the curse is unavoidable. Recently, a minimax lower bound for PINNs was proved in [18], where it is shown that, to achieve an error of $\epsilon$ when PINNs are used to solve elliptic equations defined on $[0,1]^d$ whose solution lives in $H^1([0,1]^d)$, the number of samples $n$ is lower bounded by $O((1/\epsilon)^d)$. In [19], the authors reduce the curse by assuming that the underlying solutions live in Barron spaces, which are much smaller than $H^1$. One can also exploit the structure of the solution to reduce the curse, for example by considering PDEs whose solutions are compositions of functions with a number of variables much smaller than $d$.
6 Conclusion
This paper provides an analysis of the convergence rate of PINNs. Our results show how to set the depth and width of the networks to achieve a desired convergence rate in terms of the number of training samples. The estimate of the approximation error of deep ReLU$^3$ networks is established in the $C^2$ norm. The statistical error is derived by bounding the Rademacher complexity of the non-Lipschitz composition with the ReLU$^3$ network. It would be interesting to extend the current analysis of PINNs to PDEs with other boundary conditions or to optimal control (inverse) problems.
Acknowledgments
This work is supported by the National Key Research and Development Program of China (No. 2020YFA0714200), by the National Science Foundation of China (No. 12125103, No. 12071362, No. 11971468, No. 11871474, No. 11871385), by the Natural Science Foundation of Hubei Province (No. 2021AAA010, No. 2019CFA007), and by the Fundamental Research Funds for the Central Universities. The numerical calculations have been done at the Supercomputing Center of Wuhan University.
References
[1] S. Agmon, A. Douglis, and L. Nirenberg, Estimates near the boundary for solutions of elliptic partial differential equations satisfying general boundary conditions. I, Communications on Pure and Applied Mathematics, 12 (1959), pp. 623–727.
[2] C. Anitescu, E. Atroshchenko, N. Alajlan, and T. Rabczuk, Artificial neural network methods for the solution of second order boundary value problems, CMC–Computers Materials & Continua, 59 (2019), pp. 345–359.
[3] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, 2009.
[4] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian, Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks, J. Mach. Learn. Res., 20 (2019), pp. 1–17.
[5] J. Berner, M. Dablander, and P. Grohs, Numerically solving parametric families of high-dimensional Kolmogorov partial differential equations via deep learning, in Advances in Neural Information Processing Systems, vol. 33, Curran Associates, Inc., 2020, pp. 16615–16627.
[6] S. Brenner and R. Scott, The Mathematical Theory of Finite Element Methods, vol. 15, Springer Science & Business Media, 2007.
[7] P. G. Ciarlet, The Finite Element Method for Elliptic Problems, SIAM, 2002.
[8] C. Duan, Y. Jiao, Y. Lai, X. Lu, and Z. Yang, Convergence rate analysis for deep Ritz method, Communications in Computational Physics, (2022).
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial networks, Advances in Neural Information Processing Systems, 3 (2014).
[10] J. Han, A. Jentzen, and E. Weinan, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.
[11] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[12] Q. Hong, J. W. Siegel, and J. Xu, Rademacher complexity and numerical quadrature analysis of stable neural networks with applications to numerical PDEs, arXiv preprint arXiv:2104.02903, (2021).
[13] T. J. Hughes, The Finite Element Method: Linear Static and Dynamic Finite Element Analysis, Courier Corporation, 2012.
[14] A. D. Jagtap, E. Kharazmi, and G. E. Karniadakis, Conservative physics-informed neural networks on discrete domains for conservation laws: Applications to forward and inverse problems, Computer Methods in Applied Mechanics and Engineering, 365 (2020), p. 113028.
[15] Y. Jiao, L. Kang, Y. Lai, and Y. Xu, Approximation of deep ReLU$^k$ neural network in Sobolev spaces, (2022).
[16] I. E. Lagaris, A. Likas, and D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Transactions on Neural Networks, 9 (1998), pp. 987–1000.
[17] L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis, DeepXDE: A deep learning library for solving differential equations, SIAM Review, 63 (2021), pp. 208–228.
[18] Y. Lu, H. Chen, J. Lu, L. Ying, and J. Blanchet, Machine learning for elliptic PDEs: Fast rate generalization bound, neural scaling law and minimax optimality, in ICLR, 2021.
[19] Y. Lu, J. Lu, and M. Wang, A priori generalization analysis of the deep Ritz method for solving high dimensional elliptic partial differential equations, in Conference on Learning Theory, PMLR, 2021, pp. 3196–3241.
[20] T. Luo and H. Yang, Two-layer neural networks for partial differential equations: Optimization and generalization theory, arXiv preprint arXiv:2006.15733, (2020).
[21] S. Mishra and R. Molinaro, Estimates on the generalization error of physics-informed neural networks for approximating a class of inverse problems for PDEs, IMA Journal of Numerical Analysis, (2021).
[22] G. Pang, M. D'Elia, M. Parks, and G. Karniadakis, nPINNs: Nonlocal physics-informed neural networks for a parametrized nonlocal universal Laplacian operator. Algorithms and applications, Journal of Computational Physics, 422 (2020), p. 109760.
[23] G. Pang, L. Lu, and G. E. Karniadakis, fPINNs: Fractional physics-informed neural networks, SIAM Journal on Scientific Computing, 41 (2019), pp. A2603–A2626.
[24] A. Quarteroni and A. Valli, Numerical Approximation of Partial Differential Equations, vol. 23, Springer Science & Business Media, 2008.
[25] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 378 (2019), pp. 686–707.
[26] Y. Shin, J. Darbon, and G. E. Karniadakis, On the convergence of physics informed neural networks for linear second-order elliptic and parabolic type PDEs, arXiv preprint arXiv:2004.01806, (2020).
[27] Y. Shin, Z. Zhang, and G. E. Karniadakis, Error estimates of residual minimization using neural networks for linear PDEs, arXiv preprint arXiv:2010.08019, (2020).
[28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529 (2016), pp. 484–489.
[29] J. A. Sirignano and K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, Journal of Computational Physics, 375 (2018), pp. 1339–1364.
[30] J. Thomas, Numerical Partial Differential Equations: Finite Difference Methods, vol. 22, Springer Science & Business Media, 2013.
[31] A. W. van der Vaart and J. A. Wellner, Weak convergence, in Weak Convergence and Empirical Processes, Springer, 1996.
[32] E. Weinan and B. Yu, The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics, 6 (2017), pp. 1–12.
[33] Y. Zang, G. Bao, X. Ye, and H. Zhou, Weak adversarial networks for high-dimensional partial differential equations, Journal of Computational Physics, 411 (2020), p. 109409.