
Sparsely constrained neural networks for model discovery of PDEs

Gert-Jan Both
Centre de Recherche Interdisciplinaire (CRI), Inserm
Paris, France
gert-jan.both@cri-paris.org

Remy Kusters
Centre de Recherche Interdisciplinaire (CRI), Inserm
Paris, France
remy.kusters@cri-paris.org

arXiv:2011.04336v1 [cs.LG] 9 Nov 2020

Abstract

Sparse regression on a library of candidate features has developed as the prime method to discover the PDE underlying a spatio-temporal dataset. As these features consist of higher-order derivatives, model discovery is typically limited to low-noise and dense datasets due to the errors inherent to numerical differentiation. Neural network-based approaches circumvent this limit, but to date have ignored advances in sparse regression algorithms. In this paper we present a modular framework that combines deep learning-based approaches with an arbitrary sparse regression technique. We demonstrate with several examples that this combination facilitates and enhances model discovery tasks. We release our framework as a package at https://github.com/PhIMaL/DeePyMoD.

1 Introduction

Model discovery aims at finding interpretable models in the form of PDEs from large spatio-temporal data sets. Most algorithms apply some form of sparse regression on a predefined set of candidate terms, as initially proposed by Brunton et al. in SINDy [3] and by Rudy et al. with PDE-find [17]. By writing the unknown differential equation as $\partial_t u = f(u, u_x, \ldots)$ and assuming the right-hand side is a linear combination of predefined terms, i.e. $f(u, u_x, \ldots) = au + bu_x + \ldots = \Theta\xi$, model discovery reduces to finding a sparse coefficient vector $\xi$. Calculating the time derivative $u_t$ and the library matrix $\Theta$ is notoriously hard for noisy and sparse data since it involves calculating higher-order derivatives. The error in these terms is typically large due to the use of numerical differentiation, limiting classical model discovery to low-noise and densely sampled datasets. Recent works [9, 21, 6, 15] employ deep learning-based methods to circumvent this issue. For example, Both et al. [2] use a neural network to construct a 'digital twin' [16] of the data and calculate $u_t$ and $\Theta$ from this digital twin using automatic differentiation. This approach significantly improves the accuracy of the time derivative and the library on noisy and sparse data sets, but suffers from convergence issues and, to date, does not include advanced sparse regression techniques.
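To make the sparse-regression formulation concrete, the sketch below builds a small candidate library with numerical derivatives and fits a sparse coefficient vector with a Lasso. It is a generic illustration of the $\Theta\xi$ formulation under our own assumptions (library choice, regular grid, regularisation strength), not SINDy or PDE-find themselves.

```python
# Minimal illustration of library-based model discovery: build Theta from
# numerical derivatives and solve for a sparse xi. The numerical differentiation
# here is exactly the step that degrades on noisy or sparsely sampled data.
import numpy as np
from sklearn.linear_model import Lasso

def discover_pde(u, dx, dt):
    """u: array of shape (n_x, n_t) sampled on a regular space-time grid (assumed)."""
    u_t = np.gradient(u, dt, axis=1)
    u_x = np.gradient(u, dx, axis=0)
    u_xx = np.gradient(u_x, dx, axis=0)
    # Candidate library Theta = [1, u, u_x, u_xx, u*u_x]
    theta = np.stack([np.ones_like(u), u, u_x, u_xx, u * u_x], axis=-1).reshape(-1, 5)
    xi = Lasso(alpha=1e-3, fit_intercept=False).fit(theta, u_t.ravel()).coef_
    return xi  # non-zero entries indicate the active terms of the PDE
```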
In this paper we present a modular framework for model discovery that combines a neural network-based approximation of the presented data with state-of-the-art sparse regression techniques. Our framework consists of a function approximator to construct a surrogate of the data, a function to construct the library of features, a sparse regression algorithm to select the active components from the feature library, and a constraint on the function approximator based on the active components. We show how varying these components improves the performance of model discovery with three experiments: (i) replacing the gradient descent-based constraint with a least squares-based method, (ii) comparing a threshold-based Lasso sparsity estimator with more advanced schemes such as PDE-find [17], and (iii) using alternative neural network architectures such as SIREN [20].

Preprint. Under review.


Figure 1: Schematic overview of our framework. (I) A function approximator constructs a surrogate of the data, (II) from which a library of candidate terms and the time derivative is constructed using automatic differentiation. (III) A sparsity estimator selects the active terms in the library using sparse regression, and (IV) the constraint restricts the function approximator to solutions allowed by the active terms.

2 Deep-learning based model discovery with sparse regression


Framework Deep learning-based model discovery typically uses a neural network to construct a noiseless surrogate $\hat{u}$ of the data $u$. A library of potential terms $\Theta$ is constructed from $\hat{u}$ using automatic differentiation, and the neural network is constrained to solutions allowed by this library [2]. The loss function of the network thus consists of two contributions: (i) a mean squared error to learn the mapping $(\vec{x}, t) \to \hat{u}$ and (ii) a term to constrain the network,

$$L = \frac{1}{N}\sum_{i=1}^{N} \left(u_i - \hat{u}_i\right)^2 + \frac{1}{N}\sum_{i=1}^{N} \left(\partial_t \hat{u}_i - \Theta_i \xi\right)^2. \qquad (1)$$

The sparse coefficient vector ξ is learned concurrently with the network parameters and plays two
roles: 1) determining the active (i.e. non-zero) components of the underlying PDE and 2) constraining
the network according to these active terms. We propose to separate these two tasks by decoupling
the constraint from the sparsity selection process itself. We first calculate a sparsity mask g and then
constrain the network only by the active terms in the mask. Mathematically, we replace ξ by ξ ◦ g.
The sparsity mask g need not be calculated differentiably, so that any classical, non-differentiable
sparse estimator can be used. Our approach has several additional advantages: i) It provides an
unbiased estimate of the coefficient vector since we do not apply l1 or l2 regularisation on ξ, ii) the
sparsity pattern is determined from the full library Θ, rather than only from the remaining active
terms, allowing dynamic addition and removal of active terms throughout training, and iii) we can use
cross validation or similar methods in the sparse estimator to find the optimal hyperparameters for
model selection. Finally, we note that the sparsity mask g mirrors the role of attention in transformers
[1].
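To make the decoupling explicit, the following sketch evaluates the two terms of eq. (1) with the coefficients masked by $g$; `network` and `library_fn` are hypothetical stand-ins for the corresponding modules (our own naming, not the package API), and only the shape conventions follow the text.

```python
import torch

def masked_loss(network, library_fn, coords, u, xi, g):
    """Sketch of eq. (1): data MSE plus the PDE constraint through xi * g.
    `library_fn` is assumed to return the time derivative and the library Theta
    via automatic differentiation; xi and g have shape (n_terms, 1)."""
    coords = coords.clone().requires_grad_(True)     # (N, 2): space and time coordinates
    u_hat = network(coords)                          # surrogate of the data, shape (N, 1)
    u_t, theta = library_fn(u_hat, coords)           # shapes (N, 1) and (N, n_terms)
    mse = torch.mean((u - u_hat) ** 2)               # first term of eq. (1)
    reg = torch.mean((u_t - theta @ (xi * g)) ** 2)  # only masked (active) terms constrain the network
    return mse + reg
```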
Using this change, we construct a general framework for deep learning-based model discovery with any classical sparsity-promoting algorithm, shown in figure 1: (I) a function approximator constructs a surrogate of the data, (II) from which a library of candidate terms and the time derivative is constructed using automatic differentiation; (III) a sparsity estimator selects the active terms in the library using sparse regression, and (IV) the constraint restricts the function approximator to solutions allowed by the active terms.

Training As the sparsity estimator is non-differentiable, determining the sparsity mask before the function approximator has reasonably approximated the data can adversely affect training if the wrong terms are selected. We thus split the dataset into a train and a test set and update the sparsity mask only once the MSE on the test set starts to increase. After updating the mask, the model needs to adjust to the tighter constraint; we hence update the sparsity pattern every 25 epochs after the first update. Final convergence is reached when the $\ell_1$ norm of the coefficient vector remains constant.

In practice we observe that on large datasets with little noise the correct equation may be discovered after a single sparsity update, but that highly noisy datasets typically require several updates, removing only a few terms at a time.
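A minimal sketch of this schedule, with the test-MSE trigger, the 25-epoch refresh and the $\ell_1$ stopping criterion; `train_epoch`, `test_mse` and `estimate_mask` are hypothetical helpers standing in for the corresponding parts of the framework, not the package's actual training loop.

```python
def train(model, data_train, data_test, max_epochs=20000, refresh=25, tol=1e-5):
    best_test, mask_set, last_l1 = float("inf"), False, None
    for epoch in range(max_epochs):
        train_epoch(model, data_train)                 # one gradient step on eq. (1)
        mse = test_mse(model, data_test)
        if not mask_set and mse > best_test:           # test MSE starts to increase:
            model.mask = estimate_mask(model)          # first sparsity update
            mask_set = True
        best_test = min(best_test, mse)
        if mask_set and epoch % refresh == 0:          # refresh the mask every 25 epochs
            model.mask = estimate_mask(model)
            l1 = model.coeffs.abs().sum().item()       # l1 norm of the coefficient vector
            if last_l1 is not None and abs(l1 - last_l1) < tol:
                break                                  # coefficients have stopped changing
            last_l1 = l1
    return model
```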

Package We provide our framework as a Python package at https://github.com/PhIMaL/DeePyMoD, with the documentation available at https://phimal.github.io/DeePyMoD/. Mirroring our approach, each model is composed of four 'plug-in' modules: a function approximator, library, constraint and sparsity estimator module. Each module can be replaced without affecting the other modules, allowing for quick experimentation. Our framework is built on PyTorch [11] and any PyTorch model (e.g. an RNN or a GNN) can be used as function approximator. The sparsity estimator module follows the Scikit-learn API [12, 4], i.e., all the built-in Scikit-learn estimators, as well as compatible ones such as those in PySINDy [19] or sktime [10], can be used. The modularity of our framework allows for easy extension and lets the user benefit from deep learning techniques as well as more classical approaches.
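For instance, the plug-in nature of the modules means that any PyTorch module can fill the function-approximator slot and any estimator exposing the Scikit-learn fit/coef_ interface can produce the sparsity mask; a small sketch (the threshold value and architecture below are chosen arbitrarily for illustration):

```python
import torch.nn as nn
from sklearn.linear_model import LassoCV

# Any PyTorch module can serve as the function approximator ...
approximator = nn.Sequential(
    nn.Linear(2, 30), nn.Tanh(),
    nn.Linear(30, 30), nn.Tanh(),
    nn.Linear(30, 1),
)

# ... and any Scikit-learn style estimator can produce the sparsity mask;
# cross-validation (here via LassoCV) then tunes the sparsity hyperparameter.
def sparsity_mask(theta, u_t, threshold=0.1):
    coefs = LassoCV(fit_intercept=False).fit(theta, u_t.ravel()).coef_
    return abs(coefs) > threshold
```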

3 Experiments
Constraint The sparse coefficient vector $\xi$ in eq. (1) is typically found by optimising it concurrently with the neural network parameters $\theta$. Considering a network with parameter configuration $\theta^*$, the problem of finding $\xi$ can be rewritten as $\min_\xi |u_t(\theta^*) - \Theta(\theta^*)\xi|^2$. This can be solved analytically by least squares under mild assumptions; we calculate $\xi$ by solving this problem every iteration, rather than optimising it using gradient descent. In figure 2 we compare the two constraining strategies on a Burgers dataset¹ by training for 5000 epochs without updating the sparsity mask². Panel A shows that the least squares approach reaches a consistently lower loss. More strikingly, panel B shows that the mean absolute error in the coefficients is three orders of magnitude lower. We explain the difference as a consequence of the random initialisation of $\xi$: the network is initially constrained by incorrect coefficients, prolonging convergence. The random initialisation also causes the larger spread in results compared to the least squares method. The least squares method does not suffer from sensitivity to the initialisation and consistently converges.
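Concretely, the least squares solve amounts to one call per iteration; a minimal sketch, assuming `theta` and `u_t` hold the library and time derivative evaluated at the current network parameters:

```python
import torch

def fit_coeffs(theta: torch.Tensor, u_t: torch.Tensor) -> torch.Tensor:
    """Solve min_xi ||u_t - theta @ xi||^2 in closed form every iteration,
    instead of treating xi as a parameter updated by gradient descent."""
    return torch.linalg.lstsq(theta, u_t).solution

# The resulting xi then enters the constraint term of eq. (1), e.g.
# reg = torch.mean((u_t - theta @ fit_coeffs(theta, u_t)) ** 2)
```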

Figure 2: A) Loss and B) mean absolute error of the coefficients obtained with the gradient descent and the least squares constraint, as a function of the number of epochs. Results have been averaged over twenty runs and the shaded area denotes the standard deviation.

Sparsity estimator The sparsity estimator is not differentiated through, so we can use any sparsity-promoting algorithm. Here we show that a classical method for PDE model discovery, PDE-find [17], can be used together with neural networks to perform model discovery on highly sparse and noisy datasets. In figure 3 we compare it with the thresholded Lasso approach³ [2] on a Burgers dataset⁴ with varying amounts of noise. The PDE-find estimator discovers the correct equation in the majority of cases, even with up to 60%–80% noise, whereas the thresholded Lasso mostly fails at 40%. We emphasise that the modular approach we propose here allows classical and deep learning-based techniques to be combined. More advanced sparsity estimators such as SR3 [5] can easily be included in this framework.

Figure 3: Fraction of correctly discovered Burgers equations (averaged over 10 runs) as a function of the noise level, for the thresholded Lasso and the PDE-find sparsity estimator.

¹ We solve $u_t = u_{xx} + \nu uu_x$ with a delta-peak initial condition for $\nu = 0.1$, $x \in [-3, 4]$, $t \in [0.5, 5]$, randomly sample 2000 points and add 10% white noise.
² All experiments use a tanh-activated network of 5 layers with 30 neurons per layer. The network is optimised using the ADAM optimiser with a learning rate of 2e-3 and $\beta = (0.99, 0.999)$.
³ We use a pre-set threshold of 0.1.
⁴ See footnote 2, only with 1000 points randomly sampled.
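The core of PDE-find is sequential thresholded ridge regression (STRidge); a from-scratch sketch of that idea is shown below, written by us as an illustration with arbitrary default hyperparameters and omitting the train/test threshold tuning of the full algorithm. Because it only consumes $\Theta$ and $u_t$, such an estimator drops straight into the sparsity-estimator slot.

```python
import numpy as np

def stridge(theta, u_t, lam=1e-5, threshold=0.1, iters=10):
    """Sequential thresholded ridge regression: repeatedly solve a ridge problem
    on the active terms and drop coefficients below the threshold."""
    n_terms = theta.shape[1]
    active = np.ones(n_terms, dtype=bool)
    xi = np.zeros(n_terms)
    for _ in range(iters):
        A = theta[:, active]
        xi_active = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ u_t)
        xi[:] = 0.0
        xi[active] = xi_active
        active = np.abs(xi) > threshold        # prune small coefficients
        if not active.any():
            break
    return xi, active                          # coefficients and the sparsity mask g
```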

Function approximator We show in figure 4 that a tanh-based NN fails to converge on a dataset of the Kuramoto-Sivashinsky (KS) equation⁵ (panels A and B). Consequently, the coefficient vectors are incorrect (panel D). As our framework is agnostic to the underlying function approximator, we instead use a SIREN⁶, which is able to learn very sharp features in the underlying dynamics. In panel B we show that a SIREN is able to learn the complex dynamics of the KS equation, and in panel C that it discovers the correct equation⁷. This example shows that the choice of function approximator can be a decisive factor in the success of neural network-based model discovery. Using our framework we can also explore using RNNs, Neural ODEs [14] or Graph Neural Networks [18].
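A SIREN replaces the pointwise nonlinearity by a sine with a specific initialisation [20]; a minimal PyTorch sketch follows, where the $\omega_0 = 30$ factor and the initialisation bounds follow the SIREN paper, the layer sizes follow footnote 6, and the remaining scaffolding is our own assumption.

```python
import numpy as np
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by sin(omega_0 * x), initialised as proposed in [20]."""
    def __init__(self, in_features, out_features, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        bound = 1.0 / in_features if is_first else np.sqrt(6.0 / in_features) / omega_0
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

# Drop-in replacement for the tanh-based function approximator (8 layers, 50 neurons):
siren = nn.Sequential(
    SineLayer(2, 50, is_first=True),
    *[SineLayer(50, 50) for _ in range(6)],
    nn.Linear(50, 1),
)
```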


Figure 4: A) Solution of the KS equation; the lower panel shows the cross section at the last time point, t = 44. B) MSE as a function of the number of epochs for both the tanh-based NN and the SIREN. Coefficients as a function of the number of epochs for C) the SIREN and D) the tanh-based NN. The bold curves in panels C and D are the terms of the KS equation; green: $uu_x$, blue: $u_{xx}$ and orange: $u_{xxxx}$. Only the SIREN is able to discover the correct equation.

⁵ We solve $\partial_t u + uu_x + u_{xx} + u_{xxxx} = 0$ for $x \in [0, 100]$, $t \in [0, 44]$, randomly sample 25000 points and add 5% white noise.
⁶ Both networks use 8 layers with 50 neurons. We train the SIREN using ADAM with a learning rate of 2.5e-4 and $\beta = (0.999, 0.999)$.
⁷ In bold; $uu_x$: green, $u_{xx}$: blue and $u_{xxxx}$: orange.

Acknowledgments and Disclosure of Funding
Thanks to the Bettencourt Schueller Foundation long-term partnership, this work was partly supported by the CRI Research Fellowship to Remy Kusters. We thank NVIDIA for supplying the GPU under the Academic Grant program. We thank the authors and contributors of NumPy [7], SciPy [22], Scikit-learn [12], Matplotlib [8], IPython [13], and PyTorch [11] for making our work possible through their open-source software. The authors declare no competing interest.

References
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Translation by
Jointly Learning to Align and Translate”. en. In: arXiv:1409.0473 [cs, stat] (May 2016). arXiv:
1409.0473. URL: http://arxiv.org/abs/1409.0473 (visited on 09/30/2020).
[2] Gert-Jan Both et al. “DeepMoD: Deep learning for Model Discovery in noisy data”. en.
In: arXiv:1904.09406 [physics, q-bio, stat] (Apr. 2019). arXiv: 1904.09406. URL: http :
//arxiv.org/abs/1904.09406 (visited on 11/12/2019).
[3] Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. “Discovering governing equations
from data by sparse identification of nonlinear dynamical systems”. en. In: Proceedings of
the National Academy of Sciences 113.15 (Apr. 2016), pp. 3932–3937. ISSN: 0027-8424,
1091-6490. DOI: 10.1073/pnas.1517384113. URL: http://www.pnas.org/lookup/
doi/10.1073/pnas.1517384113 (visited on 04/18/2019).
[4] Lars Buitinck et al. “API design for machine learning software: experiences from the scikit-
learn project”. en. In: arXiv:1309.0238 [cs] (Sept. 2013). arXiv: 1309.0238. URL: http :
//arxiv.org/abs/1309.0238 (visited on 09/30/2020).
[5] Kathleen Champion et al. “A unified sparse optimization framework to learn parsimonious
physics-informed models from data”. en. In: arXiv:1906.10612 [physics] (June 2019). arXiv:
1906.10612. URL: http://arxiv.org/abs/1906.10612 (visited on 10/28/2019).
[6] Zhao Chen, Yang Liu, and Hao Sun. “Deep learning of physical laws from scarce data”. en.
In: arXiv:2005.03448 [physics, stat] (May 2020). arXiv: 2005.03448. URL: http://arxiv.
org/abs/2005.03448 (visited on 05/31/2020).
[7] Charles R. Harris et al. “Array programming with NumPy”. en. In: Nature 585.7825 (Sept.
2020), pp. 357–362. ISSN: 0028-0836, 1476-4687. DOI: 10.1038/s41586-020-2649-2.
URL: http://www.nature.com/articles/s41586-020-2649-2 (visited on 09/30/2020).
[8] John D. Hunter. “Matplotlib: A 2D Graphics Environment”. In: Computing in Science Engi-
neering 9.3 (May 2007). Conference Name: Computing in Science Engineering, pp. 90–95.
ISSN: 1558-366X. DOI: 10.1109/MCSE.2007.55.
[9] Zichao Long et al. “PDE-Net: Learning PDEs from Data”. en. In: arXiv:1710.09668 [cs, math,
stat] (Oct. 2017). arXiv: 1710.09668. URL: http://arxiv.org/abs/1710.09668 (visited
on 04/18/2019).
[10] Markus Löning et al. “sktime: A Unified Interface for Machine Learning with Time Series”.
en. In: (), p. 10.
[11] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”.
en. In: arXiv:1912.01703 [cs, stat] (Dec. 2019). arXiv: 1912.01703. URL: http://arxiv.
org/abs/1912.01703 (visited on 12/09/2019).
[12] Fabian Pedregosa et al. “Scikit-learn: Machine Learning in Python”. en. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[13] F. Perez and B. E. Granger. “IPython: A System for Interactive Scientific Computing”. In:
Computing in Science Engineering 9.3 (May 2007), pp. 21–29. ISSN: 1558-366X. DOI: 10.
1109/MCSE.2007.53.
[14] Christopher Rackauckas et al. “Universal Differential Equations for Scientific Machine Learn-
ing”. en. In: arXiv:2001.04385 [cs, math, q-bio, stat] (Jan. 2020). arXiv: 2001.04385. URL:
http://arxiv.org/abs/2001.04385 (visited on 05/28/2020).
[15] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. “Physics Informed Deep
Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations”. en.
In: arXiv:1711.10561 [cs, math, stat] (Nov. 2017). arXiv: 1711.10561. URL: http://arxiv.
org/abs/1711.10561 (visited on 04/18/2019).

5
[16] Adil Rasheed, Omer San, and Trond Kvamsdal. “Digital Twin: Values, Challenges and Enablers
From a Modeling Perspective”. en. In: IEEE Access 8 (2020), pp. 21980–22012. ISSN: 2169-
3536. DOI: 10.1109/ACCESS.2020.2970143. URL: https://ieeexplore.ieee.org/
document/8972429/ (visited on 05/28/2020).
[17] Samuel H. Rudy et al. “Data-driven discovery of partial differential equations”. en. In: Science
Advances 3.4 (Apr. 2017), e1602614. ISSN: 2375-2548. DOI: 10.1126/sciadv.1602614.
URL : http://advances.sciencemag.org/lookup/doi/10.1126/sciadv.1602614
(visited on 04/18/2019).
[18] Sungyong Seo and Yan Liu. “Differentiable Physics-informed Graph Networks”. en. In:
arXiv:1902.02950 [cs, stat] (Feb. 2019). arXiv: 1902.02950. URL: http://arxiv.org/abs/
1902.02950 (visited on 05/28/2020).
[19] Brian M. de Silva et al. “PySINDy: A Python package for the Sparse Identification of Nonlinear
Dynamics from Data”. en. In: arXiv:2004.08424 [physics] (Apr. 2020). arXiv: 2004.08424.
URL: http://arxiv.org/abs/2004.08424 (visited on 09/30/2020).
[20] Vincent Sitzmann et al. “Implicit Neural Representations with Periodic Activation Functions”.
In: arXiv:2006.09661 [cs, eess] (June 2020). arXiv: 2006.09661. URL: http://arxiv.org/
abs/2006.09661 (visited on 07/20/2020).
[21] Yifan Sun, Linan Zhang, and Hayden Schaeffer. “NeuPDE: Neural Network Based Or-
dinary and Partial Differential Equations for Modeling Time-Dependent Data”. en. In:
arXiv:1908.03190 [cs, stat] (Aug. 2019). arXiv: 1908.03190. URL: http://arxiv.org/
abs/1908.03190 (visited on 05/28/2020).
[22] Pauli Virtanen et al. “SciPy 1.0–Fundamental Algorithms for Scientific Computing in Python”.
en. In: Nature Methods 17.3 (Mar. 2020). arXiv: 1907.10121, pp. 261–272. ISSN: 1548-7091,
1548-7105. DOI: 10.1038/s41592-019-0686-2. URL: http://arxiv.org/abs/1907.
10121 (visited on 09/30/2020).
