
Closure Discovery for Coarse-Grained Partial Differential Equations

using Multi-Agent Reinforcement Learning

Jan-Philipp von Bassewitz 1 2 Sebastian Kaltenbach 1 2 Petros Koumoutsakos 2

1 ETH Zurich, Switzerland   2 Harvard SEAS, United States. Correspondence to: Petros Koumoutsakos <petros@seas.harvard.edu>.

arXiv:2402.00972v1 [cs.LG] 1 Feb 2024

Abstract

Reliable predictions of critical phenomena, such as weather, wildfires and epidemics, are often founded on models described by Partial Differential Equations (PDEs). However, simulations that capture the full range of spatio-temporal scales in such PDEs are often prohibitively expensive. Consequently, coarse-grained simulations that employ heuristics and empirical closure terms are frequently utilized as an alternative. We propose a novel and systematic approach for identifying closures in under-resolved PDEs using Multi-Agent Reinforcement Learning (MARL). The MARL formulation incorporates inductive bias and exploits locality by deploying a central policy represented efficiently by Convolutional Neural Networks (CNN). We demonstrate the capabilities and limitations of MARL through numerical solutions of the advection equation and the Burgers' equation. Our results show accurate predictions for in- and out-of-distribution test cases as well as a significant speedup compared to resolving all scales.

1. Introduction

Simulations of critical phenomena such as climate, ocean dynamics and epidemics have become essential for decision-making, and their veracity, reliability, and energy demands have great impact on our society. Many of these simulations are based on models described by PDEs expressing system dynamics that span multiple spatio-temporal scales. Examples include turbulence (Wilcox, 1988), neuroscience (Dura-Bernal et al., 2019), climate (Council, 2012) and ocean dynamics (Mahadevan, 2016). Today, we benefit from decades of remarkable efforts in the development of numerical methods, algorithms, software, and hardware, and witness simulation frontiers that were unimaginable even a few years ago (Hey, 2009). Large-scale simulations that predict the system's dynamics may use trillions of computational elements (Rossinelli et al., 2013) to resolve all spatio-temporal scales, but these often address only idealized systems and their computational cost prevents experimentation and uncertainty quantification. On the other hand, reduced order and coarse-grained models are fast, but limited by the linearization of complex system dynamics, while their associated closures are often based on heuristics (Peng et al., 2021).

We propose to complement coarse-grained simulations with closures in the form of policies that are discovered by a multi-agent reinforcement learning framework. A central policy that is based on CNNs and uses locality as inductive bias corrects the error of the numerical discretization and improves the overall accuracy. This approach is inspired by recent work in Reinforcement Learning (RL) for image reconstruction (Furuta et al., 2019). The agents learn and act locally on the grid, in a manner that is reminiscent of the numerical discretizations of PDEs based on Taylor series approximations. Hence, each agent only gets information from its spatial neighborhood. Together with a central policy, this results in efficient training, as each agent needs to process only a limited amount of observations. After training, the framework is capable of accurate predictions in both in- and out-of-distribution test cases. We remark that the actions taken by the agents are highly correlated with the numerical errors and can be viewed as an implicit correction to the effective convolution performed via the coarse-grained discretization of the PDEs (Bergdorf et al., 2005).

2. Related Work

The main contribution of this work is a MARL algorithm for the discovery of closures for coarse-grained PDEs. The algorithm provides an automated process for the correction of the errors of the associated numerical discretization. We show that the proposed method is able to capture scales beyond the training regime and provides a potent method for solving PDEs with high accuracy and limited resolution.


Machine Learning and Partial Differential Equations: In recent years, there has been significant interest in learning the solution of PDEs using Neural Networks. Techniques such as PINNs (Raissi et al., 2019; Karniadakis et al., 2021), DeepONet (Lu et al., 2021), the Fourier Neural Operator (Li et al., 2020), NOMAD (Seidman et al., 2022), Clifford Neural Layers (Brandstetter et al., 2023) and an invertible formulation (Kaltenbach et al., 2023) have shown promising results for both forward and inverse problems. However, there are concerns about their accuracy and related computational cost, especially for low-dimensional problems (Karnakov et al., 2022). These methods aim to substitute numerical discretizations with neural nets, in contrast to our RL framework, which aims to complement them. Moreover, their loss function is required to be differentiable, which is not necessary for the stochastic formulation of the RL reward.

Reinforcement Learning: Our work is based upon recent advances in RL (Sutton & Barto, 2018), including PPO (Schulman et al., 2017) and the Remember and Forget Experience Replay (ReF-ER) algorithm (Novati et al., 2019). It is related to other methods that employ multi-agent learning for closures (Novati et al., 2021; Albrecht et al., 2023; Wen et al., 2022; Bae & Koumoutsakos, 2022; Yang & Wang, 2020), except that here we use as reward not global data (such as the energy spectrum) but the local discretization error as quantified by fine-grained simulations. The present approach is designed to solve various PDEs using a central policy and is related to similar work for image reconstruction (Furuta et al., 2019). We train a central policy for all agents, in contrast to other methods where agents are trained on decoupled subproblems (Freed et al., 2021; Albrecht et al., 2023; Sutton et al., 2023).

Closure Modeling: The development of machine learning methods for discovering closures for under-resolved PDEs has gained attention in recent years. Current approaches are mostly tailored to specific applications such as turbulence modeling (Ling et al., 2016; Durbin, 2018; Novati et al., 2021; Bae & Koumoutsakos, 2022) and use data such as energy spectra and drag coefficients of the flow in order to train the RL policies. In (Lippe et al., 2023), a more general framework based on diffusion models is used to improve the solutions of Neural Operators for temporal problems using a multi-step refinement process. Their training is based on supervised learning, in contrast to the present RL approach, which additionally complements existing numerical methods instead of neural network based surrogates (Li et al., 2020; Gupta & Brandstetter, 2022).

Inductive Bias: The incorporation of prior knowledge regarding the physical processes described by the PDEs into machine learning algorithms is critical for their training in the Small Data regime and for increasing the accuracy during extrapolative predictions (Goyal & Bengio, 2022). One way to achieve this is by shaping appropriately the loss function (Karniadakis et al., 2021; Kaltenbach & Koutsourelakis, 2020; Yin et al., 2021; Wang et al., 2021; 2022), or by incorporating parameterized mappings that are based on known constraints (Greydanus et al., 2019; Kaltenbach & Koutsourelakis, 2021; Cranmer et al., 2020). Our RL framework incorporates locality and is thus consistent with numerical discretizations that rely on local Taylor series based approximations. The incorporation of inductive bias in RL has also been attempted by focusing on the beginning of the training phase (Uchendu et al., 2023; Walke et al., 2023) in order to shorten the exploration phase.

3. Methodology

We consider a time-dependent, parametric, non-linear PDE defined on a regular domain Ω. The solution of the PDE depends on its initial conditions (ICs) as well as its input parameters C ∈ C. The PDE is discretized on a spatiotemporal grid and the resulting discrete set of equations is solved using numerical methods.

In turn, the number of the deployed computational elements and the structure of the PDE determine whether all of its scales have been resolved or whether the discretization amounts to a coarse-grained representation of the PDE. In the first case, the fine grid simulation (FGS) provides the discretized solution ψ, whereas in the coarse grid simulation (CGS) the resulting approximation is denoted by ψ̃ (in the following, all variables with a tilde refer to the coarse-grid description). The RL policy can improve the accuracy of the solution ψ̃ by introducing an appropriate forcing term in the right-hand side of the CGS. For this purpose, FGS of the PDE are used as training episodes and serve as the ground truth to facilitate a reward signal. The CGS and FGS employed herein are introduced in the next section. The proposed MARL framework is introduced in Section 3.2.

3.1. Coarse and Fine Grid Simulation

We consider a FGS of a spatiotemporal PDE on a Cartesian 3D grid with temporal resolution ∆t for N_T temporal steps and spatial resolution ∆x, ∆y, ∆z, resulting in Nx, Ny, Nz discretization points. The CGS entails a coarser spatial discretization ∆x̃ = d∆x, ∆ỹ = d∆y, ∆z̃ = d∆z as well as a coarser temporal discretization ∆t̃ = d_t ∆t. Here, d is the spatial and d_t the temporal scaling factor. Consequently, at a time-step ñ, we define the discretized solution function of the CGS as ψ̃^ñ ∈ Ψ̃ := R^(k×Ñx×Ñy×Ñz), with k being the number of solution variables. The corresponding solution function of the FGS at time-step n_f is ψ^(n_f) ∈ Ψ := R^(k×Nx×Ny×Nz).


The discretized solution function of the CGS can thus be described as a subsampled version of the FGS solution function, and the subsampling operator S : Ψ → Ψ̃ connects the two.

The time stepping operator of the CGS, G : Ψ̃ × C̃ → Ψ̃, leads to the update rule

    ψ̃^(ñ+1) = G(ψ̃^ñ, C̃^ñ).    (1)

Similarly, we define the time stepping operator of the FGS as F : Ψ × C → Ψ. In the case of the CGS, C̃ ∈ C̃ is an adapted version of C, which for instance involves evaluating a parameter function on the coarse grid.

To simplify the notation, we set n := ñ = d_t n_f in the following. Hence, we apply F d_t times in order to describe the same time instant with the same index n in both CGS and FGS. The update rule of the FGS is then referred to as

    ψ^(n+1) = F^(d_t)(ψ^n, C^n).    (2)

In line with the experiments performed in Section 4, and to further simplify the notation, we drop the third spatial dimension in the following presentation.
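As a minimal illustration of the coarse/fine relation just defined, the following NumPy sketch implements the subsampling operator S as plain strided subsampling (for the Burgers' case a d × d mean filter is applied first, see Section 4.2) and applies a fine-grid step d_t times so that coarse and fine states refer to the same time instant (n = ñ = d_t n_f). The function names and the fine-grid stepper are placeholders, not the solvers used in the paper.

```python
import numpy as np

def subsample(psi_fine: np.ndarray, d: int) -> np.ndarray:
    """Subsampling operator S: keep every d-th grid point of a (k, Nx, Ny) field."""
    return psi_fine[:, ::d, ::d]

def advance_to_same_time(step_fine, psi_fine, params, dt_factor: int):
    """Apply the fine-grid operator F dt_factor times so that one coarse step and
    dt_factor fine steps describe the same physical time instant."""
    for _ in range(dt_factor):
        psi_fine = step_fine(psi_fine, params)
    return psi_fine

# Illustrative shapes: k = 1 solution variable, 256x256 fine grid, d = 4 -> 64x64 coarse grid.
psi = np.random.rand(1, 256, 256)
psi_coarse_ref = subsample(psi, d=4)
print(psi_coarse_ref.shape)  # (1, 64, 64)
```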
3.2. RL Environment

Figure 1. Illustration of the training environment with the agents embedded in the CGS. The reward measures how much the action taken improves the CGS.

The environment of our RL framework is summarized in Figure 1. We define the state at step n of the RL environment as the tuple S^n := (ψ^n, ψ̃^n). This state is only partially observable, as the policy is acting only in the CGS. The observation is defined as the coarse representation of the discretized solution function and the parameters, O^n := (ψ̃^n, C̃^n) ∈ O. Our goal is to train a policy π that makes the dynamics of the CGS close to the dynamics of the FGS. To achieve this goal, the action A^n ∈ A := R^(k×Ñx×Ñy) at step n of the environment is a collection of forcing terms, one for each discretization point of the CGS. In case the policy is later used to complement the CGS simulation, the update function in Equation (1) changes to

    ψ̃^(n+1) = G(ψ̃^n − A^n, C̃^n).    (3)

To encourage learning a policy that represents the non-resolved spatio-temporal scales, the reward is based on the difference between the CGS and FGS at time step n. In more detail, we define a local reward R^n ∈ R^(Ñx×Ñy), inspired by the reward proposed for image reconstruction in (Furuta et al., 2019):

    R^n_ij = (1/k) Σ_{w=1}^{k} ( [ψ̃^n − S(ψ^n)]^2 − [ψ̃^n − A^n − S(ψ^n)]^2 )_{wij}.    (4)

Here, the square [·]^2 is computed per matrix entry. We note that the reward function therefore returns a matrix that gives a scalar reward for each discretization point of the CGS. If the action A^n brings the discretized solution function of the CGS, ψ̃^n, closer to the subsampled discretized solution function of the FGS, S(ψ^n), the reward is positive, and vice versa. In case A^n corresponds to the zero matrix, the reward is the zero matrix as well.

During the training process, the objective is to find the optimal policy π* that maximizes the mean expected reward per discretization point, given the discount rate γ and the observations:

    π* = argmax_π E_π [ Σ_{n=0}^{∞} γ^n r̄^n ]    (5)

with the mean expected reward

    r̄^n = (1/(Ñx · Ñy)) Σ_{i,j=1}^{Ñx,Ñy} R^n_ij.    (6)
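The per-point reward of Eq. (4) and the mean reward of Eq. (6) can be written compactly; the following is a minimal NumPy sketch with illustrative array names.

```python
import numpy as np

def local_reward(psi_coarse, action, psi_fine_sub):
    """Per-point reward of Eq. (4): improvement of the squared error after applying the
    forcing term, averaged over the k solution variables.
    All arrays have shape (k, Nx_c, Ny_c); psi_fine_sub is S(psi^n)."""
    err_before = (psi_coarse - psi_fine_sub) ** 2
    err_after = (psi_coarse - action - psi_fine_sub) ** 2
    return np.mean(err_before - err_after, axis=0)   # shape (Nx_c, Ny_c)

def mean_reward(reward_field):
    """Mean expected reward per discretization point, Eq. (6)."""
    return reward_field.mean()

# Example: a zero action yields a zero reward field.
k, nx, ny = 1, 64, 64
psi_c = np.random.rand(k, nx, ny)
target = np.random.rand(k, nx, ny)
print(np.allclose(local_reward(psi_c, np.zeros_like(psi_c), target), 0.0))  # True
```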
3.3. Multi-Agent Formulation

The policy π predicts a local action A^n_ij ∈ R^k at each discretization point, which implies a very high-dimensional continuous action space. Hence, formulating the closures with a single agent is very challenging. However, since the rewards are designed to be inherently local, locality can be used as inductive bias and the RL learning framework can be interpreted as a multi-agent problem (Furuta et al., 2019). One agent is placed at each discretization point of the coarse grid with a corresponding local reward R^n_ij. We remark that this approach augments adaptivity, as one can place extra agents at additional, suitably selected, discretization points. Each agent develops its own optimal policy and Equation (5) is replaced by

    π*_ij = argmax_{π_ij} E_{π_ij} [ Σ_{n=0}^{∞} γ^n R^n_ij ].    (7)

We use O(Ñx Ñy) agents which, for grid sizes typical of numerical simulations, is a large number compared to typical MARL problem settings (Albrecht et al., 2023).


We parametrize the local policies using neural networks. However, since training this many individual neural nets can become computationally expensive, we parametrize all the agents together using one fully convolutional network (FCN) (Long et al., 2015).

3.4. Parallelizing Local Policies with a FCN

All local policies are parametrized using one FCN, such that one forward pass through the FCN computes the forward pass for all the agents at once. This approach enforces the aforementioned locality, and the receptive field of the FCN corresponds to the spatial neighborhood that an agent at a given discretization point can observe.

We define the collection of policies in the FCN as

    Π^FCN : O → A.    (8)

In further discussions, we will refer to Π^FCN as the global policy. The policy π_ij of the agent at point (i, j) is subsequently implicitly defined through the global policy as

    π_ij(O_ij) := [Π^FCN(O)]_ij.    (9)

Here, O_ij contains only a part of the input contained in O; its exact content depends on the receptive field of the FCN. For consistency, we will refer to O_ij as a local observation and to O as the global observation. We note that the policies π_ij(O_ij) map the observation to a probability distribution at each discretization point (see Section 3.6 for details).

Similar to the global policy, the global value function is parametrized using a FCN as well. It maps the global observation to an expected return H ∈ H := R^(Ñx×Ñy) at each discretization point,

    V^FCN : O → H.    (10)

Similarly to the local policies, we define the local value function related to the agent at point (i, j) as v_ij(O_ij) := [V^FCN(O)]_ij.

3.5. Algorithmic Details

In order to solve the optimization problem in Equation (7) with the PPO algorithm (Schulman et al., 2017), we modify the single-agent approach. Policy updates are performed by taking gradient steps on

    E_{S,A∼Π^FCN} [ (1/(Ñx · Ñy)) Σ_{i,j=1}^{Ñx,Ñy} L_{π_ij}(O_ij, A_ij) ]    (11)

with the local version of the PPO objective L_{π_ij}(O_ij, A_ij). This corresponds to the local objective of the policy of the agent at point (i, j) (Schulman et al., 2017).

The global value function is trained on the MSE loss

    L_V(O^n, G^n) = ||V^FCN(O^n) − G^n||_2^2,

where G^n ∈ R^(Ñx×Ñy) represents the actual global return observed by interaction with the environment and is computed as G^n = Σ_{i=n}^{N} γ^(i−n) R^i. Here, N represents the length of the respective trajectory.

We have provided an overview of the modified PPO algorithm in Appendix A, together with further details regarding the local version of the PPO objective L_{π_ij}(O_ij, A_ij). Our implementation of the adapted PPO algorithm is based on the single-agent PPO algorithm of the Tianshou framework (Weng et al., 2022).

3.6. Neural Network Architecture

Figure 2. The IRCNN backbone takes the current global observation and maps it to a feature tensor. This feature tensor is passed into two different convolutional layers that predict the per-discretization-point Gaussian parameters and state-values. In the case of a PDE with three spatial dimensions, the architecture would need to be based on three-dimensional convolutional layers instead.

The optimal policy is expected to compensate for errors introduced by the numerical method and its implementation on a coarse grid. We use the Image Restoration CNN (IRCNN) architecture proposed in (Zhang et al., 2017) as the backbone for our policy and value networks. In Figure 2 we present an illustration of the architecture and show that the policy and value network share the same backbone. They only differ by their last convolutional layer, which takes the features extracted by the backbone and maps them either to the parameters µ_A ∈ A and σ_A ∈ A or to the predicted return value for each agent. The local policies are assumed to be independent, such that we can write the distribution at a specific discretization point as

    π_ij(A_ij | O_ij) = N(µ_A,ij, σ_A,ij).    (12)

During training, the actions are sampled to allow for exploration, and during inference only the mean is taken as the action of the agent.
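A minimal PyTorch sketch of a policy/value network in the spirit of Figure 2: a small shared convolutional backbone with a per-point Gaussian policy head and a per-point value head, using circular padding for periodic boundary conditions (discussed below). This is a simplified stand-in, not the exact IRCNN configuration of Table 5, and all names and layer widths are illustrative.

```python
import torch
import torch.nn as nn

class PolicyValueFCN(nn.Module):
    """Shared convolutional backbone with a per-point Gaussian policy head and a value head."""
    def __init__(self, in_channels: int = 3, action_channels: int = 1, width: int = 64):
        super().__init__()
        conv = lambda cin, cout, dil: nn.Conv2d(
            cin, cout, kernel_size=3, padding=dil, dilation=dil, padding_mode="circular")
        self.backbone = nn.Sequential(
            conv(in_channels, width, 1), nn.ReLU(),
            conv(width, width, 2), nn.ReLU(),
            conv(width, width, 3), nn.ReLU(),
        )
        self.policy_head = conv(width, 2 * action_channels, 1)  # Gaussian mean and log-std per point
        self.value_head = conv(width, 1, 1)                     # expected return per point

    def forward(self, obs: torch.Tensor):
        feat = self.backbone(obs)
        mu, log_sigma = self.policy_head(feat).chunk(2, dim=1)
        value = self.value_head(feat).squeeze(1)
        return mu, log_sigma.exp(), value

net = PolicyValueFCN()
obs = torch.randn(1, 3, 64, 64)   # e.g. concentration plus two velocity components
mu, sigma, value = net(obs)
print(mu.shape, sigma.shape, value.shape)  # (1, 1, 64, 64) (1, 1, 64, 64) (1, 64, 64)
```

Because the network is fully convolutional, one forward pass simultaneously evaluates the policy and value of every agent on the coarse grid.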


We note that the padding method used for the FCN can incorporate boundary conditions into the architecture. For instance, in the case of periodic boundary conditions, we propose to use circular padding, which wraps values around from one end of the input tensor to the other.

In the following, we refer to the proposed MARL framework using a FCN for both the policy and the value function as CNN-based multi-agent reinforcement learning (CNN-MARL).

3.7. Computational Complexity

We note that the computational complexity of the CGS w.r.t. the number of discretization points scales with O(Ñx Ñy). As one forward pass through the FCN also scales with O(Ñx Ñy), the same is true for CNN-MARL. The FGS employs a finer grid, which leads to a computational cost that scales with O(d_t d^2 Ñx Ñy). This indicates a scaling advantage for CNN-MARL compared to a FGS. Based on these considerations, and as shown in the experiments of the next section, CNN-MARL is able to compress some of the computations that are performed on the fine grid, as it significantly improves the CGS while keeping the execution time below that of the FGS.

4. Experiments

4.1. Advection Equation

First, we apply CNN-MARL to the 2D advection equation (the code to reproduce all results in this paper can be found at (URL added upon publication)):

    ∂ψ/∂t + u(x, y) ∂ψ/∂x + v(x, y) ∂ψ/∂y = 0,    (13)

where ψ represents a physical concentration that is transported by the velocity field C = (u, v). We employ periodic boundary conditions (PBCs) on the domain Ω = [0, 1] × [0, 1].

For the FGS, this domain is discretized using Nx = Ny = 256 discretization points in each dimension. To guarantee stability, we employ a time step that ensures that the Courant–Friedrichs–Lewy (CFL) condition (Courant et al., 1928) is fulfilled. The spatial derivatives are calculated using central differences and the time stepping uses the fourth-order Runge-Kutta scheme (Quarteroni & Valli, 2008). The FGS is fourth-order accurate in time and second-order accurate in space.

We construct the CGS by employing the subsampling factors d = 4 and d_t = 4. Spatial derivatives in the CGS use an upwind scheme and time stepping is performed with the forward Euler method, resulting in first-order accuracy in both space and time. All simulation parameters for CGS and FGS are summarized in Table 3.

4.1.1. Initial Conditions

In order to prevent overfitting and promote generalization, we design the initializations of ψ, u and v to be different for each episode while still fulfilling the PBCs and guaranteeing the incompressibility of the velocity field. The velocity fields are sampled from a distribution D_Train^Vortex by taking a linear combination of Taylor-Green vortices and an additional random translational field. Further details are provided in Appendix C.2.1. For visualization purposes, the concentration of a new episode is set to a random sample from the MNIST dataset (Deng, 2012) that is scaled to values in the range [0, 1]. In order to increase the diversity of the initializations, we augment the data by performing random rotations of ±90° in the image loading pipeline.

4.1.2. Training

We train the framework for a total of 2000 epochs and collect 1000 transitions in each epoch. More details regarding the training process as well as the training time are provided in Appendix C.

We note that the number and length of episodes vary during the training process: the episodes are truncated based on the relative Mean Absolute Error (MAE), defined as MAE(ψ^n, ψ̃^n) = (1/(Ñx · Ñy)) Σ_{i,j} |ψ^n_ij − ψ̃^n_ij| / ψ_max, between the CGS and FGS concentrations. Here, ψ_max is the maximum observable value of the concentration ψ. If this error exceeds the threshold of 1.5%, the episode is truncated. This ensures that during training the CGS and FGS stay close to each other, so that the reward signal is meaningful. As the agents improve during the training process, the mean episode length increases, as the two simulations stay closer to each other for longer. We designed this adaptive training procedure in order to obtain stable simulations.

4.1.3. Accurate Coarse Grid Simulation

We also introduce the 'accurate coarse grid simulation' (ACGS) in order to further compare the effects of numerics and grid size in CGS and FGS. ACGS operates on the coarse grid, just like the CGS, but with a higher-order numerical scheme, so that it has the same order of accuracy as the FGS.

4.1.4. In- and Out-of-Distribution MAE

We develop metrics for CNN-MARL by running 100 simulations of 50 time steps each with different ICs. For the in-distribution case, the concentrations are sampled from the MNIST test set and the velocity fields are sampled from D_Train^Vortex. To quantify the performance on out-of-distribution ICs, we also run evaluations on simulations using the Fashion-MNIST dataset (F-MNIST) (Xiao et al., 2017) and a new distribution, D_Test^Vortex, for the velocity fields; the latter is defined in Appendix C.2.2.
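As a concrete illustration of the coarse-grid solver used in Section 4.1 (first-order upwind in space, forward Euler in time, periodic BCs) with the corrective forcing applied as in Eq. (3), here is a minimal NumPy sketch; the grid size, time step and field names are illustrative, and equal spacing in both directions is assumed.

```python
import numpy as np

def upwind_step(psi, u, v, dx, dt, action=None):
    """One CGS update of the 2D advection equation with first-order upwind differences,
    forward Euler and periodic BCs (assumes dy = dx). If a forcing term (the agent
    action) is given, it is subtracted from the state first, as in Eq. (3)."""
    if action is not None:
        psi = psi - action
    # one-sided differences, axis 0 = x, axis 1 = y, periodic wrap-around via np.roll
    dpsi_dx_b = (psi - np.roll(psi, 1, axis=0)) / dx   # backward difference
    dpsi_dx_f = (np.roll(psi, -1, axis=0) - psi) / dx  # forward difference
    dpsi_dy_b = (psi - np.roll(psi, 1, axis=1)) / dx
    dpsi_dy_f = (np.roll(psi, -1, axis=1) - psi) / dx
    dpsi_dx = np.where(u > 0, dpsi_dx_b, dpsi_dx_f)
    dpsi_dy = np.where(v > 0, dpsi_dy_b, dpsi_dy_f)
    return psi - dt * (u * dpsi_dx + v * dpsi_dy)

n = 64
dx = 1.0 / n
dt = 0.9 * dx                       # chosen so that |u| dt / dx <= 1 (CFL) for |u| ~ 1
x, y = np.meshgrid(np.linspace(0, 1, n, endpoint=False),
                   np.linspace(0, 1, n, endpoint=False), indexing="ij")
psi = np.exp(-((x - 0.5) ** 2 + (y - 0.5) ** 2) / 0.01)   # smooth initial concentration
u, v = np.ones_like(psi), np.zeros_like(psi)
psi_next = upwind_step(psi, u, v, dx, dt, action=np.zeros_like(psi))
print(psi_next.shape)  # (64, 64)
```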


[Figure 3: panels showing the relative error of CNN-MARL, CGS and ACGS versus simulation step, the error reduction versus step, and a violin plot of the number of low-MAE steps.]

Figure 3. Results of CNN-MARL applied to the advection equation. The relative MAEs are computed over 100 simulations with respect
to (w.r.t.) the FGS. The shaded regions correspond to the respective standard deviations. The violin plot on the right shows the number of
simulation steps until the relative MAE w.r.t. the FGS reaches the threshold of 1%.

Figure 4. Example run comparing CGS, CNN-MARL and FGS with the same ICs. The concentration is sampled from the test set and the velocity components are randomly sampled from D_Train^Vortex. Note that the agents of the CNN-MARL method have only been trained up to n = 100. However, they qualitatively improve the CGS simulation past that point.

The resulting error metrics of CGS, ACGS and CNN-MARL w.r.t. the FGS are collected in Table 1. A qualitative example of the CGS, FGS and the CNN-MARL method after training is presented in Figure 4. The example shows that CNN-MARL is able to compensate for the dissipation that is introduced by the first-order scheme and the coarse grid in the CGS.

Table 1. Relative MAE at time step 50, averaged over 100 simulations with different ICs for both the velocity and the concentration. All relative MAE values are averaged over the complete domain and reported in percent.

Velocity         D_Train^Vortex                    D_Test^Vortex
Concentration    MNIST          F-MNIST            MNIST          F-MNIST
CGS              3.13 ± 0.80    3.23 ± 0.92        3.82 ± 0.78    3.12 ± 0.73
ACGS             1.90 ± 0.51    2.27 ± 0.64        2.28 ± 0.47    2.23 ± 0.50
CNN-MARL         1.46 ± 0.33    2.12 ± 0.57        1.58 ± 0.37    2.04 ± 0.56
Relative improvements w.r.t. CGS
ACGS             -39%           -30%               -40%           -31%
CNN-MARL         -53%           -34%               -58%           -36%

CNN-MARL reduces the relative MAE of the CGS after 50 steps by 30% or more in both in- and out-of-distribution cases. This shows that the agents have learned a meaningful correction for the truncation errors of the numerical schemes on the coarser grid.

CNN-MARL also outperforms the ACGS w.r.t. the MAE metric, which indicates that the learned corrections emulate a higher-order scheme. This indicates that the proposed methodology is able to emulate the unresolved dynamics and is a suitable option for complementing existing numerical schemes.

We note the strong performance of our framework on the out-of-distribution examples. For both unseen and out-of-distribution ICs as well as parameters, the framework was able to outperform CGS and ACGS. In our opinion, this indicates that we have discovered an actual model of the forcing terms that goes beyond the training scenarios.

4.1.5. Evolution of Numerical Error

The results in the previous sections are mostly focused on the difference between the methods after a rollout of 50 time steps. To analyze how the methods compare over the course of a longer rollout, we analyze the relative MAE at each successive step of a simulation with MNIST and D_Train^Vortex as distributions for the ICs. The results are shown in Figure 3.

The plots of the evolution of the relative error show that CNN-MARL is able to improve the CGS for the entire range of a 400-step rollout, although it has only been trained for 100 steps. This implies that the agents are seeing distributions of the concentration that have never been encountered during training and are able to generalize to these scenarios. When measuring the duration of simulations for which the relative error stays below 1%, we observe that the CNN-MARL method outperforms both ACGS and CGS, indicating that the method is able to produce simulations with higher long-term stability than CGS and ACGS. We attribute this to our adaptive scheme for episode truncation during training, as introduced in Section 4.1.2, and note that the increased stability can be observed well beyond the training regime.


Figure 5. Visual comparison of the input concentration ψ̃^n, the taken actions A^n and the corresponding numerical approximation of the optimal action, from left to right. The figure on the right-hand side qualitatively shows the linear relation between the approximated optimal actions and the taken actions.

4.1.6. Interpretation of Actions

We are able to derive the optimal update rule for the advection CGS that negates the errors introduced by the numerical scheme (see Appendix D for the derivation):

    ψ̃^(n+1) = G(ψ̃^n − ∆t̃ ϵ̃^(n−1), C̃^n).    (14)

Here, ϵ̃^(n−1) ∈ R^(Ñx×Ñy) is the truncation error of the previous step. However, note that there exists no closed-form solution to calculate the truncation error ϵ̃^(n−1) if only the state of the simulation is given.

We therefore employ a numerical approximation of the truncation error for further analysis:

    ϵ̃^n = (∆t̃/2) ψ̃^n_tt + |ũ^n| (∆x̃/2) ψ̃^n_xx + |ṽ^n| (∆ỹ/2) ψ̃^n_yy + O(∆t̃^2, ∆x̃^2, ∆ỹ^2).    (15)

Here, ψ̃^n_xx, ψ̃^n_yy and ψ̃^n_tt are second derivatives of ψ̃ that are numerically estimated using second-order central differences.

We compare the obtained optimal update rule for the CGS with the predicted mean action µ_A of the policy, and thus with the learned actions. Figure 5 visualizes an example, which indicates that there is a strong linear relationship between the predicted mean action and the respective numerical estimate of the optimal action.

Table 2. Mean and standard deviation of Pearson product-moment correlation coefficients between µ_A^n and the numerically approximated −∆t̃ ϵ̃^(n−1) for 100 samples, for different combinations of velocity fields and initializations of the concentration.

                    MNIST          F-MNIST
D_Train^Vortex      0.82 ± 0.05    0.70 ± 0.08
D_Test^Vortex       0.82 ± 0.03    0.72 ± 0.08

In order to further quantify the similarity between the numerical estimates of −∆t̃ ϵ̃^(n−1)_ij and the taken actions, we compute the Pearson product-moment correlation coefficient for 100 samples. The results are presented in Table 2 and show that the learned actions and the (approximated) optimal actions of the CGS are highly correlated for all combinations of seen and unseen ICs and parameters.

Additionally, we note that the truncation error contains a second-order temporal derivative. At first glance, it might seem surprising that the model would be able to predict this temporal derivative, as the agents can only observe the current time step. However, the observation contains both the PDE solution, i.e. the concentration for the present examples, and the parameters of the PDE, i.e. the velocity fields. Thus, both the concentration and the velocity field are passed into the FCN, and due to the velocity fields enough information is present to infer this temporal derivative.

4.2. Burgers' Equation

As a second example, we apply our framework to the 2D viscous Burgers' equation:

    ∂ψ/∂t + (ψ · ∇)ψ − ν ∇^2 ψ = 0.    (16)

Here, ψ := (u, v) consists of both velocity components and the PDE has the single input parameter C = ν. As for the advection equation, we assume periodic boundary conditions on the domain Ω = [0, 1] × [0, 1]. In comparison to the advection example, we are now dealing with two solution variables and thus k = 2.

For the FGS, the aforementioned domain is discretized using Nx = Ny = 150 discretization points in each dimension. Moreover, we again choose ∆t to fulfill the CFL condition for stability (see Table 4). The spatial derivatives are calculated using the upwind scheme and the forward Euler method is used for the time stepping (Quarteroni & Valli, 2008).


[Figure 6: panels showing the relative error of CNN-MARL and CGS versus simulation step, the error reduction versus step, and a violin plot of the number of low-MAE steps.]

Figure 6. Results of CNN-MARL applied to the Burgers’ equation. The relative MAEs are computed over 100 simulations with respect to
the FGS. The shaded regions correspond to the respective standard deviations. The violin plot on the right shows the number of simulation
steps until the relative MAE w.r.t. the FGS reaches the threshold of 10%.

We construct the CGS by employing the subsampling factors d = 5 and d_t = 10. For the Burgers' experiment, we apply a mean filter K with kernel size d × d before the actual subsampling operation. The mean filter is used to eliminate higher frequencies in the fine-grid state variables, which would otherwise lead to the accumulation of large errors. The CGS employs the same numerical schemes as the FGS here. This leads to first-order accuracy in both space and time. All the simulation parameters for CGS and FGS are collected in Table 4.

In this example, where FGS and CGS use the same numerical scheme, the CNN-MARL framework has to focus solely on negating the effects of the coarser discretization.

For training and evaluation, we generate random, incompressible velocity fields as ICs (see Appendix C.2.1 for details) and set the viscosity ν to 0.003. We observed that the training of the predicted actions can be improved by multiplying the predicted forcing terms by ∆t̃. This is consistent with our previous analysis in Section 4.1.6, as the optimal action is also multiplied by this factor. Again, we train the model for 2000 epochs with 1000 transitions each. The maximum episode length during training is set to 200 steps and, again, we truncate the episodes adaptively when the relative error exceeds 20%. Further details on training CNN-MARL for the Burgers' equation can be found in Appendix C. Visualizations of the results can be found in Appendix B.2, and the resulting relative errors w.r.t. the FGS are shown in Figure 6. The CNN-MARL method again improves the CGS significantly, also past the 200 steps seen during training. Specifically, in the range of step 0 to step 100, in which the velocity field changes fastest, we see significant error reductions of up to −80%. When analyzing the duration for which the episodes stay under the relative error of 10%, we observe that the mean number of steps is improved by two orders of magnitude, indicating that the method is able to improve the long-term accuracy of the CGS.

5. Conclusion

We propose a novel methodology, CNN-MARL, for the automated discovery of closure models for coarse-grained discretizations of time-dependent PDEs. We show that CNN-MARL develops a policy that compensates for numerical errors in a CGS of the 2D advection and Burgers' equations. More importantly, we find that, after training, the learned closure model can be used for predictions in extrapolative test cases. In turn, the agent actions are interpretable and highly correlated with the error terms of the numerical scheme.

CNN-MARL utilizes a multi-agent formulation with a FCN for both the policy and the value network. This enables the incorporation of both local and global rewards without necessitating individual neural networks for each agent. This setup allows us to efficiently train a large number of agents without the need for decoupled subproblems. Moreover, the framework trains on rollouts without needing to backpropagate through the numerical solver itself.

A current drawback is the requirement of a regular grid. The framework would still be applicable to an irregular grid, but information about the geometry would need to be added to the observations. Moreover, for very large systems of interest the numerical schemes used are often multi-grid, and the RL framework should reflect this. For such cases, we suggest defining separate rewards for each of the grids employed by the numerical solver.

We believe that the introduced framework holds significant promise for advancing the discovery of closure models in a wide range of systems governed by Partial Differential Equations, as well as in other computational methods such as coarse-grained molecular dynamics.


Broader Impact Statement

In this paper, we have presented a MARL framework for systematic identification of closure models in coarse-grained PDEs. The anticipated impact pertains primarily to the scientific community, whereas the use in real-life applications would require additional adjustments and expert input.

References

Albrecht, S. V., Christianos, F., and Schäfer, L. Multi-agent reinforcement learning: Foundations and modern approaches. Massachusetts Institute of Technology: Cambridge, MA, USA, 2023.

Bae, H. J. and Koumoutsakos, P. Scientific multi-agent reinforcement learning for wall-models of turbulent flows. Nature Communications, 13(1):1443, 2022.

Bergdorf, M., Cottet, G.-H., and Koumoutsakos, P. Multilevel adaptive particle methods for convection-diffusion equations. Multiscale Modeling & Simulation, 4(1):328–357, 2005.

Brandstetter, J., Berg, R. v. d., Welling, M., and Gupta, J. K. Clifford neural layers for pde modeling. ICLR, 2023.

Council, N. R. A National Strategy for Advancing Climate Modeling. The National Academies Press, 2012.

Courant, R., Friedrichs, K., and Lewy, H. Über die partiellen differenzengleichungen der mathematischen physik. Mathematische Annalen, 100(1):32–74, December 1928. doi: 10.1007/BF01448839. URL http://dx.doi.org/10.1007/BF01448839.

Cranmer, M., Greydanus, S., Hoyer, S., Battaglia, P., Spergel, D., and Ho, S. Lagrangian neural networks. arXiv preprint arXiv:2003.04630, 2020.

Deng, L. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

Dura-Bernal, S., Suter, B. A., Gleeson, P., Cantarelli, M., Quintana, A., Rodriguez, F., Kedziora, D. J., Chadderdon, G. L., Kerr, C. C., Neymotin, S. A., et al. Netpyne, a tool for data-driven multiscale modeling of brain circuits. Elife, 8:e44494, 2019.

Durbin, P. A. Some recent developments in turbulence closure modeling. Annual Review of Fluid Mechanics, 50:77–103, 2018.

Freed, B., Kapoor, A., Abraham, I., Schneider, J., and Choset, H. Learning cooperative multi-agent policies with partial reward decoupling. IEEE Robotics and Automation Letters, 7(2):890–897, 2021.

Furuta, R., Inoue, N., and Yamasaki, T. Pixelrl: Fully convolutional network with reinforcement learning for image processing, 2019.

Goyal, A. and Bengio, Y. Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A, 478(2266):20210068, 2022.

Greydanus, S., Dzamba, M., and Yosinski, J. Hamiltonian neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Gupta, J. K. and Brandstetter, J. Towards multi-spatiotemporal-scale generalized pde modeling. arXiv preprint arXiv:2209.15616, 2022.

Hey, T. The fourth paradigm. United States of America, 2009.

Kaltenbach, S. and Koutsourelakis, P.-S. Incorporating physical constraints in a deep probabilistic machine learning framework for coarse-graining dynamical systems. Journal of Computational Physics, 419:109673, 2020.

Kaltenbach, S. and Koutsourelakis, P.-S. Physics-aware, probabilistic model order reduction with guaranteed stability. ICLR, 2021.

Kaltenbach, S., Perdikaris, P., and Koutsourelakis, P.-S. Semi-supervised invertible neural operators for bayesian inverse problems. Computational Mechanics, pp. 1–20, 2023.

Karnakov, P., Litvinov, S., and Koumoutsakos, P. Optimizing a discrete loss (odil) to solve forward and inverse problems for partial differential equations using machine learning tools. arXiv preprint arXiv:2205.04611, 2022.

Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., and Yang, L. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.

Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., and Anandkumar, A. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020.

Ling, J., Kurzawski, A., and Templeton, J. Reynolds averaged turbulence modelling using deep neural networks with embedded invariance. Journal of Fluid Mechanics, 807:155–166, 2016.

Lippe, P., Veeling, B. S., Perdikaris, P., Turner, R. E., and Brandstetter, J. Pde-refiner: Achieving accurate long rollouts with neural pde solvers. arXiv preprint arXiv:2308.05732, 2023.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation, 2015.


Lu, L., Jin, P., Pang, G., Zhang, Z., and Karniadakis, G. E. Learning nonlinear operators via deeponet based on the universal approximation theorem of operators. Nature Machine Intelligence, 3(3):218–229, 2021.

Mahadevan, A. The impact of submesoscale physics on primary productivity of plankton. Annual Review of Marine Science, 8:161–184, 2016.

Novati, G., Mahadevan, L., and Koumoutsakos, P. Controlled gliding and perching through deep-reinforcement-learning. Physical Review Fluids, 4(9):093902, 2019.

Novati, G., de Laroussilhe, H. L., and Koumoutsakos, P. Automating turbulence modelling by multi-agent reinforcement learning. Nature Machine Intelligence, 3(1):87–96, 2021.

Peng, G. C., Alber, M., Buganza Tepole, A., Cannon, W. R., De, S., Dura-Bernal, S., Garikipati, K., Karniadakis, G., Lytton, W. W., Perdikaris, P., et al. Multiscale modeling meets machine learning: What can we learn? Archives of Computational Methods in Engineering, 28:1017–1037, 2021.

Quarteroni, A. and Valli, A. Numerical approximation of partial differential equations, volume 23. Springer Science & Business Media, 2008.

Raissi, M., Perdikaris, P., and Karniadakis, G. E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.

Rossinelli, D., Hejazialhosseini, B., Hadjidoukas, P., Bekas, C., Curioni, A., Bertsch, A., Futral, S., Schmidt, S. J., Adams, N. A., and Koumoutsakos, P. 11 PFLOP/s simulations of cloud cavitation collapse. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, pp. 3:1–3:13, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2378-9. doi: 10.1145/2503210.2504565. URL http://doi.acm.org/10.1145/2503210.2504565.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation, 2018.

Seidman, J., Kissas, G., Perdikaris, P., and Pappas, G. J. Nomad: Nonlinear manifold decoders for operator learning. Advances in Neural Information Processing Systems, 35:5601–5613, 2022.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018.

Sutton, R. S., Machado, M. C., Holland, G. Z., Szepesvari, D., Timbers, F., Tanner, B., and White, A. Reward-respecting subtasks for model-based reinforcement learning. Artificial Intelligence, 324:104001, 2023.

Uchendu, I., Xiao, T., Lu, Y., Zhu, B., Yan, M., Simon, J., Bennice, M., Fu, C., Ma, C., Jiao, J., et al. Jump-start reinforcement learning. In International Conference on Machine Learning, pp. 34556–34583. PMLR, 2023.

Walke, H. R., Yang, J. H., Yu, A., Kumar, A., Orbik, J., Singh, A., and Levine, S. Don't start from scratch: Leveraging prior data to automate robotic reinforcement learning. In Conference on Robot Learning, pp. 1652–1662. PMLR, 2023.

Wang, S., Wang, H., and Perdikaris, P. Learning the solution operator of parametric partial differential equations with physics-informed deeponets. Science Advances, 7(40):eabi8605, 2021.

Wang, S., Sankaran, S., and Perdikaris, P. Respecting causality is all you need for training physics-informed neural networks. arXiv preprint arXiv:2203.07404, 2022.

Wen, M., Kuba, J., Lin, R., Zhang, W., Wen, Y., Wang, J., and Yang, Y. Multi-agent reinforcement learning is a sequence modeling problem. Advances in Neural Information Processing Systems, 35:16509–16521, 2022.

Weng, J., Chen, H., Yan, D., You, K., Duburcq, A., Zhang, M., Su, Y., Su, H., and Zhu, J. Tianshou: A highly modularized deep reinforcement learning library. Journal of Machine Learning Research, 23(267):1–6, 2022. URL http://jmlr.org/papers/v23/21-1127.html.

Wilcox, D. C. Multiscale model for turbulent flows. AIAA Journal, 26(11):1311–1320, 1988.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.

Yang, Y. and Wang, J. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583, 2020.

Yin, Y., Le Guen, V., Dona, J., de Bézenac, E., Ayed, I., Thome, N., and Gallinari, P. Augmenting physical models with deep networks for complex dynamics forecasting. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124012, 2021.

Zhang, K., Zuo, W., Gu, S., and Zhang, L. Learning deep cnn denoiser prior for image restoration, 2017.


Algorithm 1 Adapted PPO Algorithm

Input: initial policy parameters θ_1, initial value-function parameters ϕ_1, clip ratio ϵ, discount factor γ, entropy regularization weight β
for k = 1, 2, . . . do
    Collect a set of trajectories D_k by running the global policy Π^FCN_{θ_k} in the environment
    # Update global policy
    Update θ by performing an SGD step on
        θ_{k+1} = argmax_θ (1/|D_k|) Σ_{(O^n, A^n) ∈ D_k} L_Π(O^n, A^n, θ_k, θ, β)
    # Update global value network
    Compute returns for each transition using G^n = Σ_{i=n}^{N} γ^(i−n) R^i, where N is the length of the respective trajectory
    Update ϕ by performing an SGD step on
        ϕ_{k+1} = argmin_ϕ (1/|D_k|) Σ_{(O^n, G^n) ∈ D_k} L_V(O^n, G^n, ϕ)
end for

A. Adapted PPO Algorithm


As defined in Section 3.5, the loss function for the global value function is

    L_V(O^n, G^n, ϕ) = ||V^FCN_ϕ(O^n) − G^n||_2^2.

We note that this notation contains the parameters ϕ, which parameterize the underlying neural network.

The objective for the global policy is defined as

    L_Π(O^n, A^n, θ_k, θ, β) := (1/(Ñx · Ñy)) Σ_{i,j=1}^{Ñx,Ñy} [ L_{π_ij}(O^n_ij, A^n_ij, θ_k, θ) − β H(π_ij,θ) ],

where H(π_ij,θ) is the entropy of the local policy and L_{π_ij} is the standard single-agent PPO-Clip objective

    L_{π_ij}(o, a, θ_k, θ) = min( (π_ij,θ(a|o) / π_ij,θ_k(a|o)) Adv^{π_ij,θ_k}(o, a),
                                  clip(π_ij,θ(a|o) / π_ij,θ_k(a|o), 1 − ϵ, 1 + ϵ) Adv^{π_ij,θ_k}(o, a) ).

The advantage estimates Adv^{π_ij,θ_k} are computed with generalized advantage estimation (GAE) (Schulman et al., 2018) using the output of the global value network V^FCN_ϕ.

The resulting adapted PPO algorithm is presented in Algorithm 1. The major differences compared to the original PPO algorithm are vectorized versions of the value-network loss and of the PPO-Clip objective, as well as a summation over all the discretization points of the domain before performing an update step.
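A minimal PyTorch sketch of the vectorized losses described above: the per-point PPO-Clip term averaged over all discretization points, an entropy term, and the per-point value regression loss. This is a stand-alone illustration rather than the Tianshou-based implementation used in the paper, and, following the standard PPO convention (and the stated purpose of the entropy regularization in Appendix C), the entropy term is added here as a bonus to the maximized objective. All tensor names and values are illustrative.

```python
import torch

def ppo_losses(log_prob_new, log_prob_old, advantage, entropy,
               value_pred, value_target, clip_eps=0.2, beta=0.1):
    """All inputs have shape (B, Nx_c, Ny_c): one entry per agent/discretization point.
    Returns the clipped policy loss (to minimize) and the value MSE loss."""
    ratio = (log_prob_new - log_prob_old).exp()
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    # Mean over batch and over all discretization points, plus an entropy bonus.
    policy_obj = torch.min(unclipped, clipped).mean() + beta * entropy.mean()
    value_loss = ((value_pred - value_target) ** 2).mean()
    return -policy_obj, value_loss

B, nx, ny = 4, 64, 64
lp_new, lp_old = torch.randn(B, nx, ny), torch.randn(B, nx, ny)
adv, ent = torch.randn(B, nx, ny), torch.rand(B, nx, ny)
vp, vt = torch.randn(B, nx, ny), torch.randn(B, nx, ny)
policy_loss, value_loss = ppo_losses(lp_new, lp_old, adv, ent, vp, vt)
print(policy_loss.item(), value_loss.item())
```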


B. Additional Results
B.1. Advection Equation

Figure 7. ψ^0 is sampled from the MNIST test set. The velocity field is sampled from D_Test^Vortex (see Appendix C.2.2). Note that the velocity field comes from a different distribution than the one used for training.

Figure 8. ψ^0 is a sample from the F-MNIST test set. The velocity field is sampled from D_Train^Vortex (see Appendix C.2.1). Here, the IC of the concentration comes from a different distribution than the one used for training.


B.2. Burgers’ Equation

Figure 9. ψ^0 is a sample from D_Train^Vortex. We plot the velocity magnitude of every 20th step of a simulation with 100 coarse time steps. The example shows that CNN-MARL also qualitatively keeps the simulation closer to the FGS. The failure of the CGS to account for the subgrid-scale dynamics leads to diverging trajectories.

Figure 10. Velocity magnitude plotted for a 100-step rollout. The set-up is the same as in Figure 9 and the example again shows that CNN-MARL leads to qualitatively different dynamics during a rollout that are closer to those of the FGS.


C. Technical Details on Hyperparameters and Training Runs


During training, we use entropy regularization in the PPO objective with a factor of 0.1 and 0.05 for the advection and Burgers' equation, respectively, to encourage exploration. The discount factor is set to 0.95 and the learning rate to 1 · 10^-5. Training is done over 2000 epochs. In each epoch, 1000 transitions are collected. One policy network update is performed after having collected one new episode. We use a batch size of 10 for training. The total number of trainable parameters amounts to 188,163 and the entire training procedure took about 8 hours for the advection equation on an Nvidia A100 GPU. For the Burgers' equation, training took about 30 hours on the same hardware. We save the policy every 50 epochs and log the corresponding MAE between CGS and FGS after 50 time steps. For evaluation on the advection equation, we chose the policy from epoch 1500 because it had the lowest logged MAE value. Figure 11 and Figure 12 show the reward curves and evolutions of episode lengths. As expected, the episode length increases as the agents become better at keeping the CGS and FGS close to each other.
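For reference, the settings listed above can be collected in a single Python dictionary; the key names below are illustrative and do not correspond to the actual training scripts.

```python
# Illustrative summary of the training hyperparameters listed above.
hyperparameters = {
    "entropy_weight": {"advection": 0.1, "burgers": 0.05},
    "discount_factor": 0.95,
    "learning_rate": 1e-5,
    "epochs": 2000,
    "transitions_per_epoch": 1000,
    "batch_size": 10,
    "trainable_parameters": 188_163,
    "checkpoint_interval_epochs": 50,
    "selected_policy_epoch_advection": 1500,
}
```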

Figure 11. Visualizations of the evolution of the reward metric averaged over the agents and episode length during training on the advection
equation.

Figure 12. Visualizations of the evolution of the reward metric averaged over the agents and episode length during training on the Burgers’
equation.


Table 3. Numerical values of the simulation parameters used for the advection CGS and FGS. Note that ∆t̃ is chosen to guarantee that the CFL condition in the CGS is fulfilled.

Ω                  [0, 1] × [0, 1]
Ñx, Ñy             64
d, dt              4
Resulting other parameters:
Nx, Ny             d · Ñx, d · Ñy = 256
∆x, ∆y             1/Nx, 1/Ny ≈ 0.0039
∆x̃, ∆ỹ             1/Ñx, 1/Ñy ≈ 0.0156
∆t̃                 0.9 · min(∆x̃, ∆ỹ) ≈ 0.0141
∆t                 ∆t̃/dt ≈ 0.0035
Discretization schemes:
FGS, Space         Central difference
FGS, Time          Fourth-order Runge-Kutta
CGS, Space         Upwind
CGS, Time          Forward Euler

Table 4. Numerical values of the simulation parameters used for the Burgers' CGS and FGS. Again, ∆t̃ is chosen to guarantee that the CFL condition in the CGS is fulfilled.

Ω                  [0, 1] × [0, 1]
Ñx, Ñy             30
d, dt              5, 10
Resulting other parameters:
Nx, Ny             d · Ñx, d · Ñy = 150
∆x, ∆y             1/Nx, 1/Ny ≈ 0.0067
∆x̃, ∆ỹ             1/Ñx, 1/Ñy ≈ 0.0333
∆t̃                 0.9 · min(∆x̃, ∆ỹ) = 0.03
∆t                 ∆t̃/dt = 0.003
Discretization schemes:
FGS, Space         Upwind
FGS, Time          Forward Euler
CGS, Space         Upwind
CGS, Time          Forward Euler

C.1. Receptive Field of FCN


In our CNN-MARL problem setting, the receptive field of the FCN corresponds to the observation O_ij that the agent at point (i, j) is observing. In order to gain insight into this, we analyze the receptive field of our chosen architecture.

In the case of the given IRCNN architecture, the size of the receptive field (RF) of layer i + 1 can be calculated recursively from the RF of the preceding layer as

    RF_{i+1} = RF_i + (Kernel Size_{i+1} − 1) · Dilation_{i+1}    (17)
             = RF_i + 2 · Dilation_{i+1}.    (18)

The RF of the first layer, RF_1, is equal to its kernel size. By applying the recursive rule, we can calculate the RF at each layer and arrive at a value of RF_7 = 33 for the entire network.


From this, it follows that the agent at point (i, j) observes a 33 × 33 patch of the domain centered at its own location.
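A short sketch of the receptive-field recursion in Eqs. (17)–(18), applied to the kernel sizes and dilations of Table 5:

```python
def receptive_field(kernel_sizes, dilations):
    """RF_1 = kernel size of the first layer; RF_{i+1} = RF_i + (K_{i+1} - 1) * dilation_{i+1}."""
    rf = kernel_sizes[0]
    fields = [rf]
    for k, d in zip(kernel_sizes[1:], dilations[1:]):
        rf += (k - 1) * d
        fields.append(rf)
    return fields

# Layers of Table 5 (all 3x3 kernels) with dilations 1, 2, 3, 4, 3, 2, 1.
print(receptive_field([3] * 7, [1, 2, 3, 4, 3, 2, 1]))  # [3, 7, 13, 21, 27, 31, 33]
```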

Table 5. Hyperparameters of each of the convolutional layers of the neural network architecture used for the advection equation experiment
and the resulting receptive field (RF) at each layer. For the Burgers’ equation experiment the architecture is simply adapted by setting the
number of in channels of Conv2D 1 to 2 and the number of out channels of Conv2D π to 4.

Layer In Channels Out Channels Kernel Padding Dilation RF


Conv2D 1 3 64 3 1 1 3
Conv2D 2 64 64 3 2 2 7
Conv2D 3 64 64 3 3 3 13
Conv2D 4 64 64 3 4 4 21
Conv2D 5 64 64 3 3 3 27
Conv2D 6 64 64 3 2 2 31
Conv2D π 64 2 3 1 1 33
Conv2D V 64 1 3 1 1 33

C.2. Diverse Velocity Field Generation


C.2.1. Distribution for Training
For the advection equation experiment, the velocity field is randomly generated by taking a linear combination of Taylor-Green vortices and an additional random translational field. Let u_ij^(TG,k), v_ij^(TG,k) be the velocity components of the Taylor-Green vortex with wave number k, defined as

    u_ij^(TG,k) := cos(k x_i) · sin(k y_j)    (20)
    v_ij^(TG,k) := − sin(k x_i) · cos(k y_j).    (21)

Furthermore, define the velocity components of a translational velocity field as u^TL, v^TL ∈ R. To generate a random incompressible velocity field, we sample 1 to 4 k's from the set {1, ..., 6}. For each k, we also sample a sign_k uniformly from the set {−1, 1} in order to randomize the vortex directions. For the additional translation term, we sample u^TL, v^TL independently from uniform(−1, 1). We then initialize the velocity field to

    u_ij := u^TL + Σ_k sign_k · u_ij^(TG,k)    (22)
    v_ij := v^TL + Σ_k sign_k · v_ij^(TG,k).    (23)

We will refer to this distribution of vortices as D_Train^Vortex.

For the Burgers' equation experiment, we make some minor modifications to the sampling procedure. The sampling of translational velocity components is omitted and 2 to 4 k's are sampled from {2, 4, 6, 8}. The latter ensures that the periodic boundary conditions are fulfilled during initialization, which is important for the stability of the simulations.
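A minimal NumPy sketch of the sampling procedure for D_Train^Vortex (Eqs. 20–23). It assumes grid coordinates scaled to [0, 2π) so that integer wave numbers are periodic, draws the wave numbers without replacement (the text does not specify this), and uses illustrative names and grid size.

```python
import numpy as np

def sample_train_vortex_field(n: int, rng: np.random.Generator):
    """Random incompressible field: signed Taylor-Green vortices plus a random translation."""
    # Grid coordinates assumed scaled to [0, 2*pi) so integer wave numbers are periodic.
    x, y = np.meshgrid(np.linspace(0, 2 * np.pi, n, endpoint=False),
                       np.linspace(0, 2 * np.pi, n, endpoint=False), indexing="ij")
    u = np.full((n, n), rng.uniform(-1, 1))   # translational components u^TL, v^TL
    v = np.full((n, n), rng.uniform(-1, 1))
    wave_numbers = rng.choice(np.arange(1, 7), size=rng.integers(1, 5), replace=False)
    for k in wave_numbers:
        sign = rng.choice([-1, 1])
        u += sign * np.cos(k * x) * np.sin(k * y)    # Eq. (20), (22)
        v -= sign * np.sin(k * x) * np.cos(k * y)    # Eq. (21), (23)
    return u, v

u, v = sample_train_vortex_field(64, np.random.default_rng(0))
print(u.shape, v.shape)  # (64, 64) (64, 64)
```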

C.2.2. Distribution for Testing


First, a random sign, sign, is sampled from the set {−1, 1}. Subsequently, a scalar a is randomly sampled from a uniform distribution bounded between 0.5 and 1. This randomization modulates both the magnitude of the velocity components and the direction of the vortex, effectively making the field random yet structured. The functional forms of u_ij^V and v_ij^V are then expressed as

    u_ij^V := sign · a · sin^2(π x_i) sin(2π y_j)    (24)
    v_ij^V := − sign · a · sin^2(π y_j) sin(2π x_i).    (25)

In the further discussion, we will refer to this distribution of vortices as D_Test^Vortex.


Table 6. Runtime of one simulation step in ms of the different simulations averaged over 500 steps.

CGS ACGS CNN-MARL FGS


Advection    0.31 ± 0.00    0.96 ± 0.00    2.66 ± 0.96    89.52 ± 0.47
Burgers'     0.25 ± 0.00    -              1.82 ± 0.01    10.16 ± 0.03

C.3. Computational Complexity


To quantitatively compare the execution times of the different simulations, we measure the runtime of performing one
update step of the environment and report them in Table 6. As expected, CNN-MARL increases the runtime of the CGS.
However, it stays below the FGS times by at least a factor of 5. The difference is especially pronounced in the example of
the advection equation, where the FGS uses a high order scheme on a fine grid, which leads to an execution time difference
between CNN-MARL and FGS of more than an order of magnitude.


D. Proof: Theoretically Optimal Action


To explore how the actions of the agents can be interpreted, we analyze the optimal actions based on the used numerical
schemes. The perfect solution would fulfill
M(ψ˜n ) + ϵ˜n = 0.

Here, M is the numerical approximation of the PDE and ϵ̃^n is the truncation error.
We refer to the numerical approximations of the derivatives in the CGS as T^FE for the forward Euler scheme and D^UW for the upwind scheme. We obtain

    0 = M(ψ̃^n) + ϵ̃^n    (26)
      = T^FE(ψ̃^n, ψ̃^(n+1)) + ũ^n D_x^UW(ψ̃^n) + ṽ^n D_y^UW(ψ̃^n) + ϵ̃^n    (27)
      = (ψ̃^(n+1) − ψ̃^n)/∆t̃ + ũ^n D_x^UW(ψ̃^n) + ṽ^n D_y^UW(ψ̃^n) + ϵ̃^n.    (28)

By rewriting, we obtain the following time stepping rule:

    ψ̃^(n+1) = ψ̃^n − ∆t̃ ( ũ^n D_x^UW(ψ̃^n) + ṽ^n D_y^UW(ψ̃^n) + ϵ̃^n )    (30)
             = ψ̃^n − ∆t̃ ( ũ^n D_x^UW(ψ̃^n) + ṽ^n D_y^UW(ψ̃^n) ) − ∆t̃ ϵ̃^n    (31)
             = G(ψ̃^n, C̃^n) − ∆t̃ ϵ̃^n,    (32)

where C̃^n := (ũ^n, ṽ^n). This update rule would theoretically find the exact solution. It involves a coarse step followed by an additive correction after each step. We can define ϕ̃^(n+1) := ψ̃^(n+1) + ∆t̃ ϵ̃^n to bring the equation into the same form as seen in the definition of the RL environment,

    ϕ̃^(n+1) = G(ϕ̃^n − ∆t̃ ϵ̃^(n−1), C̃^n),

which illustrates that the optimal action at step n would be the previous truncation error times the time increment:

    A^n* = ∆t̃ ϵ̃^(n−1).    (33)

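To connect this derivation with the analysis in Section 4.1.6, here is a minimal NumPy sketch of the numerical estimate of ϵ̃^n from Eq. (15) (second-order central differences, periodic BCs) and of the Pearson correlation between a predicted mean action and −∆t̃ ϵ̃^(n−1), as reported in Table 2. All inputs and names are illustrative.

```python
import numpy as np

def truncation_error_estimate(psi_prev, psi, psi_next, u, v, dx, dy, dt):
    """Numerical estimate of Eq. (15) on the coarse grid with periodic BCs:
    eps ~ dt/2 * psi_tt + |u| * dx/2 * psi_xx + |v| * dy/2 * psi_yy."""
    psi_xx = (np.roll(psi, -1, axis=0) - 2 * psi + np.roll(psi, 1, axis=0)) / dx**2
    psi_yy = (np.roll(psi, -1, axis=1) - 2 * psi + np.roll(psi, 1, axis=1)) / dy**2
    psi_tt = (psi_next - 2 * psi + psi_prev) / dt**2
    return 0.5 * dt * psi_tt + 0.5 * np.abs(u) * dx * psi_xx + 0.5 * np.abs(v) * dy * psi_yy

def action_correlation(mean_action, eps_prev, dt):
    """Pearson correlation between the predicted mean action and -dt * eps^{n-1} (cf. Table 2)."""
    approx_optimal = -dt * eps_prev
    return np.corrcoef(mean_action.ravel(), approx_optimal.ravel())[0, 1]

n, dx, dt = 64, 1 / 64, 0.0141
fields = [np.random.rand(n, n) for _ in range(3)]   # psi at steps n-1, n, n+1
eps = truncation_error_estimate(*fields, u=np.ones((n, n)), v=np.ones((n, n)),
                                dx=dx, dy=dx, dt=dt)
print(action_correlation(np.random.rand(n, n), eps, dt))
```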