
Computers and Chemical Engineering 163 (2022) 107858


Neural network programming: Integrating first principles into machine learning models

Andres Carranza-Abaid∗, Jana P. Jakobsen
Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU), Trondheim NO-7491, Norway

Article info

Article history:
Received 4 October 2021
Revised 31 March 2022
Accepted 24 May 2022
Available online 26 May 2022

Keywords:
Hybrid modelling
Physics-informed machine learning
Numerical analysis
Supervised learning
Surrogate modelling

Abstract

This work introduces Neural Network Programming (NNP) as an integrated hybrid modelling approach. NNP consists in formulating a set of first principles equations that is later decomposed and transcribed into an Algorithmically Structured artificial Neural Network (ASNN). NNP leverages the advantages of the universal approximation theorem and neural network optimization algorithms in order to generate physically coherent machine learning models. Since ASNNs are not mere approximations of physics equations, it is not necessary to modify either the gradient or performance function in order to account for errors with respect to the first principles equations. ASNNs are trained faster and more accurately than typical hybrid models because the gradient is computed through automatic differentiation instead of numeric differentiation. It is shown that the same ASNN architecture is transferable between processes with similar characteristics. In particular, a flash separator, distillation column, and a biogas upgrading process were modelled using an identical architecture.

© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

Abbreviations: ANN, artificial neural network; ASNN, algorithmically structured neural network; DNN, deep neural network; NNP, neural network programming; SNN, shallow neural network.
∗ Corresponding author. E-mail address: andres.c.abaid@ntnu.no (A. Carranza-Abaid).

1. Introduction

The complex nature of the physics involved in engineering processes is usually reflected in highly intertwined and sophisticated mechanistic models. This complexity has always been both a challenge and a motivation for the formulation of new modelling strategies and techniques. Although quite rigorous, mechanistic models often exhibit considerable deviations from experimental measurements, perhaps due to unreasonable assumptions about the physics or maybe because some effects were not properly characterized. Recently, the modelling community has shifted its attention to data-driven modelling tools. From quite a few modelling alternatives, machine learning algorithms seem to have the potential to become the dominant tool in the Industry 4.0 era (Sansana et al., 2021a; Venkatasubramanian, 2019). From all the ML methods, Artificial Neural Networks (ANNs) are particularly interesting since they have been used to predict the unfolding mechanism of macromolecules (Torrisi et al., 2020), redesign proteins (Xu et al., 2020) or even perform activities that require complex and high-level decision making (Silver et al., 2016; Vinyals et al., 2019).

The outstanding performance of ML algorithms together with important media coverage has brought the awareness of their capabilities, not only to academia, but also to the general public. However, ML has been around for decades or even centuries if one considers that linear regression was pivotal in the development of these algorithms. Fig. 1 shows that the overall relative amount of ML research publications indexed in SCOPUS with respect to the end of the previous decade has been rapidly increasing. We expect that, in the same fashion as with other novel technologies, the annual research output will reach a maximum and then a decline will come afterwards. Nevertheless, due to the advances in computer science we expect that this technology will be around for several decades. Therefore, it is fundamental to keep improving data-driven modelling paradigms.

There might be some well-founded skepticism about the reliability and applicability of ML and ANNs in many chemical engineering subdisciplines. This might be due to the lack of transparency and the fact that purely data-driven models override first principles relationships. Some of these concerns have been partially addressed in the past with the introduction of hybrid modelling algorithms.


Fig. 1. Relative number of research documents indexed in SCOPUS (the relative number of publications was calculated by dividing the yearly number of publications over the number of publications available in 2010). Keywords: "machine learning" OR "artificial neural networks". "Limited to" Chemical engineering whenever applicable.

However, from our perspective, in most cases the traditional hybrid modelling methodologies address the physics problems more from a computer science perspective rather than from a chemical engineering or physics perspective. Although in some chemical engineering subdisciplines utilizing relaxed physics models is convenient, this is not valid in several subdisciplines in which it is mandatory that certain mathematical relationships are exactly conserved and not only approximated. One example of this is found in thermodynamics, where several restrictions must be accounted for in order to be able to call a set of equations a "thermodynamic model". Some other subfields that require a more rigorous description of the physics are transport phenomena, kinetics modelling or thermophysical property modelling, which, we consider, will find this work especially interesting.

This work proposes a hybrid modelling method named Neural Network Programming (NNP) that utilizes both theoretical knowledge and the universal approximator capabilities of ANNs. In essence, NNP consists in integrating a first-principles model within a customized neural network type called Algorithmically Structured Neural Network (ASNN). NNP not only provides a new set of hybrid models but also gives a new perspective on how to utilize and approach neural networks. In this way, chemical engineers can formulate customized architectures instead of only utilizing borrowed predefined generic architectures (e.g., typical fully connected neural networks). The NNP method has some advantages over purely data-driven algorithms and typical hybrid modelling configurations, including its rigorousness, extrapolation capabilities, interpretability, automatic differentiability (optimized faster and more accurately), ability to represent limit cases (e.g., an appropriate ASNN of a flash separator will not compute vapor molar fractions of a component whose molar composition is 0), and parallel computations (in order to simulate several systems a single call to the ASNN is required instead of using "for" or "while" cycles). Due to the parallel computation advantages given by quantum computing (Huang et al., 2020), this last feature might become critical in the coming decades. All in all, we expect that this approach will have a significant positive impact on the integration of thermodynamics, kinetics, and transport phenomena into ASNNs. Additionally, NNP will aid in the formulation of surrogate models by replacing the nonlinear sections of the models with fully connected ANNs while keeping the rigorous structure of selected mathematical relationships. For example, it is shown that a two-phase flash can utilize a fully connected ANN as a surrogate to avoid the iterative calculations (caused by the thermodynamic equations) while obtaining the exact solution of the system of linear equations (mass balances). This allows the users to replace the nonlinear sections of the model while only solving the simpler parts of the model (e.g., linear equations or simple nonlinear relationships).

2. Modelling methodologies

2.1. First principles models

First principles models are also known as mechanistic, semi-empirical, phenomenological, or white-box models. They are catalogued as rigorous although they are founded on an idealized human interpretation of physics phenomena. Due to their inherent simplifications, these models usually possess empirical parameters that account for unmodelled dynamics (i.e., unmodelled phenomena). Mechanistic models can be regarded as interpretable and physically coherent because the mathematical equations are compelled to be consistent with physics laws (e.g., conservation, kinetics, or thermodynamics laws) despite the use of empirical parameters.

There are different mathematical frameworks that can be utilized for mechanistic modelling in chemical engineering. The first type of framework is known as equation-oriented modelling. These frameworks utilize their own user-friendly programming language and auxiliary chemical-engineering-oriented routines, and require defining all involved variables and equations. Some notable examples of equation-oriented modelling frameworks are gPROMS, Aspen Custom Modeler, Dymola (Elmqvist, 1978), ASCEND II (Piela et al., 1991), Omola/Omsim (Cellier and Elmqvist, 1993; Karl Johan Åström, 1998), EMSO (Soares and Secchi, 2003), ICAS-MoT (Heitzig et al., 2014), JModelica.org, Mosaic/Optimica (Åkesson et al., 2010; Kuntsche et al., 2011), DAE Tools (Nikolić, 2016), the Daedalus modelling framework (Leal et al., 2017), among others. The second category corresponds to phenomenological modelling frameworks. These frameworks explicitly utilize first principles to semiautomatically generate equation-oriented models, where the user specifies information about the problem such as assumptions or the topology of the system. Some phenomenological modelling frameworks are Model.la (Stephanopoulos et al., 1990), Techtool (Linninger and Krendl, 1999), ModKit (Bogusch et al., 2001), OntoCAPE (Morbach et al., 2007), Mobatec (Westerweele and Laurens, 2008), the computer-aided modelling template (Fedorova et al., 2015) and an ontology builder for physical-chemical processes (Preisig, 2015).

In some cases, these frameworks are either unknown by the modelers or are not suitable for specific studies. For instance, there are not any open packages for bifurcation analysis that can be utilized for thermally coupled distillation columns; thus, a combination of different libraries (MatCont and Aspen) was needed in order to perform the hysteresis study (Carranza-Abaíd and González-García, 2020; Dhooge et al., 2003). Another issue regarding the development of mechanistic models is their high requirement for multiple model subroutines that complement the first-principles model. For example, in order to model an absorption column, models for vapor-liquid equilibrium, kinetics, viscosity, surface tension, diffusivity and packing correlations are needed (Faramarzi et al., 2010). The number of subroutines needed in many unit operations can be inconvenient. Thus, in many cases, data-driven models seem to be a more practical option.

2.2. Data-driven models (Artificial neural networks)

Machine learning models are data-driven models that consist of an arrangement of equations that do not explicitly describe the causal interactions between the data; thus, they entirely rely on the data and their quality instead of the shape of the equations. Because of their lack of interpretability, these models are usually known as black boxes.


Fig. 2. Examples of hidden layers: a) Linear combination with hyperbolic tangent transfer function and b) Hadamard product.

Some of the most relevant data-driven modelling methods for chemical engineering are linear regression, Artificial Neural Networks (ANNs), Support Vector Machines (SVM), Multivariate Adaptative Regression Splines (MARS), latent variable methods, and dimensionality reduction methods, to name a few.

ANNs, also known as multilayer perceptrons, are the most prominent data-driven models because of their ability to model highly non-linear systems. They were developed as representations of how biological brains process information (Bishop, 2006; McCulloch and Pitts, 1943; Rosenblatt, 1962). In other words, ANNs are mathematical models that we, as humans, developed to "understand" how we "understand" things. However, due to the extreme complexity of biological brains, ANNs seem to have found broader use in data and computer science rather than in neurological sciences. Their modelling robustness is due to the universal approximation feature of multilayer perceptrons demonstrated in the late 1980s (Cybenko, 1989; Funahashi, 1989; Hornik, 1991; Hornik et al., 1989).

ANNs are constituted by input layers, output layers and hidden layers. The input layers transfer the information specified by the user to the neural network and the output layers deliver the results back to the user. Hidden layers are constituted by smaller building blocks called artificial neurons that transform the information in order to provide a numerical prediction of the outputs as a function of the inputs. Each artificial neuron (hence all hidden layers) must follow the artificial neuron signal transformation process, which consists in combining the input vectors and applying a transfer function afterwards. The combination of the input vectors is usually done through a linear combination. However, the input vectors can also be combined using a Schur / Hadamard product (element-by-element product). The Hadamard product as a means of combining input functions has been mostly used for image processing and classification (e.g., Gao et al. 2018; Jain et al. 2005; Manevitz and Yousef 2007; Perez et al. 2018). On the other hand, transfer functions are usually nonlinear or logic functions like the hyperbolic tangent function, log-sigmoid function, or rectified linear unit (ReLU). In some cases, it is not necessary to perform a non-linear transformation and the transfer function is simply a linear function.

In order to illustrate the artificial neuron signal transformation process, a hidden layer formed of p artificial neurons that linearly combines the input layers and uses a hyperbolic tangent transfer function is shown in Fig. 2a). This hidden layer is mathematically described as

L1 = tanh(μ1X1 + μ2X2 + β) [p],   (1)

where L1 is the output vector with p x 1 dimensions (p is the number of artificial neurons), X1 is an input vector with q1 x 1 dimensions while X2 has q2 x 1 dimensions. The square brackets to the right of the equation are utilized to encompass the number of artificial neurons in a particular hidden layer. The weight matrices μ1 and μ2 have p x q1 and p x q2 dimensions, respectively, while the bias β has p x 1 dimensions. In every hidden layer, the weight matrices reshape the size of the input vectors so that the new column vector has as many elements as the number of artificial neurons.

A second example containing the Schur product (Fig. 2b)) is described with the following equation

L2 = (μ1X1) ⊙ (μ2X2) ⊙ (μ3X3) ⊙ (β) [p],   (2)

where the ⊙ symbol stands for the Schur product. Once again, the fitting parameter matrices μ1, μ2 and μ3 reshape the input vectors (X1, X2 and X3) to the size of the output vector (p x 1 dimensions) to perform the operations.
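As a concrete illustration of the two layer types of Fig. 2, the following MATLAB lines evaluate Eq. (1) and Eq. (2) for arbitrary dimensions; the weights are random placeholders rather than trained parameters, so the sketch only shows the signal transformation itself.

% Minimal sketch of the two hidden-layer types of Fig. 2 (random placeholder weights).
p = 4; q1 = 2; q2 = 3; q3 = 2;              % neurons and input sizes (arbitrary)
X1 = rand(q1,1); X2 = rand(q2,1); X3 = rand(q3,1);

% Eq. (1): linear combination of the inputs followed by a tanh transfer function
mu1 = rand(p,q1); mu2 = rand(p,q2); beta = rand(p,1);
L1 = tanh(mu1*X1 + mu2*X2 + beta);          % p x 1 output vector

% Eq. (2): inputs reshaped to p x 1 and combined with the Schur (Hadamard) product
mu1b = rand(p,q1); mu2b = rand(p,q2); mu3b = rand(p,q3); betab = rand(p,1);
L2 = (mu1b*X1) .* (mu2b*X2) .* (mu3b*X3) .* betab;   % element-by-element product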
As seen in Fig. 2, this work introduces a new way of expressing the architecture of an ANN. This was proposed because we consider that the typical fully connected representation does not provide valuable information aside from the overall input-output relationship. The symbols in bold represent the name of the hidden layer while the subindexes and superscripts are related to the characteristics of the hidden layer. Every layer in this representation may have up to 2 subscripts and 2 superscripts. The first subscript represents the number of artificial neurons p in the hidden layer. The second subscript symbolizes the transfer function used: t for the hyperbolic tangent transfer function, e for the exponential function, l for the natural logarithm transfer function, ⊘ for the Hadamard division function, s for the saturating linear transfer function, r for the rectified linear transfer function, and a blank space for the linear transfer function. The first superscript indicates whether the linear combination (+) or the Hadamard product (⊙) is used. The second superscript indicates whether there is a bias (marked with β) or not (no second superscript).

Fig. 3. FFNN sketches: a) shallow neural network and b) deep neural network.

This work focuses on the application of the NNP method to Feedforward Neural Networks (FFNNs). FFNNs are static time-invariant neural networks and are divided in two subclasses: shallow artificial neural networks (e.g., Fig. 3a)) and deep neural networks (e.g., Fig. 3b)). The difference between a shallow neural network and a deep neural network is that the former must have 2 hidden layers while the latter should have 3 or more hidden layers. In general, shallow neural networks are related to "machine learning" while deep neural networks are related to "deep learning". Despite their differences, according to the universal approximation theorem, both shallow and deep neural networks are universal approximators.

2.3. Hybrid models

Data-driven models that are, to some extent, physics-driven are known as hybrid models or grey-box models. Hybrid modelling combines the flexibility of ML-based models with the rigorousness of mechanistic models. Specifically, they utilize the physics framework inherent to mechanistic models and a data-driven model to reduce the discrepancy between the hybrid model and the experimental data. The typically reported benefits of using hybrid models over the mechanistic and data-driven modelling methodologies are: lower data requirement, more interpretability, more accuracy, and more compliance with physics than purely data-driven models (Bikmukhametov and Jäschke, 2020; Thompson and Kramer, 1994; Zorzetto et al., 2000).


Fig. 4. Hybrid modelling structures.

Several data-driven models have been utilized in hybrid models such as ANNs, support vector machines (Wang et al., 2010), Padé expansion (Tan and Li, 2002), the extended Kalman filter (Sohlberg and Jacobsen, 2008), multivariate adaptive regression splines (MARS) (Duarte and Saraiva, 2003), multivariate discrete-time models (Tulleken, 1993), principal component analysis (Destro et al., 2020), and fuzzy systems (Van Lith et al., 2003), to name a few.

The development of hybrid models can be done by using serial or parallel hybrid modelling structures (Sansana et al., 2021b). Although in both modelling structures the core idea is alike, the mechanistic and the data-driven model interact differently, as illustrated in Fig. 4.

In the serial hybrid modelling paradigm, the data-driven and mechanistic models are set up in sequence (Fig. 4a)). The input variables are fed to the black box model and thereafter to the first principles model (the sequence between the black and white boxes can also be reversed (Tsen et al., 1996)). The mechanistic model has the first principles equations (e.g., conservation laws) while the black box is utilized as the empirical section of the model (e.g., mass transfer coefficients or kinetic constants). The pioneering work by Psichogios and Ungar (1992) for sequential hybrid models was first applied to fed-batch bioreactors in order to substitute the empirical parts of a kinetic model with an ANN. Some other hybrid modelling examples have been used to predict pulp quality (Aguiar and Filho, 2001), kinetics of polymerization processes (Tian et al., 2001), fermentation processes (Bazaei and Majd, 2003), scaling of a pilot plant catalytic cracking (Bollas et al., 2003), crystallization rates (Georgieva et al., 2003; Meng et al., 2019; Nagy et al., 2020), process identification of an ethylene glycol process (Kahrs and Marquardt, 2008), analysis of fixed-bed reactor processes (Azarpour et al., 2017), and cheese fermentation (Ebrahimpour et al., 2021).

The parallel hybrid modelling paradigm consists in formulating a mechanistic model independently of the black box model (Fig. 4b)). The data-driven model is used to predict the difference between the mechanistic model (with its own empirical parameters) and the experimental measurements. The first reported parallel model was done by Su et al. (1992) for the modelling of a reactor system where the output signals of both individual models were summed. A similar approach was proposed by Thompson and Kramer (1994) for the predictions of the kinetics in a fermentation study. Other examples are the knowledge-based modular networks for yeast production processes (Peres et al., 2000), process control with knowledge-enhanced algorithms (Xiong and Jutan, 2002) or monitoring and control of bioreactors (Oliveira, 2004). More recently, improved algorithms and structures that provide enhanced accuracy in parallel hybrid models have been proposed for flowmeters (Bikmukhametov and Jäschke, 2020).

The decision of which modelling structure is better is a function of the amount of available data and the accuracy of the mechanistic model. In general, it has been proposed by Sansana et al. (2021) as a general rule to choose parallel models if the deviation between the mechanistic model and the measurements is considerable; otherwise, a serial model is preferred.
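To make the structural difference of Fig. 4 explicit in code, the sketch below contrasts the serial and parallel couplings for a generic steady-state model; mechanisticModel and blackBox are hypothetical placeholder functions (they are not from the paper), and the NNP/ASNN case is indicated only by a comment because there the physics is built into the network itself (Section 3).

% Conceptual sketch only; mechanisticModel and blackBox are hypothetical placeholders.
u = rand(3,1);                         % process inputs
thetaNominal = [1; 1];                 % nominal empirical parameters (placeholder)

% a) Serial structure: the black box estimates empirical parameters
%    (e.g., kinetic constants) that are then consumed by the mechanistic model.
thetaHat = blackBox(u);
ySerial  = mechanisticModel(u, thetaHat);

% b) Parallel structure: the black box predicts the residual between the
%    mechanistic model and the measurements, and the two signals are summed.
yParallel = mechanisticModel(u, thetaNominal) + blackBox(u);

% c) NNP / ASNN: no outer correction term; the conservation equations are
%    transcribed into the network layers, so a single forward pass of the
%    ASNN returns predictions that satisfy them by construction.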
On the other hand, there are other important types of hybrid modelling paradigms that are not based on these structures. They add information through the gradient of the performance function rather than with a mechanistic model. Their main feature is the utilization of a loss function comprised of different error terms, which are specific to each method. Some of the most remarkable examples are the solution of Partial Differential Equations (PDE) based on Physics-Informed Neural Networks (PINNs) (Raissi et al., 2019) or the theory/physics guided neural networks (Daw et al., 2017; Karpatne et al., 2017). For a more in-depth review of these hybrid modelling approaches the reader can refer to the review by Karniadakis et al., 2021.

The neural network programming (NNP) approach proposed in this work is an integrated method that transcribes first principles equations on the architecture of an ASNN (Fig. 4c)). Applying NNP to a problem yields a new set of neural networks: Algorithmically Structured Neural Networks (ASNNs). NNP simplifies the ASNN training since the performance function does not need to be modified. Utilizing NNP allows the formulation of models that are completely coherent with the physics framework and are not mere approximations (as in gradient-based or parallel structure methods). These features make NNP an ideal tool for fields like thermodynamics or kinetic modelling where the consistency of the model is as important as the accuracy.

ASNNs can be equivalent to serial hybrid models if the same model structure, variables, and information flow are utilized. Nevertheless, as opposed to the typical hybrid model paradigms, ASNNs are automatically differentiable. Automatic differentiation is a technique in which the derivative of an operation is computed by evaluating elementary arithmetic operations; thus, the numerical error is only caused by computational precision limits. Since the typical hybrid model structures are usually not automatically differentiable, they perform numerical differentiations with an inherent numerical error. The numerical error associated with numerical differentiation can be highly inaccurate due to round-off and truncation errors (Güneş Baydin et al., 2018). Even though for small processes with a low amount of experimental data the computational efficiency might not be significant, for large processes training an ANN might become economically prohibitive. Another important factor is the coding efficiency imposed by the ASNN framework, which compels the user to write the model using matrix operations, hence exploiting parallel computing technology.
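The round-off and truncation errors mentioned above are easy to visualize; the following generic MATLAB check (not from the paper) compares a forward finite difference with the analytic derivative of a smooth test function, showing that the finite-difference error cannot be reduced arbitrarily by shrinking the step, whereas automatic differentiation evaluates derivatives to machine precision.

% Illustration of finite-difference error vs. step size for f(x) = exp(sin(x)).
f       = @(x) exp(sin(x));
dfExact = @(x) cos(x).*exp(sin(x));   % analytic derivative (what autodiff reproduces)
x0 = 1.3;
h  = 10.^(-(1:14));                   % decreasing step sizes
fdError = abs((f(x0 + h) - f(x0))./h - dfExact(x0));
% fdError first decreases (truncation error) and then grows again for very
% small h (round-off error); automatic differentiation has neither term.
loglog(h, fdError, 'o-'); xlabel('step size h'); ylabel('|FD - exact|');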
Although ASNNs and PINNs have similar conceptual ideas, the main difference between the two is that PINNs account for the physics and constraints through the performance function while ASNNs account for them through the architecture. This suggests that ASNNs require fewer fitting parameters than PINNs (which usually require several layers with dozens or hundreds of fitting parameters).


Fig. 5. Analogy between the thermodynamic entropy of an isolated system and the parameter entropy in a) ANNs and b) mechanistic models.

However, it is not fair to compare these two approaches in this fashion since PINNs and ASNNs were designed to solve different sets of problems.

2.4. Interpretability

The lack of interpretability of machine learning models might be one of their main drawbacks. In order to address this issue, two method categories for interpreting data-driven models have been reported: model-specific methods and model-agnostic methods (Ribeiro et al., 2016). The model-specific methods are those that can be understood or "read" by analyzing the model structure, like the gradient booster algorithm XG-boost (Chen and Guestrin, 2016), the use of rule lists (Letham et al., 2015; Wang and Rudin, 2015), or additive models (Caruana et al., 2015). The model-agnostic methods evaluate the feature importance and, therefore, are applicable to any machine learning method. For a more in-depth review of interpretability, we recommend Refs. Ribeiro et al. (2016) and Bikmukhametov and Jäschke (2020).

The ANN training procedure heavily relies on random processes that induce modelling entropy into the optimized parameters and, hence, reduce the chances to interpret the model. Randomness can be observed in the selection of the datapoints used for training / validation, the initial values of the optimizable parameters, or the optimization process (e.g., the stochastic gradient descent method). Furthermore, the fact that ANNs utilize generic equations of the same form (e.g., Eq. (1)) does not facilitate their interpretation. Because of this, it is unlikely to produce the same optimized parameter values in different runs. The highly entropic behavior of the ANN training can be compared to the molecular movement in an ideal mixture in an isolated system (like in the Fig. 5a) sketch). Due to the high entropy associated with the molecular movement, it is unlikely that the same molecule arrangement will be obtained even if both runs start at the same conditions (chances are 1 in 24 in this example).

Mechanistic models are expected to have a lower entropy since the equation parameters are constrained by the first principles equations. This behavior can be compared to the molecular movement in a nonideal mixture in an isolated system (as illustrated in Fig. 5b)). Although, in principle, these molecules can "freely" move, the molecule interactions can compel some molecule arrangements to always be formed (the fact that molecules associate has been used for developing local composition activity coefficient models (Renon and Prausnitz, 1968; Wilson, 1964)).

Considering the above, we hypothesize that an ASNN with an appropriate architecture will have a more uniform distribution of the optimized parameters (if trained several times) even if it is trained with the highly entropic algorithms commonly used in ANNs. Therefore, the proper construction of the ASNN might be equally or more important than the training procedure. The interpretability of ASNNs is further discussed in Section 4.1.2.

Fig. 6. Equation decomposition algorithm. HO: hierarchy of operations.

3. Methodology

The NNP algorithm is divided in six steps, which are briefly described below and exemplified in Section 4.

Step 1. Setting the modelling objective and data collection. Definition of the phenomenon or process to be modelled, delimitation of the system boundaries, gathering of experimental data or auxiliary data that can aid the ANN training (e.g., vapor liquid equilibrium data may be useful if a flash separator is to be modelled).

Step 2. Definition of the assumptions, physics laws and constraints. Formulation of the first-principles equations and auxiliary equations to be transcribed. Determination of the model inputs and outputs in accordance with the Degrees of Freedom (DoF) of the first principles model. Defining a first principles system of equations inherently assumes a model structure; therefore, the first principles system of equations must be completely determined. Acknowledging the assumptions and the limitations of the physics equations is paramount in order to modify the ASNN in case the proposed structure does not meet the performance expectations.

Step 3. Identification of the uncertain/surrogate sections of the model. The selection of the empirical parameters/sections that are to be substituted with a universal approximator substructure. A universal approximator substructure is any ANN with a shallow neural network or a deep neural network architecture. It should be remarked that the user must ensure that the uncertain/surrogate sections of the model completely satisfy the system of equations proposed in step 2.

The user has the flexibility to either use traditional semi-empirical parameters (e.g., saturation pressure) or new model parametrizations. The universal approximators can also be utilized as surrogates in order to simplify the modelling process by lumping several parameters into a single substructure.

Step 4. Equation decomposition. Every first principles equation must be decomposed into simpler equations so that they can be transcribed and subsequently arranged. The equation decomposition must follow the rules imposed by the artificial neuron signal transformation process (e.g., Eqs. (1) - (2)). The decomposition algorithm for an arbitrary equation f is presented in Fig. 6, where Xj is the input j fed to the equation Yk, and m is the total number of inputs fed to Yk (including the bias/constant, if any). The function f is defined by placing on the right side the part of the equation to be decomposed.

As seen in Fig. 6, the decomposition is done according to the hierarchy of operations (HO), in other words, in the same order as if the equation were being numerically evaluated on a handheld calculator. Therefore, the decomposition should be done from left to right, from the innermost to the outermost parentheses, and multiplication operations have priority over the addition / subtraction operations. Note that if function Y∗k is not encompassed in a nonlinear function, then Yk = g(Y∗k) = Y∗k. The algorithm ends when the right side of the first principles equation cannot be decomposed anymore (i.e., when Yk = f).

It must be remarked that divisions cannot be expressed as a/b in an ANN, but rather as a ⊙ (1 ⊘ b). Therefore, divisions should be treated as nonlinear functions in NNP (this is valid for the Deep Learning Toolbox of Matlab).
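As a small illustration of the decomposition rules (this example is ours, not from the paper), the expression f = exp(a·x + b)/c is broken down below following the hierarchy of operations, with the division rewritten through the Hadamard product and division so that every step corresponds to an admissible layer operation.

% Worked decomposition of f = exp(a*x + b)/c (illustrative example only).
% Following the HO of Fig. 6, innermost operations are extracted first:
%   Y1 = a*x + b     linear combination with bias   -> linear layer with bias
%   Y2 = exp(Y1)     nonlinear transfer function    -> exponential layer
%   Y3 = 1 (⊘) c     Hadamard division              -> reciprocal layer
%   f  = Y2 (⊙) Y3   Hadamard product               -> product layer
a = 2; b = 0.5; c = 4; x = 1.7;       % arbitrary numeric check
Y1 = a*x + b;
Y2 = exp(Y1);
Y3 = 1./c;
f  = Y2.*Y3;                          % equals exp(a*x + b)/c
assert(abs(f - exp(a*x + b)/c) < 1e-12)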
Step 5. Construction of the ASNN. A solution algorithm should be formulated in order to solve the decomposed equations. In this work, we propose two alternatives for implementing the solution procedure: a sequential solution algorithm and a matrix inversion algorithm (both algorithms are further explained in Section 4). After the solution algorithm is proposed, the ASNN is constructed following the order of computation given by the solution algorithm. The input and output layers correspond to the inputs and outputs selected in steps 2 and 3. The inputs can be fed to the ASNN in different mathematical forms as long as they are not dependent on the computations done by the neural network (this is further discussed in Section 4.2.1).

The algorithm shown in Fig. 6 automatically generates the solution sequence of the decomposed equations. However, if more than one function f is to be decomposed, then the user should decompose each individual function and subsequently arrange all the decomposed equations.

It must be remarked that every vector/connection fed to a hidden layer must have an associated weight matrix (W) (e.g., if the decomposition of an arbitrary equation yields Yk = X1 + X2, the neural network will represent it as Yk = W1X1 + W2X2).

Constructing the ASNN architecture is the only mandatory requirement in this step. Nonetheless, minor structural modifications can be performed in order to simplify the ASNN architecture or to constrain the predictions. The first type of minor modification merges the output of a universal approximator with the equation that it feeds in the decomposition. However, this can only be done if the output of the universal approximator feeds exactly one equation. A second type of minor modification consists in adding process constraints in order to help the ASNN avoid computing unrealistic results if the model is extrapolated too far away from the training conditions (e.g., avoid negative volumes). Minor modifications are not limited to the ones listed above. These modifications are done as a function of each modelled system and are not a requirement for successfully implementing NNP.

Step 6. Training the ASNN. The parameters with low entropy are those that help to transcribe physics laws within the ASNN architecture (identified in step 2) while the high entropy parameters are the ones used for modelling semi-empirical parameters (identified in step 3). Considering this, the low entropy parameters must be fixed so that the physics equations can be properly transcribed. The fixed parameters can either be used to only transfer the signal (e.g., if the identity matrix is used), to perform summations of the input vector (e.g., if an all-ones row vector is used), or to arrange an output vector according to the solution algorithm (e.g., to set up the constant vector C for the matrix inversion algorithm). Utilizing the weight parameter matrices for quite different purposes is one of the main strengths of the NNP algorithm. Conversely, the high entropy parameters are set free so that the ASNN can adjust to the data.

Since the unfixed parameters have a highly entropic behavior, it is recommended to train the model several times and pick the set of parameters that provides the best performance.
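A minimal sketch of this multi-start recommendation is given below; trainASNN and validationError are hypothetical placeholders for the user's actual training routine and validation metric, so only the selection logic is shown.

% Multi-start training sketch; trainASNN and validationError are hypothetical.
nRestarts = 100;
bestError = inf;
for k = 1:nRestarts
    candidate = trainASNN(trainingData);              % new random initialization each run
    err = validationError(candidate, validationData); % error on held-out data
    if err < bestError                                % keep the best-performing parameter set
        bestError = err;
        bestASNN  = candidate;
    end
end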


Fig. 7. Visual representation of the ASNN that represents the VLE.

Step 5. Construction of the ASNN. This example applies the functionality matrix approach from Book and Ramirez (1984), Ramirez (1997) to Eqs. (3), (5), (7) to arrange the solution sequence of the equations. The resulting functionality matrix (Eq. (8)) is

Eq. \ Var.   T   ln psat   ln x   ln p   p   P
(7)          x   x
(3)              x         x      x
(4)                               x      x
(5)                                      x   x

Note that Var. stands for variable. The x marks indicate that the variables are present in the equation that matches the given row. For instance, ln psati, ln xi and ln pi are related by Eq. (3).

The functionality matrix indicates that the solution procedure should evaluate, in order, Eqs. (7), (3), (4) and (5). The resulting ASNN is illustrated in Fig. 7 and given by the following set of equations

L1 = tanh(W1T∗ + β1) [3]   (9)

L2 = W2L1 + β2 [n]   (10)

LR = WR,1 ln(x) + WR,2L2 = ln(p) [n]   (11)

LA = exp(WALR) [n]   (12)

LD = ln(|WDLA|) = ln(P) [1]   (13)

Neither the input nor the output variables are normalized; however, the absolute value of T is reduced according to T∗ = T/1000. Layers L1 and L2 are the representation of the shallow neural network substructure that implicitly calculates the logarithm of the saturation pressure as a function of T∗. According to Eq. (3), ln(x) and ln(psat) must be linearly combined in order to estimate ln p. The auxiliary equation (Eq. (4)) is represented in layer LA, and Dalton's law corresponds to layer LD. Note that layer L2 could have been merged with Raoult's law.

The ASNN could have utilized 4 layers if the physics laws were not represented in their logarithmic form. This emphasizes that an adequate selection of the input and output variables is essential since it can needlessly complicate the ASNN architecture.

The total number of sets of parameters (W) in Eqs. (9)–(13) must be equal to the number of connections that feed a hidden layer (six in this example: W1, W2, WR,1, WR,2, WA, and WD). The number of neurons in layers L2, LR, LA, and LD is equal to the number of components (n) since Raoult's law is used n times (n = 2 in this example). The number of neurons in layer L1 was determined by a trial-and-error procedure and it was found that using 3 neurons provided adequate prediction capabilities without incurring in overfitting.

Step 6. Training the ASNN. Since layers LR, LA and LD represent two physics laws and an exact mathematical relationship, the transformation functions and their parameters are fixed as follows:

• Layer LR: the sets of parameters available in this layer are WR,1 and WR,2. Both weight matrices should be equal to an identity matrix with n x n dimensions.
• Layer LA: the set of parameters that corresponds to this layer is WA, which should be an identity matrix of n x n dimensions.
• Layer LD: WD is an all-ones vector of 1 x n dimensions so that its dot product with LA is equivalent to Dalton's law (Eq. (5)).

The nonfixed parameters (W1 and W2) were trained afterwards using the Bayesian regularization algorithm (Hagan and Menhaj, 1994; MacKay, 1992; Marquardt, 1963). The deep learning toolbox from Matlab 2020b was used for implementing and optimizing the ASNN.

The data was divided using the default parameters from Matlab (70% for training, 15% for validation and 15% for testing). The learning rate was set to 0.001 and the weights of each observation were divided by the square of the experimental value in order to emulate an AARD performance function. No special hyperparameter tuning was required.

The NNP approach showed an adequate representation of the ideal VLE system. The absolute average relative deviation (AARD) of the partial pressures p1 and p2 was estimated to be 0.09% and 0.08%, respectively, while for the total pressure P it is 0.10%. One of the main features of NNP is that it is possible to retrieve and analyze the parameters calculated inside the ASNN. In this case, if the output values of L2 are analyzed, one can notice that they correspond to those of ln(psat); hence, the values of psat can be compared to those calculated with Aspen Plus. In this example, the AARD of psat1 and psat2 are 0.12% and 0.15%, respectively. These values are satisfactory considering that the ASNN was not explicitly trained for modelling these values.
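For readers who prefer code to diagrams, the following MATLAB sketch evaluates Eqs. (9)–(13) for one benzene–toluene input. The fixed weights are the identity and all-ones matrices prescribed in Step 6, while W1, W2, β1 and β2 are random placeholders standing in for the trained shallow-network parameters.

% Forward pass of the VLE ASNN (Eqs. (9)-(13)); trained weights replaced by placeholders.
n  = 2;                               % benzene (1) and toluene (2)
x  = [0.4; 0.6];                      % liquid molar fractions
T  = 380;                             % temperature, K
Tstar = T/1000;                       % reduced temperature input

W1 = rand(3,1); beta1 = rand(3,1);    % placeholders for the trained parameters
W2 = rand(n,3); beta2 = rand(n,1);

L1 = tanh(W1*Tstar + beta1);          % Eq. (9): shallow substructure, 3 neurons
L2 = W2*L1 + beta2;                   % Eq. (10): implicit ln(psat) estimate
LR = eye(n)*log(x) + eye(n)*L2;       % Eq. (11): Raoult's law, fixed identity weights
LA = exp(eye(n)*LR);                  % Eq. (12): partial pressures p
LD = log(abs(ones(1,n)*LA));          % Eq. (13): Dalton's law, ln of total pressure P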
4.1.2. Entropy of the modelling process

Fig. 8. Shallow neural network (SNN) architecture of the benzene (1) – toluene (2) VLE system.

It was discussed in Section 2.4 that providing a structure to an ANN can reduce the entropy of the optimization process, thus sharpening the distribution of the fitting parameters. We tested this hypothesis by comparing the parameter distribution values of the ASNN (Fig. 7) and a shallow neural network (SNN) with 4 artificial neurons (see Fig. 8). In this study, the parameters fixed in step 6 (e.g., WA or WD) were fitted by the ANN training algorithm instead of being fixed. Moreover, biases were added to Eqs. (11)–(13) to obtain, respectively,

LR = WR,1 ln(x) + WR,2L2 + βR [n]   (14)

LA = exp(WALR + βA) [n]   (15)

LD = ln(|WDLA + βD|) [1]   (16)

The biases were only added for the purpose of this study. The ASNN trained in this study is formed by Eqs. (9), (10), (14), (15), and (16). The optimized parameter distribution was analyzed by training both neural networks 1000 times with the same dataset used in Section 4.1.1.

Fig. 9(a) shows the empirical probability distribution of the weight parameters of the second layer of the shallow neural network (W2∗). This distribution is broader and flatter than the optimized parameter distributions of the ASNN (Fig. 9(b) and (d)).


Fig. 9. Histograms of the empirical probability of the optimized parameter values of: a) W2∗ (shallow neural network), b) WR,1, c) WR,2 and d) combined effect of βA and layer WD weights.

These characteristics imply that there is a low likelihood that the same fitting parameter values will be obtained during different training procedures. Therefore, the W2∗ parameters have high entropy and consequently low interpretability. It is not possible to obtain definite conclusions from their numerical values despite having good agreement with the dataset (the AARD of all shallow neural networks is between 0.05 and 0.25%). If more artificial neurons are utilized in the SNN, more variability of the fitting parameters is expected; hence, the chances of interpreting overfitted ANNs are even lower than for non-overfitted ANNs.

The empirical probability distributions of the optimized parameters of the ASNN are presented in Fig. 9(b)–(d). Fig. 9(b) shows two highly sharp distributions that correspond to WR,1. These distributions indicate that there is an 80% chance that the optimized WR,1 parameter values are either 0 or 1. It is worth mentioning that the models whose WR,1 parameters are 0 and 1 have an AARD less than 1%. This suggests that the optimized parameters which have the highest empirical probability are the ones with better predictions.

From now on, we refer to the model developed in Section 4.1.1 as the fixed parameter neural network (FNN) while the models developed in this section are called unfixed parameter neural networks (UNN).

If one analyzes the parameters found in W2, WR,2, WD, βR, and βA, it is not possible to obtain definite conclusions about the parameter behavior due to their blunt distribution. For example, Fig. 9(c) shows a broad and flat empirical probability distribution of the WR,2 parameters. The characteristics of the distribution shown in Fig. 9(c) resemble those of a random probability distribution, hence negating the possibility to draw objective conclusions from it. The results of Fig. 9(c) indicate that WR,2 does not interfere with the prediction capabilities. The accurate predictions of the UNN, together with the fact that the WR,2 and βD values agree with those of the FNN, indicate that these models, despite being different, perform equivalent calculations. For example, due to the lack of information in the ASNN, the parameters WR,2 and βR become highly dependent on the parameter values of W2 and β2. In order to interpret this, let us combine Eqs. (10) and (14):

LR = WR,1 ln(x) + WR,2(W2L1 + β2) + βR   (17)

Eq. (17) indicates that the product WR,2(W2L1 + β2) has infinitely many equivalent solutions. This condition suggests that the parameter distribution of these two sets of parameters should be relatively flat (as shown in Fig. 9(c)). This signifies that the parameter WR,2 has high entropy; therefore, the terms WR,2(W2L1 + β2) + βR in Eq. (17) become equivalent to ln(psat). The difference between the ln(psat) values and the values calculated from WR,2(W2L1 + β2) + βR is quite low (0.03–0.06%). This suggests that the ASNN "knows" that ln(psat) must be calculated; however, it is difficult to interpret it because we provided a sequence of operations with poor interpretability.
neural networks (UNN). There is a similar effect in the Dalton’s law section of the ASNN.
If one analyzes the parameters found in W2 , WR,2 , WD , βR , The bias βA (constant) in Eq. (15) is transformed to its exponential
and βA , it is not possible to obtain definite conclusions about the form and multiplies the parameters in WD . If the overall interaction
parameter behavior due to their blunt distribution. For example, of these parameters is analyzed, two narrow and definite distribu-
Fig. 9(c) shows a broad and flat empirical probability distribu- tions can be seen (Fig. 9(d)). By only considering the UNN with an
tion of the WR,2 parameters. The characteristics of the distribution AARD < 1%, there is a 50% probability that the optimized parame-
shown in Fig. 9(c) resemble those of a random probability distri- ter is -1 and 50% that it is 1. This apparent disagreement with the
bution, hence, neglecting the possibility to draw objective conclu- FNN (where all parameters should be equal to +1) is caused by


This apparent disagreement with the FNN (where all parameters should be equal to +1) is caused by the fact that the argument of the logarithmic function uses absolute values, thus making -1 and 1 equivalent.

4.2. Case study 2: two-product separator

Fig. 10. Diagram of the modelled two-phase separator.

4.2.1. Sequential solution algorithm
Step 1. Setting the modelling objective and data collection. The objective of this example is to develop a model of a vapor-liquid separator of a multicomponent non-ideal mixture (see Fig. 10). This model is expected to predict the molar flows of the components in each product stream. The components present in the mixture are methanol (1), ethanol (2) and water (3).

A dataset containing the results of 50 simulations was utilized to train the ASNN. This dataset was generated with the equilibrium two-phase separator model and the NRTL thermodynamic package available in Aspen Plus v8.6. The input variables used were the molar fractions in the feed stream zi (0 ≤ zi ≤ 1), the separator temperature T (343 K < T < 413 K) and the vaporization fraction Φ (0 ≤ Φ ≤ 1). The values of the independent variables zi and T were randomly generated.

Step 2. Definition of the assumptions, physics laws and constraints. The mass conservation equation for each component in the vapor-liquid separator can be written in terms of molar fractions and the separation or vaporization fraction (Φ = V/F):

ziF = yiΦF + xi(1 − Φ)F   (18)

where zi, yi and xi are the component i molar fractions in the feed, vapor, and liquid streams, respectively. The total molar flows of the feed, vapor and liquid streams are respectively F, V and L. Assuming thermodynamic equilibrium, the molar fractions of the vapor and the liquid phases should hold the following relationship (independently of the magnitude of F):

λi = xi / yi   (19)

where λi is the inverse of the distribution coefficient (k-value) commonly used in vapor-liquid equilibrium calculations. By combining Eqs. (18), (19) and multiplying the result by Φ/F, the following equation is obtained

v∗i = yiΦ = ziΦ / (Φ + λi(1 − Φ))   (20)

The operation Φ/F was performed in order to use scalable variables instead of total molar flows. Therefore, the "reduced" vapor flow v∗i is independent of the magnitude of F (valid for systems in equilibrium). The reduced mass balance for each component in the liquid phase is

l∗i = zi − v∗i   (21)

According to the first principles, the DoF are 4 (T, Φ and 2 independent molar fractions).

Step 3. Identification of the uncertain/surrogate sections of the model. Although the λi parameters are not explicitly semi-empirical, they are calculated in mechanistic models with the product of semi-empirical models (e.g., activity coefficient models or pure component saturation pressure); hence, this is a source of entropy. This means that λi can be calculated with a universal approximator substructure. Consequently, a shallow neural network can be used to predict the λ vector

λ = η(z, T, Φ)   (22)

The Gibbs' phase rule establishes the number of possible inputs to η in Eq. (22).

Steps 4 - 5. Equation decomposition and construction of the ASNN architecture. Since Eqs. (20)–(22) are somewhat complex, it is necessary to break down the equations and organize the solution algorithm. The detailed procedure of the equation decomposition, structural analysis, and equation arrangement of this example can be found in the supplementary information.

The ASNN that models the two-phase separator is illustrated in Fig. 11 and is described by the following set of equations

L1 = tanh(W1I1) [p]   (23)

L2 = (W2,1Φ∗) ⊙ (W2,2L1) [n]   (24)

L3 = 1 ⊘ (W3,1Φ + W3,2L2) [n]   (25)

L4 = (W4,1Φ) ⊙ (W4,2z) ⊙ (W4,3L3) [n],  L4 = 0 if L4 < 0   (26)

L5 = W5,1z + W5,2L4 [n],  L5 = 0 if L5 < 0   (27)

L6 = W6,1z + W6,2L5 [n]   (28)

where I1 is a vector containing z, T and Φ, and Φ∗ = 1 − Φ. Layer L1 and the operation (W2,2L1) in L2 are the shallow neural network substructure (representing Eq. (22)). Eq. (20) was decomposed into smaller operations (Eqs. (24)–(26)). For instance, (W2,1Φ∗) ⊙ (W2,2L1) in Eq. (24) is equivalent to the λ(1 − Φ) term in Eq. (20). Layers L2 to L6 have 3 neurons each because there are n = 3 components, and L1 has p = 8 neurons (adjusted by a trial-and-error procedure).

The homogeneous characteristic of the ASNN (Fig. 11) comes from the fact that z is used as input in I1 in order to estimate λ (because of the equilibrium assumption). If the molar flows f were utilized instead of z, it would mean that the ASNN is assuming a non-homogeneous behavior; therefore, the system could not be in thermodynamic equilibrium. Conversely, if the input z is substituted with the molar flows f, the predicted molar flows would not be reduced molar flows (i.e., v and l instead of v∗ and l∗).

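The MATLAB sketch below evaluates Eqs. (23)–(28) for a single feed condition so that the role of each layer is explicit. W1 and W2,2 are random placeholders for the trained parameters, the remaining matrices take the fixed values listed in Step 6 below (identity matrices and all-ones vectors), and the signs of the fixed ± identity weights are arranged here so that L5 yields the reduced liquid flows and L6 recomputes the vapor flows, as the text describes; input normalization is omitted.

% Forward pass of the two-product separator ASNN (Eqs. (23)-(28)); sketch only.
n   = 3;                               % methanol, ethanol, water
z   = [0.3; 0.3; 0.4];                 % feed molar fractions
T   = 370;  Phi = 0.4;                 % separator temperature and vaporization fraction
PhiStar = 1 - Phi;
I1  = [z; T; Phi];                     % input vector (normalization omitted here)

W1  = rand(8, n+2);                    % placeholder for the trained shallow substructure
W22 = rand(n, 8);                      % placeholder: maps L1 to the lambda vector

L1 = tanh(W1*I1);                                            % Eq. (23)
L2 = (ones(n,1)*PhiStar) .* (W22*L1);                        % Eq. (24): lambda*(1-Phi)
L3 = 1 ./ (ones(n,1)*Phi + eye(n)*L2);                       % Eq. (25): 1/(Phi + lambda*(1-Phi))
L4 = max((ones(n,1)*Phi) .* (eye(n)*z) .* (eye(n)*L3), 0);   % Eq. (26): reduced vapor flows v*
L5 = max(eye(n)*z - eye(n)*L4, 0);                           % Eq. (27): reduced liquid flows l* = z - v*
L6 = eye(n)*z - eye(n)*L5;                                   % Eq. (28): v* recomputed from the mass balance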

Fig. 11. ASNN architecture of the two-product separator using the exact first principles representation.

Step 6. Training the ASNN. The architecture of the ASNN is composed of 4 input layers, 6 hidden layers and 2 output layers. The hidden layers L1 and L2 contain adjustable parameters while layers L3, L4, L5 and L6 contain fixed parameters (some of the parameters in L2 are fixed and some are adjustable). In order to represent the first principles equations, the following considerations must be made:

• The parameter weight matrices W3,2, W4,2, W4,3, W5,2, and W6,2 should be the identity matrix with 3 × 3 dimensions for this ternary mixture.
• The parameter weight matrices W5,1 and W6,1 are the negative identity matrix with 3 × 3 dimensions.
• The parameter weight vectors W2,1, W3,1 and W4,1 should be a unit vector with 3 × 1 dimensions.
• Only the inputs in layer I1 are normalized, using the "mapminmax" function from Matlab (which transforms all variables to values between -1 and 1). Conversely, the remaining layers should not be normalized.

The data used to train the model was divided in the training, validation, and test sets. The training set has 70% of the data points while the validation and test sets have 15% each. In order to distribute the error evenly across all the datapoints, it was necessary to weight the observations. The datapoint weights for l∗ and v∗ were calculated with (1/l∗)^2 and (1/v∗)^2, respectively. Due to the stochastic nature of the neural network training algorithms, the ASNN was trained 100 times and subsequently the model with the lowest AARD was selected. A learning rate of 0.0001 was utilized.

An extra validation dataset consisting of 100,000 datapoints was generated in the same fashion as the training dataset. This was done to test the generalization capabilities of the ASNN on the entire parameter space. The AARDs of important process variables are presented in Table 1 (under the column ASNN, ±0 %). The molar fractions were calculated by normalizing the corresponding l∗ and v∗. The difference between the ASNN and Aspen's calculations is small considering that few datapoints (50 runs) were used for the training. It is possible to obtain better predictions if more training data is used. For example, if 300 datapoints are used, the average error can be reduced to 0.2%. In view of the small AARD, it is reasonable to consider the proposed ASNN as a suitable surrogate model for flash separation of multicomponent mixtures. Utilizing ASNNs for overriding the need for iterative calculations may speed up complex equilibrium calculations with several components.

It is remarkable that the ASNN performance is so high despite the fact that a small dataset (only 50 simulations) made with a random sampling method was used. In the context of surrogate modelling, this suggests that using NNP together with more advanced sampling methods (e.g., Eason and Cremaschi 2014; Nuchitprasittichai and Cremaschi 2013) can provide more extrapolable and robust models.

It is important to note that strictly applying the NNP method would yield an ASNN with 5 hidden layers instead of 6. The purpose of adding layer L6 was to verify the robustness of the NNP method. Layer L6 computes v∗ a second time (note that v∗ is computed in L4 as well) in order to ensure that vector v∗ does not provide negative values. It was observed that the results do not substantially change if the last layer is not added; in fact, if the mass balances are not solved twice in the ASNN, only 0.008% of the solutions have an error in the mass balance larger than 10^-15, while in the case with 6 layers it is 0%. Therefore, it can be determined that the NNP method is robust and it is not necessary to substantially deviate from the core method.

As discussed earlier, ASNNs can be applied for fitting of experimental or process data where noise is expected. In order to test the proposed model, randomly distributed noise was added to the training dataset (the validation dataset remains without noise). The results of training a model with a noisy dataset are shown in Table 1 (ASNN, ±5 % and ASNN, ±20 %). The noise was added to the reduced molar flows l∗ and v∗, then the values of x, y, Φ and λ were recalculated. It can be observed that, as expected, the average error increases when the noise in the datapoints is increased. Despite the aggregated noise, the difference between the trained ASNN and the validation dataset remains within a reasonable range.
ables are presented in Table 1 (under the column ASNN, ±0 %). The A hybrid model based on the serial paradigm was developed in
molar fractions were calculated by normalizing the corresponding order to compare it with the ASNN. The columns labeled as “SHM,
l ∗ and v∗ . The difference between the ASNN and Aspen’s calcula- ±5 % and SHM, ±20 % in Table 1 report the deviation of models
tions is small considering that few datapoints (50 runs) were used generated by a serial hybrid model framework (Fig. 4a). The λi pa-
for the training. It is possible to have better predictions if more rameters were fitted through a shallow neural network (SHM) with
training data is used. For example, if using 300 datapoints, the av- 8 artificial neurons (same as in the ASNN). The reduced product
erage error can be reduced to 0.2%. In view of the small AARD, it is flows l ∗ and v∗ were calculated with the rigorous mass balances
reasonable to consider the proposed ASNN as a suitable surrogate while the molar compositions (x and y) were calculated by nor-

Table 1
AARD (%) between the neural networks and the extra validation dataset when using different noise levels in the training dataset.

              NNP-based hybrid models                    Serial-based hybrid models
Variable    ASNN, ±0 %   ASNN, ±5 %   ASNN, ±20 %       SHM, ±5 %   SHM, ±20 %
l*              1.4          3.1          5.0               4.4         8.6
v*              0.5          1.7          4.0               2.0         6.7
x               1.1          2.9          4.6               4.1         7.5
y               0.4          1.6          3.8               1.9         6.3
Average         0.9          2.8          4.3               3.1         7.3
Fig. 12. Additional processes that can be modelled by utilizing the ASNN architecture shown in Fig. 11. (a) Stripper column, and (b) Biogas upgrading process.

The NNP-based models outperform the prediction capabilities of the coupled hybrid models. In cases where the noise is high (±20 %), the AARD of the ASNNs can be 40 to 50% lower than that of the serial hybrid models. We consider that there are two causes of the superior performance of the ASNNs over the serial hybrid models. The first cause is the ASNN architecture, which helps the optimization process. Secondly, the numerical differentiation error negatively affects the optimization step of the serial hybrid model.

It is worth mentioning that the proposed ASNN structure also works for a P–Θ flash problem. The only difference would be the input layer I1, where T should be substituted by P. On the other hand, the solution of a PT flash would require a different ASNN architecture. Due to the characteristics of PT flash problems, it has been suggested in the literature (Poort et al., 2019) to use regression and classifier neural networks together to solve a PT flash problem. The authors reported that over 25 million datapoints were used to train the model for a binary mixture. This points out that using the NNP approach for a PT flash will require designing a possibly more complex architecture.

The generality and easy adaptability of ASNN structures to other processes is a practical advantage. In particular, the hidden layers with fixed parameters (layers L2 to L6) will be the same as in Fig. 11. This means that, as long as the modelled system has a feed stream and two product streams, the ASNN structure can be utilized independently of the modelled process. For example, the same ASNN architecture was utilized for modelling the processes shown in Fig. 12 (a minimal sketch of this reuse is given after the list below). The ASNN architectures for modelling the flash separator and the stripping process (Fig. 12(a)) are quite similar, since the only differences are the components involved (the rich amine stream contains CO2, MEA and H2O) and the input variables to layer I1 (molar fractions (z), temperature in the reboiler (TR), feed temperature (TF), and the bottoms to feed ratio (Θ)).

The process shown in Fig. 12(b) has two inlet components in the feed (the raw biogas contains CH4 and CO2) and 6 process parameters, which are the reboiler temperature (TR), reboiler pressure (PR), solvent flow (S), solvent temperature (TS), absorber pressure (PA) and the biomethane to raw biogas ratio (Θ). Therefore, layer I1 has 8 input parameters and layers L2 to L7 have 2 neurons. The ASNNs were trained by utilizing data generated from models validated in previous works (Carranza-Abaid et al., 2021; Carranza-Abaid and Jakobsen, 2021, 2020). The same ASNN architecture was used for both processes and showed an AARD of 1.2% in both cases (8 artificial neurons were used in L1; the AARD can be lower if more parameters are included).

There are important points that must be considered when applying the two-product separator ASNN architecture to other processes:

• In cases where the process is quite complex or there is more available data, additional layers can be inserted in between or before layer L1 in Fig. 11. Adding more layers should not modify layers L2 to L7 (except for the number of neurons, which must be equal to the number of components).
• As opposed to the flash process, the λi parameters might not have a formal definition in other processes (e.g., distillation or gas stripping). Despite this, the λi parameter indicates that, independently of the process, there is always a relationship between the component compositions of the two products that is independent of the extensive properties of the system.
• If the process is modelled with the ASNN structure but the mass balances do not behave homogeneously, then the proposed ASNN architecture might not be optimal. However, knowing that the system does not behave homogeneously is also valuable modelling information that can be utilized in order to modify the ASNN to account for extensive variables.
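To make the reuse claim above concrete, the sketch below shows how the fixed part of the two-product separator architecture depends only on the number of components. buildFixedWeights is a hypothetical helper written for this illustration; it is not a function from the paper or from any toolbox.

```matlab
% Sketch only: the fixed blocks are fully determined by the number of components,
% so the same constructor serves the flash (ternary), the stripper (CO2/MEA/H2O)
% and the biogas upgrading column (CH4/CO2).
function W = buildFixedWeights(nComp)
    W.identity    = eye(nComp);      % blocks such as W3,2 ... W6,2 in Fig. 11
    W.negIdentity = -eye(nComp);     % blocks such as W5,1 and W6,1
    W.unitVector  = ones(nComp, 1);  % vectors such as W2,1, W3,1 and W4,1
end
```

For example, buildFixedWeights(3) would return the blocks used for the ternary flash and stripper cases, while buildFixedWeights(2) would return the blocks for the CH4/CO2 biogas upgrading case.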

4.2.2. Matrix inversion algorithm

The application of NNP using a sequential algorithm is feasible for relatively simple models. However, for moderate to large processes, solving the conservation equations sequentially is not ideal. To overcome this limitation, this section utilizes a matrix inversion solution algorithm to solve linear equations such as the mass balances. Steps 1 and 2 of the NNP method are the same as in Section 4.2.1.

Step 3. Identification of the uncertain/surrogate sections of the model. The high entropy section (η) of the ASNN is set to substitute the square bracketed term in Eq. (20). Therefore, the reduced vapor molar flows are given by

v* = z ⊙ Θ ⊙ η(z, T, Θ),      (29)

where ⊙ denotes the elementwise (Hadamard) product.
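A tiny numerical sketch of Eq. (29) is given below. The values are arbitrary placeholders; the point is only that the elementwise products force a zero vapor flow for any component that is absent from the feed, regardless of what the universal approximator returns.

```matlab
% Sketch only: Eq. (29) with made-up numbers for a ternary system.
z     = [0.6; 0.4; 0.0];   % the third component is not present in the feed
theta = 0.5;               % product split ratio (example value)
eta   = [0.8; 0.3; 0.9];   % arbitrary output of the universal approximator
vstar = z .* theta .* eta  % third entry is exactly zero by construction
```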

Steps 4–5. Equation decomposition and construction of the ASNN architecture. Applying the matrix-inversion solution algorithm to the construction of an ASNN must be done in two parts. The first part requires organizing the non-linear equations according to a sequential solution algorithm. The second part uses the values computed by the nonlinear sections of the model to solve the linear system of equations given by the mass balances.

To solve the linear system of equations, the user should represent it in matricial form such that

A = M⁻¹ C,      (30)

where A is the solution vector containing the mass/mole flows, M is the characteristic mass balance matrix, and C is the constant vector.

In this example, the constant vector C must be given by the output of the nonlinear section of the model (v*) and the known input mass flow variable (z). Considering this, the mass balance matrix M and its corresponding constant vector C can be expressed as

M = [1 0 0 0 0 0;
     0 1 0 0 0 0;
     0 0 1 0 0 0;
     1 0 0 1 0 0;
     0 1 0 0 1 0;
     0 0 1 0 0 1],      C = [v1*, v2*, v3*, z1, z2, z3]ᵀ.      (31)

The first three columns of M correspond to vi* while the last three columns correspond to li*. Therefore, the first three rows of M ensure that vi* is equal to the output of the nonlinear section of the model, and the last three rows guarantee that fi* = vi* + li* (the decomposed version of Eq. (21)). In order to apply this approach to other processes, it is necessary to update the mass balance matrix M, the constant vector C and the variable vector A in accordance with the process flowsheet and the components involved.
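The following is a minimal Matlab sketch of Eqs. (30)–(31) for the ternary two-product separator. The numerical values of z and of the v* predicted by the nonlinear section are placeholders chosen only to show that a single linear solve recovers all the reduced flows.

```matlab
% Sketch only: assemble M and C and solve A = M^-1 * C for the ternary case.
n = 3;
z     = [0.30; 0.50; 0.20];   % reduced feed flows (known input)
vstar = [0.10; 0.25; 0.05];   % placeholder output of the nonlinear (eta) section, Eq. (29)

M = [eye(n) zeros(n);         % rows 1-3: vi* equals the nonlinear-section output
     eye(n) eye(n)];          % rows 4-6: fi* = vi* + li* (component mass balances)
C = [vstar; z];               % constant vector, Eq. (31)

A = M \ C;                    % Eq. (30); backslash is numerically equivalent to inv(M)*C
lstar = A(n+1:end)            % reduced liquid flows: l* = z - v* = [0.20; 0.25; 0.15]
```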
Once M and C are identified, the construction of the ASNN consists in first placing all the hidden layers corresponding to the nonlinear section of the model (already organized) and then adding 2 layers afterwards. The first extra layer must utilize its weight parameters to construct the constant vector C, and the second extra layer must transcribe Eq. (30). Therefore, the parameter matrix of the second layer must always be equal to M⁻¹.

The ASNN architecture is shown in Fig. 13 and the set of equations is

L1 = tanh(W1,1 I1 + β1)      [p]      (32)

L2 = (W2,1 Θ) ⊙ (W2,2 z) ⊙ (W2,3 L1)   if 0 < L2 < 1      [n]
L2 = 0   if L2 < 0      [n]
L2 = 1   if L2 > 1      [n]      (33)

L3 = (W3,1 z) + (W3,2 L2) = C      [2n]      (34)

L4 = (W4 L3) = M⁻¹ C   if 0 < L4 < 1      [2n]
L4 = 0   if L4 < 0      [2n]
L4 = 1   if L4 > 1      [2n]      (35)

Layer L2 is composed of n = 3 neurons because it predicts v1*, v2* and v3*. L3 and L4 have 6 neurons because they carry the information of both li* and vi*. Fig. 13 and Eq. (33) show a product between the universal approximator predictions (composed of L1 and part of L2) with Θ and z. The advantage of using Eq. (29) over Eq. (20) is that it no longer assumes equilibrium between both phases. Eq. (29) guarantees that, in the absence of a vapor phase, no vapor molar flows different from 0 will be predicted, and that a positive component molar flow cannot be predicted if the component is not present in the feed.

Optional modification to the ASNN: in order to ensure that the predicted v* and l* values lie between 0 and 1, the linear saturation transfer function (satlin in Matlab) was added to L2 and L4, Eqs. (33) and (35).

Step 6. Training the ASNN. In order to transcribe Eq. (29), the weight parameter matrix W2,1 was fixed to be an all-ones vertical vector with n elements and W2,2 was fixed to an identity matrix with n × n elements. In order to construct the C vector, the following weight parameter matrices were defined

W3,1 = [0 0 0;          W3,2 = [1 0 0;
        0 0 0;                  0 1 0;
        0 0 0;                  0 0 1;
        1 0 0;                  0 0 0;
        0 1 0;                  0 0 0;
        0 0 1],                 0 0 0].      (36)

The purpose of W3,1 is to bypass the values of z to the lower part of vector C, while W3,2 transfers the values estimated by layer L2 in order to formulate the upper part of vector C. The parameter matrices in layer L3 are used as mathematical artifices to formulate the C vector needed to solve the linear system of equations.

In order to understand the behavior of the ASNNs as a function of the number of training parameters, several ASNN models were formulated. The results are presented in Table 2. The universal approximator substructure indicates how many hidden layers with a sigmoid function were utilized (Eq. (32)). For example, model #1 has one hidden layer with 8 neurons, while model #7 has 2 layers with sigmoid functions (one with 8 neurons and one with 4 neurons, which means that there is an additional layer before L2). The universal approximator substructures are connected to layer L2 (Eq. (33)). The models with 50 datapoints used the same training database as in Section 4.2.1, while the ones with 200 and 500 datapoints used newly generated databases.

Table 2
Characteristics and results of every ASNN2 architecture formulated.

ID   Universal approximator architecture   No. of datapoints   AARD / %
1    8                                      50                  12
2    16                                     50                  23
3    8                                      500                 2.0
4    16                                     500                 1.2
5    8 × 4                                  50                  10
6    16 × 4                                 50                  6.4
7    16 × 8                                 50                  6.7
8    32 × 8                                 50                  8.2
9    16 × 4                                 500                 1.0
10   16 × 8                                 500                 0.9
11   32 × 8                                 500                 0.8
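To tie Eqs. (32)–(36) together, the following is a minimal Matlab sketch of a single forward pass through the architecture of Fig. 13. The trainable parameters W1,1, β1 and W2,3 are filled with random placeholders, the input values are arbitrary, and satlin is redefined locally so the sketch does not depend on the Deep Learning Toolbox.

```matlab
% Sketch only: one forward pass of the matrix-inversion ASNN (ternary case, n = 3).
n = 3; p = 8;
satlin = @(x) min(max(x, 0), 1);                  % linear saturation, as in Eqs. (33) and (35)

z     = [0.30; 0.50; 0.20];                       % feed composition (example values)
theta = 0.40;                                     % product split ratio (example value)
T     = 0.10;                                     % temperature, assumed already normalized
I1    = [z; T; theta];                            % inputs to the universal approximator (assumed normalized)

W11 = 0.1*randn(p, numel(I1)); b1 = 0.1*randn(p, 1); W23 = 0.1*randn(n, p);  % trainable placeholders
W21 = ones(n, 1); W22 = eye(n);                   % fixed, Step 6
W31 = [zeros(n); eye(n)]; W32 = [eye(n); zeros(n)];  % fixed, Eq. (36)
M   = [eye(n) zeros(n); eye(n) eye(n)];           % mass balance matrix, Eq. (31)

L1 = tanh(W11*I1 + b1);                           % Eq. (32)
L2 = satlin((W21*theta) .* (W22*z) .* (W23*L1));  % Eq. (33): estimate of v*
L3 = W31*z + W32*L2;                              % Eq. (34): constant vector C = [v*; z]
L4 = satlin(M \ L3);                              % Eq. (35): A = M^-1 * C = [v*; l*]
vstar = L4(1:n), lstar = L4(n+1:end)
```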

Fig. 13. ASNN architecture of the two-product separator using the matrix inversion solution algorithm.

Each model was trained 20 times and the model with the lowest AARD was selected. The weights of the datapoints were calculated with (1/l*)² and (1/v*)² in order to optimize for the AARD instead of the mean square error. Note that the input layer I1 that feeds the universal approximator substructure uses normalized inputs, while input layers I2 and I3 do not. The models were trained using the Bayesian stochastic optimization algorithm available in the Deep Learning Toolbox of Matlab 2020b.

The AARD presented in Table 2 was calculated with the same extra validation dataset used in Section 4.2.1. Comparing the AARD (models 1 and 2 in Table 2) against the results of Table 1 shows that the ASNN1 (Fig. 11) has, in general, better prediction capabilities than the ASNN2 (Fig. 13). This is understandable since the ASNN1 architecture carries more information regarding the relationship between the input and output variables than the ASNN2. This reveals a clear tradeoff between the ASNN complexity and its prediction capabilities. The more physics information is provided to the ASNN architecture, the less data and fitting parameters will be needed; thus, the model will have a lower entropy. In fact, adding more parameters without including more datapoints is detrimental to the model performance, since it overfits the model to the training data (compare models #1 and #2 in Table 2). One can notice that, in order to obtain a similar AARD between the ASNN1 and ASNN2, the number of datapoints must be increased roughly tenfold (from 50 to 500) and the number of neurons in the hidden layer must be doubled (from 8 to 16).
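For reference, the sketch below spells out the AARD metric reported in Tables 1 and 2; the two arrays are arbitrary example values standing in for the model predictions and the extra validation dataset.

```matlab
% Sketch only: average absolute relative deviation (AARD), in percent.
yref  = [0.20 0.45 0.35; 0.10 0.60 0.30];   % reference (validation) values
ypred = [0.21 0.44 0.36; 0.09 0.62 0.29];   % model predictions
AARD = 100*mean(abs(ypred(:) - yref(:)) ./ abs(yref(:)))
```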
The type of universal approximator substructure can be different from a shallow neural network. In fact, Table 2 shows that the model AARD can be reduced by 50% if a deep neural network substructure with 2 hidden layers is utilized. Moreover, better prediction capabilities than those of the ASNN1 model can be achieved if more datapoints and a deep neural network substructure are utilized.

It should be remarked that applying NNP with exact first principles representations for every piece of equipment in a large process can be a time-consuming task; therefore, in many cases a compromise between accuracy and practicality will emerge. For example, in a process with multiple unit operations and subprocesses and a large amount of data, it is more practical to apply the inverse matrix approach, since the mass balances will be solved in a single step instead of multiple layers. On the other hand, for systems with a limited amount of measured data, or where keeping certain mathematical relationships is paramount to maintain the cohesion between the data, one should explicitly represent all the first-principles relationships.

5. Conclusions

This work presents the Neural Network Programming (NNP) hybrid modelling paradigm. It integrates a first principles modelling approach with a data-driven algorithm. The models developed with NNP are Algorithmically Structured Neural Networks (ASNNs). The idea is to generate a first principles model, decompose it and transcribe the mathematical equations into an ASNN. This allows the ASNN to be physically coherent over the entire solution space, including limit cases (e.g., ASNNs do not predict a positive molar flow of a component if it is not present in the mixture). Due to the features of ASNNs, there is no need to utilize sophisticated performance functions in order to account for physics constraints, and the hyperparameter tuning is not as critical as in other data-driven modelling techniques. Since the first principles equations are built into the ASNN, the data is more efficiently utilized for fitting process parameters rather than for rediscovering physics concepts that must hold. This causes the ASNNs to exhibit superior performance compared with black-box and conventional serial hybrid model configurations.

Three examples of how to model chemical engineering problems with the NNP method are presented in this work. The first example consisted in the rigorous representation of the physics equations and relationships in the same way as is done in a mechanistic model (Section 4.1). Being able to transcribe physics laws within an ASNN is of utmost importance in subfields where the models must comply with a large set of requirements in order to be deemed "correct" (e.g., property modelling or thermodynamics). The second example (Section 4.2.1) shows a more relaxed representation of the first principles equations. It is highlighted that, through a structural analysis of the first principles model, it is possible to lump several parameters onto a single parameter without disrupting the reliability of the model. The third example (Section 4.2.2) presents an alternative approach for performing mass balances, where the parameter matrix of a hidden layer is used to solve mass balance systems. This approach is of fundamental importance in the application of NNP to large processes, since it can be used to model processes with several process streams.

There is a compromise between the complexity of the ASNN architectures and the amount of physics knowledge embedded into them. Therefore, a careful assessment of the assumptions and of the information that is to be predicted with the model must be made before constructing the ASNN. Performing a structural analysis allows the user to select the parts of the model that can be substituted with a universal approximator substructure and to decide how to connect them with the rest of the model. As shown in the two-product separator example, a structural analysis allows ASNNs to be effectively utilized for the formulation of surrogate models by removing the iterative loops that are commonly seen in mechanistic models.

An interesting feature of ASNNs is their transferability between processes with akin characteristics. For example, this work presented the ASNN architecture needed to model a flash separator, which was later applied to model biogas upgrading processes without substantial modifications to the original ASNN architecture. In contrast to the typical hybrid configurations, NNP relies on a single auto-differentiable model that complies with the physics laws, rather than on independently utilizing a first principles model and an artificial neural network model. This allows more accurate and faster training because no numerical differentiation techniques are utilized. Moreover, and as opposed to mechanistic models, the NNP framework directs the user to develop models in matricial form, which means that multiple simulations can be evaluated with a single call to the model function (i.e., instead of utilizing for or while loops). We expect that these features will be exploited even further in the coming decades due to the current advances in quantum computing.
advances in quantum computing. and Machine Learning. Springer Science+Business Media LLC.
Bogusch, R., Lohmann, B., Marquardt, W., 2001. Computer-aided process model-
Interpreting the parameters of a black-box ANN might be an ing with ModKit. Comput. Chem. Eng. 25, 963–995. doi:10.1016/S0098-1354(01)
overcomplicated task due to the high variability of the optimized 00626-3.
parameters. In other words, it is unlikely to find a physical mean- Bollas, G.M., Papadokonstadakis, S., Michalopoulos, J., Arampatzis, G., Lappas, A.A.,
Vasalos, I.A., Lygeros, A., 2003. Using hybrid neural networks in scaling up an
ing of parameters whose numerical values are a consequence of FCC model from a pilot plant to an industrial unit. Chem. Eng. Process. Process
the entropy associated to the ANN training. Hence, it is more ef- Intensif. 42, 697–713. doi:10.1016/S0255-2701(02)00206-4.
fective to assume an ANN architecture based on a priori knowledge Book, N.L., Ramirez, W.F., 1984. Structural analysis and solution of systems of alge-
braic design equations. AIChE J. 30, 609–622. doi:10.10 02/aic.69030 0412.
and find patterns in the optimized parameter distributions rather Carranza-Abaíd, A., González-García, R., 2020. A Petlyuk distillation column dynamic
than trying to interpret generic ANN architectures. If the architec- analysis: Hysteresis and bifurcations. Chem. Eng. Process. Process Intensif. 149.
ture of an ANN has cohesion with the physics phenomenon de- doi:10.1016/j.cep.2020.107843.
Carranza-Abaid, A., Jakobsen, J.P., 2021. A computationally efficient formulation of
scription, the parameter entropy will be lower and therefore, there
the governing equations for unit operation design. Comput. Chem. Eng. 154,
is high likelihood that the model will provide repeatable param- 107500. doi:10.1016/j.compchemeng.2021.107500.
eters. NNP can be utilized to discard incorrect assumptions about Carranza-Abaid, A., Jakobsen, J.P., 2020. A non-autonomous relativistic frame of
reference for unit operation design. In: Computer Aided Chemical Engineer-
the physics phenomena.
ing. Comput. Aided Chem. Eng., pp. 151–156. doi:10.1016/B978- 0- 12- 823377-1.
Further work of NNP includes the application of this method for 50026-4.
the development of consistent thermodynamic and transport prop- Carranza-Abaid, A., Wanderley, R.R., Knuutila, H.K., Jakobsen, J.P., 2021. Analysis and
erty models. Additionally, we hypothesize that NNP can be utilized selection of optimal solvent-based technologies for biogas upgrading. Fuel 303,
121327. doi:10.1016/j.fuel.2021.121327.
to develop NNP-based PINNs in order to guarantee the exact exe- Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., Elhadad, N., 2015. Intelligible
cution of first principles equations. models for healthcare: Predicting pneumonia risk and hospital 30-day read-
mission. In: Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. 2015-Augus,
pp. 1721–1730. doi:10.1145/2783258.2788613.
Declaration of Competing Interest Cellier, F.E., Elmqvist, H., 1993. Automated formula manipulation supports object-
oriented continuous-system modeling. IEEE Control Syst. 13, 28–38. doi:10.1109/
The authors declare that they have no known competing finan- 37.206983.
Chen, T., Guestrin, C., 2016. XGBoost. In: Proceedings of the 22nd ACM SIGKDD In-
cial interests or personal relationships that could have appeared to ternational Conference on Knowledge Discovery and Data Mining. ACM, New
influence the work reported in this paper. York, NY, USA, pp. 785–794. doi:10.1145/2939672.2939785.
Cybenko, G., 1989. Approximation by superpositions of a sigmoidal function. Math.
Control Signals Syst. 2, 303–314. doi:10.1007/BF02551274.
CRediT authorship contribution statement Daw, A., Karpatne, A., Watkins, W., Read, J., Kumar, V., 2017. Physics-guided Neural
Networks (PGNN): An Application in Lake Temperature Modeling.
Andres Carranza-Abaid: Conceptualization, Methodology, Soft- Destro, F., Facco, P., García Muñoz, S., Bezzo, F., Barolo, M., 2020. A hybrid frame-
work for process monitoring: enhancing data-driven methodologies with state
ware, Validation, Formal analysis, Validation, Data curation, Writ-
and parameter estimation. J. Process Control 92, 333–351. doi:10.1016/j.jprocont.
ing – original draft, Writing – review & editing, Visualization. 2020.06.002.
Jana P. Jakobsen: Resources, Supervision, Writing – review & edit- Dhooge, A., Govaerts, W., Kuznetsov, Y.A., 2003. MATCONT: a MATLAB package for
numerical bifurcation analysis of ODEs. ACM Trans. Math. Softw. 29, 141–164.
ing, Project administration, Funding acquisition.
doi:10.1145/779359.779362.
Duarte, B.P.M., Saraiva, P.M., 2003. Hybrid models combining mechanistic models
Acknowledgments with adaptive regression splines and local stepwise regression. Ind. Eng. Chem.
Res. 42, 99–107. doi:10.1021/ie0107744.
Eason, J., Cremaschi, S., 2014. Adaptive sequential sampling for surrogate model
We want to thank Tore Haug-Warberg for the fruitful and in- generation with artificial neural networks. Comput. Chem. Eng. 68, 220–232.
sightful discussions in the development of this work. Additionally, doi:10.1016/j.compchemeng.2014.05.021.

Ebrahimpour, M., Yu, W., Young, B., 2021. Artificial neural network modelling for cream cheese fermentation pH prediction at lab and industrial scales. Food Bioprod. Process. 126, 81–89. doi:10.1016/j.fbp.2020.12.006.
Elmqvist, H., 1978. A Structured Model Language for Large Continuous Systems. Lund Institute of Technology (LTH).
Faramarzi, L., Kontogeorgis, G.M., Michelsen, M.L., Thomsen, K., Stenby, E.H., 2010. Absorber model for CO2 capture by monoethanolamine, 3751–3759. doi:10.1021/ie901671f.
Fedorova, M., Sin, G., Gani, R., 2015. Computer-aided modelling template: concept and application. Comput. Chem. Eng. 83, 232–247. doi:10.1016/j.compchemeng.2015.02.010.
Funahashi, K.I., 1989. On the approximate realization of continuous mappings by neural networks. Neural Netw. 2, 183–192. doi:10.1016/0893-6080(89)90003-8.
Gao, F., Huang, T., Wang, J., Sun, J., Yang, E., Hussain, A., 2018. Combining deep convolutional neural network and SVM to SAR image target recognition. In: Proc. 2017 IEEE Int. Conf. Internet of Things, IEEE Green Comput. Commun., IEEE Cyber, Phys. Soc. Comput., IEEE Smart Data (iThings-GreenCom-CPSCom-SmartData 2017), pp. 1082–1085. doi:10.1109/iThings-GreenCom-CPSCom-SmartData.2017.165.
Georgieva, P., Meireles, M.J., Feyo de Azevedo, S., 2003. Knowledge-based hybrid modelling of a batch crystallisation when accounting for nucleation, growth and agglomeration phenomena. Chem. Eng. Sci. 58, 3699–3713. doi:10.1016/S0009-2509(03)00260-4.
Güneş Baydin, A., Pearlmutter, B.A., Andreyevich Radul, A., Mark Siskind, J., 2018. Automatic differentiation in machine learning: a survey. J. Mach. Learn. Res. 18, 1–43.
Hagan, M.T., Menhaj, M.B., 1994. Training feedforward networks with the Marquardt algorithm. IEEE Trans. Neural Netw. 5, 989–993. doi:10.1006/brcg.1996.0066.
Heitzig, M., Linninger, A.A., Sin, G., Gani, R., 2014. A computer-aided framework for development, identification and management of physiologically-based pharmacokinetic models. Comput. Chem. Eng. 71, 677–698. doi:10.1016/j.compchemeng.2014.07.016.
Hornik, K., 1991. Approximation capabilities of multilayer neural network. Neural Netw. 4, 251–257.
Hornik, K., Stinchcombe, M., White, H., 1989. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366. doi:10.1016/0893-6080(89)90020-8.
Huang, H.L., Wu, D., Fan, D., Zhu, X., 2020. Superconducting quantum computing: a review. Sci. China Inf. Sci. 63, 1–32. doi:10.1007/s11432-020-2881-9.
Jain, B.J., Geibel, P., Wysotzki, F., 2005. SVM learning with the Schur-Hadamard inner product for graphs. Neurocomputing 64, 93–105. doi:10.1016/j.neucom.2004.11.011.
Jaynes, E.T., 1957. Information theory and statistical mechanics. Phys. Rev. 106, 620–630. doi:10.1103/PhysRev.106.620.
Kahrs, O., Marquardt, W., 2008. Incremental identification of hybrid process models. Comput. Chem. Eng. 32, 694–705. doi:10.1016/j.compchemeng.2007.02.014.
Åström, K.J., Elmqvist, H., Mattsson, S.E., 1998. Evolution of continuous-time modeling and simulation. ESM, 1–10.
Karniadakis, G.E., Kevrekidis, I.G., Lu, L., Perdikaris, P., Wang, S., Yang, L., 2021. Physics-informed machine learning. Nat. Rev. Phys. 3, 422–440. doi:10.1038/s42254-021-00314-5.
Karpatne, A., Atluri, G., Faghmous, J.H., Steinbach, M., Banerjee, A., Ganguly, A., Shekhar, S., Samatova, N., Kumar, V., 2017. Theory-guided data science: a new paradigm for scientific discovery from data. IEEE Trans. Knowl. Data Eng. 29, 2318–2331. doi:10.1109/TKDE.2017.2720168.
Kuntsche, S., Barz, T., Kraus, R., Arellano-Garcia, H., Wozny, G., 2011. MOSAIC, a web-based modeling environment for code generation. Comput. Chem. Eng. 35, 2257–2273. doi:10.1016/j.compchemeng.2011.03.022.
Leal, J.R., Romanenko, A., Santos, L.O., 2017. Daedalus modeling framework: building first-principle dynamic models. Ind. Eng. Chem. Res. 56, 3332–3346. doi:10.1021/acs.iecr.6b03110.
Letham, B., Rudin, C., McCormick, T.H., Madigan, D., 2015. Interpretable classifiers using rules and Bayesian analysis: building a better stroke prediction model. Ann. Appl. Stat. 9, 1350–1371. doi:10.1214/15-AOAS848.
Linninger, A.A., Krendl, H., 1999. TechTool - computer-aided generation of process models (part 1 - a generic mathematical language). Comput. Chem. Eng. 23, S703–S706. doi:10.1016/S0098-1354(99)80172-0.
MacKay, D.J.C., 1992. Bayesian interpolation. Neural Comput. 4, 415–447. doi:10.1162/neco.1992.4.3.415.
Manevitz, L., Yousef, M., 2007. One-class document classification via neural networks. Neurocomputing 70, 1466–1481. doi:10.1016/j.neucom.2006.05.013.
Marquardt, D.W., 1963. An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Ind. Appl. Math. 11, 431–441. doi:10.1137/0111030.
McCulloch, W.S., Pitts, W., 1943. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 113–133.
Meng, Y., Yu, S., Zhang, J., Qin, J., Dong, Z., Lu, G., Pang, H., 2019. Hybrid modeling based on mechanistic and data-driven approaches for cane sugar crystallization. J. Food Eng. 257, 44–55. doi:10.1016/j.jfoodeng.2019.03.026.
Morbach, J., Yang, A., Marquardt, W., 2007. OntoCAPE - a large-scale ontology for chemical process engineering. Eng. Appl. Artif. Intell. 20, 147–161. doi:10.1016/j.engappai.2006.06.010.
Nagy, Z.K., Szilagyi, B., Pal, K., Tabar, I.B., 2020. A novel robust digital design of a network of industrial continuous cooling crystallizers of dextrose monohydrate: from laboratory experiments to industrial application. Ind. Eng. Chem. Res. 59, 22231–22246. doi:10.1021/acs.iecr.0c04870.
Nikolić, D.D., 2016. DAE Tools: equation-based object-oriented modelling, simulation and optimisation software. PeerJ Comput. Sci. doi:10.7717/peerj-cs.54.
Nuchitprasittichai, A., Cremaschi, S., 2013. Optimization of CO2 capture process with aqueous amines - a comparison of two simulation-optimization approaches. Ind. Eng. Chem. Res. 52, 10236–10243. doi:10.1021/ie3029366.
Oliveira, R., 2004. Combining first principles modelling and artificial neural networks: a general framework. Comput. Chem. Eng. 28, 755–766. doi:10.1016/j.compchemeng.2004.02.014.
Peres, J., Oliveira, R., Feyo de Azevedo, S., 2000. Knowledge based modular networks for process modelling and control. Comput. Aided Chem. Eng. 8, 247–252. doi:10.1016/S1570-7946(00)80043-7.
Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A., 2018. FiLM: visual reasoning with a general conditioning layer. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3942–3951.
Piela, P.C., Epperly, T.G., Westerberg, K.M., Westerberg, A.W., 1991. ASCEND: an object-oriented computer environment for modeling and analysis: the modeling language. Comput. Chem. Eng. 15, 53–72. doi:10.1016/0098-1354(91)87006-U.
Poort, J.P., Ramdin, M., van Kranendonk, J., Vlugt, T.J.H., 2019. Solving vapor-liquid flash problems using artificial neural networks. Fluid Phase Equilib. 490, 39–47. doi:10.1016/j.fluid.2019.02.023.
Preisig, H.A., 2015. Constructing an ontology for physical-chemical processes. Comput. Aided Chem. Eng. 37, 1001–1006. doi:10.1016/B978-0-444-63577-8.50012-7.
Psichogios, D.C., Ungar, L.H., 1992. A hybrid neural network-first principles approach to process modeling. AIChE J. 38, 1499–1511. doi:10.1002/aic.690381003.
Raissi, M., Perdikaris, P., Karniadakis, G.E., 2019. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707. doi:10.1016/j.jcp.2018.10.045.
Ramirez, W.F., 1997. Computational Methods for Process Simulation. Butterworth-Heinemann. doi:10.1016/B978-0-7506-1198-5.50126-0.
Renon, H., Prausnitz, J.M., 1968. Local compositions in thermodynamic excess functions for liquid mixtures. AIChE J. 14, 135–144.
Ribeiro, M.T., Singh, S., Guestrin, C., 2016. Model-agnostic interpretability of machine learning.
Rosenblatt, F., 1962. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.
Sansana, J., Joswiak, M.N., Castillo, I., Wang, Z., Rendall, R., Chiang, L.H., Reis, M.S., 2021a. Recent trends on hybrid modeling for Industry 4.0. Comput. Chem. Eng. 151, 107365. doi:10.1016/j.compchemeng.2021.107365.
Sansana, J., Joswiak, M.N., Castillo, I., Wang, Z., Rendall, R., Chiang, L.H., Reis, M.S., 2021b. Recent trends on hybrid modeling for Industry 4.0. Comput. Chem. Eng. 151, 107365. doi:10.1016/j.compchemeng.2021.107365.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D., 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489. doi:10.1038/nature16961.
Soares, R. de P., Secchi, A.R., 2003. EMSO: a new environment for modelling, simulation and optimisation. Comput. Aided Chem. Eng., 947–952. doi:10.1016/S1570-7946(03)80239-0.
Sohlberg, B., Jacobsen, E.W., 2008. Grey box modelling – branches and experiences. In: IFAC Proc. Vol. IFAC. doi:10.3182/20080706-5-kr-1001.01934.
Stephanopoulos, G., Henning, G., Leone, H., 1990. MODEL.LA. A modeling language for process engineering - I. The formal framework. Comput. Chem. Eng. 14, 813–846. doi:10.1016/0098-1354(90)87040-V.
Su, H.T., Bhat, N., Minderman, P.A., McAvoy, T.J., 1992. Integrating neural networks with first principles models for dynamic modeling. IFAC Proc. Vol. 25, 327–332. doi:10.1016/s1474-6670(17)51013-7.
Tan, K.C., Li, Y., 2002. Grey-box model identification via evolutionary computing. Control Eng. Pract. 10, 673–684. doi:10.1016/S0967-0661(02)00031-X.
Thompson, M.L., Kramer, M.A., 1994. Modeling chemical processes using prior knowledge and neural networks. AIChE J. 40, 1328–1340. doi:10.1002/aic.690400806.
Tian, Y., Zhang, J., Morris, J., 2001. Modeling and optimal control of a batch polymerization reactor using a hybrid stacked recurrent neural network model. Ind. Eng. Chem. Res. 40, 4525–4535. doi:10.1021/ie0010565.
Torrisi, M., Pollastri, G., Le, Q., 2020. Deep learning methods in protein structure prediction. Comput. Struct. Biotechnol. J. 18, 1301–1310. doi:10.1016/j.csbj.2019.12.011.
Tsen, A.Y.D., Jang, S.S., Wong, D.S.H., Joseph, B., 1996. Predictive control of quality in batch polymerization using hybrid ANN models. AIChE J. 42, 455–465. doi:10.1002/aic.690420215.
Tulleken, H.J.A.F., 1993. Grey-box modelling and identification using physical knowledge and Bayesian techniques. Automatica 29, 285–308. doi:10.1016/0005-1098(93)90124-C.
Van Lith, P.F., Betlem, B.H.L., Roffel, B., 2003. Combining prior knowledge with data driven modeling of a batch distillation column including start-up. Comput. Chem. Eng. 27, 1021–1030. doi:10.1016/S0098-1354(03)00067-X.
Venkatasubramanian, V., 2019. The promise of artificial intelligence in chemical engineering: is it here, finally? AIChE J. 65, 466–478. doi:10.1002/aic.16489.
Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J.P., Jaderberg, M., Vezhnevets, A.S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T.L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C., Silver, D., 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354. doi:10.1038/s41586-019-1724-z.

Wang, F., Rudin, C., 2015. Falling rule lists. J. Mach. Learn. Res. 38, 1013–1022.
Wang, X., Chen, J., Liu, C., Pan, F., 2010. Hybrid modeling of penicillin fermentation process based on least square support vector machine. Chem. Eng. Res. Des. 88, 415–420. doi:10.1016/j.cherd.2009.08.010.
Westerweele, M.R., Laurens, J., 2008. Mobatec Modeller - a flexible and transparent tool for building dynamic process models. Comput. Aided Chem. Eng. 25, 1045–1050. doi:10.1016/S1570-7946(08)80180-0.
Wilson, G.M., 1964. Vapor-liquid equilibrium. XI. A new expression for the excess free energy of mixing. J. Am. Chem. Soc. 86, 127–130. doi:10.1021/ja01056a002.
Xiong, Q., Jutan, A., 2002. Grey-box modelling and control of chemical processes. Chem. Eng. Sci. 57, 1027–1039. doi:10.1016/S0009-2509(01)00439-0.
Xu, Y., Verma, D., Sheridan, R.P., Liaw, A., Ma, J., Marshall, N.M., McIntosh, J., Sherer, E.C., Svetnik, V., Johnston, J.M., 2020. Deep dive into machine learning models for protein engineering. J. Chem. Inf. Model. 60, 2773–2790. doi:10.1021/acs.jcim.0c00073.
Zorzetto, L.F.M., Maciel Filho, R., Wolf-Maciel, M.R., 2000. Process modelling development through artificial neural networks and hybrid models. Comput. Chem. Eng. 24, 1355–1360. doi:10.1016/S0098-1354(00)00419-1.
