Journal Pre-Proofs: Chemical Engineering Science
PII: S0009-2509(21)00838-1
DOI: https://doi.org/10.1016/j.ces.2021.117273
Reference: CES 117273
Please cite this article as: H. Sildir, E. Aydin, A Mixed-Integer Linear Programming based Training and Feature
Selection Method for Artificial Neural Networks using Piece-wise Linear Approximations, Chemical
Engineering Science (2021), doi: https://doi.org/10.1016/j.ces.2021.117273
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover
page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version
will undergo additional copyediting, typesetting and review before it is published in its final form, but we are
providing this version to give early visibility of the article. Please note that, during the production process, errors
may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
d Koc University TUPRAS Energy Center (KUTEM), Koc University, Sariyer, Istanbul, 34450, Turkey
Highlights
Abstract
Artificial Neural Networks (ANNs) may suffer from suboptimal training and test performance not only because of the presence of a high number of features with low statistical contributions but also due to their non-convex nature. This study develops novel reformulations of the activation and objective functions in artificial neural networks for optimal, global and simultaneous training and feature selection in regression problems. Such formulations include binary variables to account for the existence of the features and piecewise-linear approximations, which in turn, after one exact linearization step, call for solving a mixed-integer linear programming (MILP) problem. The proposed formulation is implemented on two industrial case studies. Results show that efficient approximations are obtained through the usage of the method with only a few breakpoints.
Keywords
machine learning
feature selection
mixed-integer programming
1 Introduction
The feed-forward artificial neural network is a special type of ANN among many (Rosa et al., 2020). In this architecture, information propagates only in the forward direction, and our study adopts such a formulation throughout this paper. These ANNs include a single hidden layer in addition to the input and output layers, all of which might be activated using different activation functions. The flexibility and representative capability of ANNs are highly dependent on the inputs, the number of hidden layer neurons (Vujičić and Matijević, 2016), the selection of activation functions (Sibi et al., 2013), the training algorithm (Plumb et al., 2005), and data pre-processing (Nawi et al., 2013). Such a high number of hyper-parameters may render training cumbersome, which in turn usually leads to test performance issues. In parallel, an extensive amount of manual trial-and-error is required until a satisfactory network architecture is obtained. Furthermore, since training feed-forward and fully connected ANNs calls for the solution of highly non-convex optimization problems, the results of the training problems might represent local suboptimal solutions. In addition, an ANN also exhibits non-convex behavior with respect to the features, besides the activation function. Thus, a rigorous method addressing the aforementioned issues is required, yet theoretically challenging to obtain.
One possible remedy for suboptimal training is to use global solution methods and solvers for training. However, one vital drawback of global non-convex optimization methods is the required computational power and time. In their pioneering work, Schweidtmann and Mitsos introduced relaxation and reduction methods for global optimization problems with embedded trained, non-convex artificial neural networks (Schweidtmann and Mitsos, 2019).
Moreover, it is also emphasized in the literature that there exist only limited cases where the global minimum can be ensured with such non-smooth objective functions, owing to non-linear interactions among network variables and a high number of suboptimal solutions (Swirszcz et al., 2016). Such suboptimal solutions usually hinder the utilization of existing hidden layer neurons. In (Atakulreka and Sutivong, 2007), a removal criterion is introduced to favor more promising ANN candidates (Wang et al., 2004). However, the non-convex nature of the problem is still a major issue in training and further design considerations, including feature selection (Sildir et al., 2020).
Piecewise approximations have been widely used in the literature in the search for the global optimization of nonlinear functions, including ANN related problems (Storace and De Feo, 2004; Yang et al., 2016). Such an approach converts the traditional and non-convex nonlinear programming (NLP) problems into mixed integer programming problems (Polisetty and Gatzke, 2005; Vielma, 2015). Wen and Ma used piecewise
linear activation functions for ANNs in the hidden layer and obtained a balance
between the computational simplicity and accuracy ( Wen and Ma, 2008 ). Rister and
Rubin (Rister and Rubin, 2017) applied piecewise linearization to the nonlinear functions within the ANN architecture to obtain piecewise convexity and multi-convexity in each layer. They also show that, under squared error objective functions, such biconvex and piece-wise affine formulations may result in poor local optimal solutions. Recently, a similar approach was proposed in (Bunel et al., 2019), where big-M formulations are included for the verification problem using trained ANNs with ReLU (Rectified Linear Unit) activation functions. On the other hand, please note that the aforementioned formulations are limited to the ReLU activation function, which is piecewise linear by nature with a non-differentiable kink at the origin. Furthermore, note that relying on such a piecewise linear activation function might contribute to poor training performance and also introduces non-smoothness to the training problem. Therefore, using and approximating highly nonlinear activation functions, e.g. the hyperbolic tangent, in a piece-wise linear fashion would be more beneficial for many nonlinear processes in chemical engineering related problems.
Piece-wise linear formulations require the definition of breakpoints for each piece-wise linear segment. Afterwards, binary and auxiliary variables are introduced into the formulation in order to enforce the selection of the proper segment. This way, proper calculation of the convex (linear) combinations among the adjacent segments can be realized during the optimization iterations. Accordingly, the original non-convex NLP problem is approximated by a mixed integer linear programming (MILP) problem, whose optimal solution, if it exists, is unique due to its convexity (D'Ambrosio et al., 2010). The associated issue with this method is the discontinuity of the gradient at the breakpoints.
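To make the convex-combination idea concrete, the short sketch below evaluates a piecewise-linear function at a query point by activating only the two lambda weights of the enclosing segment; it is a hypothetical illustration with placeholder breakpoints, not part of the authors' implementation.

```python
import numpy as np

def pwl_eval(x, bp_x, bp_y):
    """Evaluate a piecewise-linear function given by breakpoints (bp_x, bp_y)
    as a convex combination of the two breakpoints enclosing x."""
    bp_x = np.asarray(bp_x, dtype=float)
    bp_y = np.asarray(bp_y, dtype=float)
    x = float(np.clip(x, bp_x[0], bp_x[-1]))          # stay inside the approximated range
    k = int(np.searchsorted(bp_x, x, side="right")) - 1
    k = min(k, len(bp_x) - 2)                          # the last breakpoint closes the last segment
    lam_right = (x - bp_x[k]) / (bp_x[k + 1] - bp_x[k])
    lam = np.zeros(len(bp_x))                          # lambda weights; only two are non-zero
    lam[k], lam[k + 1] = 1.0 - lam_right, lam_right
    return float(lam @ bp_y)

# Illustrative breakpoint coordinates only (not the paper's exact values)
print(pwl_eval(0.5, [-8, -2, -1, 1, 2, 8], [-1.0, -0.96, -0.76, 0.76, 0.96, 1.0]))
```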
Binary variables have also been used for the architectural design of ANNs, besides the piecewise representation of the activation functions. Dua (Dua, 2010) presented a mixed-integer formulation to account for the existence of connections and hidden layer neurons in the objective function. Although the selection of the features is not implemented explicitly, some solutions might yield the elimination of particular features once all of their connections to the hidden layer are removed by the algorithm. In (Matias et al., 2014), a binary variable is introduced to account for the existence of a particular feature, enabling simultaneous training and feature selection. Similarly, the authors recently proposed a promising way for optimal feature selection and regularization for ANNs using mixed-integer nonlinear programming (Sildir et al., 2021, 2020). However, one bottleneck of the proposed algorithm is its non-convex nature.
The present study addresses this issue and aims at more reliable data handling (Sattari et al., 2020) through simultaneous feature selection and training.
The standard single-hidden-layer feed-forward ANN computes the predicted output as

$$\hat{y}_i = f_1\!\left(w_1^T f_2\!\left(w_2^T u_i + b_1\right) + b_2\right),$$

where $u_i$ and $y_i$ are the $i$th input and output data samples, respectively, and $N$ is the number of samples used for training. The dimensions of the weights $w_1$, $w_2$ and biases $b_1$, $b_2$ depend on the number of inputs, outputs and neurons; these counts are called hyper-parameters and are usually determined by a trial-and-error procedure.
Statistically irrelevant or unnecessary features can be eliminated from the ANN architecture by introducing a binary selection vector $u_S$ into the ANN calculations,

$$\hat{y}_i = f_1\!\left(w_1^T f_2\!\left(w_2^T (u_i \odot u_S) + b_1\right) + b_2\right), \qquad u_S \in \{0,1\},$$

which enables the selection of the best features in the optimization. When an entry of this decision variable is equal to zero, the contribution of the corresponding input is eliminated; conversely, when it is equal to one, the corresponding feature exists. Here, it should be mentioned that the selection of the best features may be reminiscent of the Lasso formulation. Unlike the Lasso, however, the main idea in this work is to decide on the existence of the connections using binary variables instead of continuous ones, which in turn increases the strength of the formulation.
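As an illustration of the masking operation $u_i \odot u_S$, the hypothetical snippet below performs a single forward pass of a one-hidden-layer network with a binary feature mask; all names, shapes and values are placeholders rather than the authors' data.

```python
import numpy as np

def ann_forward(u, u_s, w2, b1, w1, b2):
    """Single-hidden-layer forward pass with a binary feature mask u_s.
    Features whose mask entry is 0 contribute nothing to the hidden layer."""
    h = np.tanh(w2.T @ (u * u_s) + b1)   # hidden layer, f2 = tanh
    return w1.T @ h + b2                 # output layer, f1 = identity

rng = np.random.default_rng(0)
n_in, n_hid = 5, 3
u   = rng.normal(size=n_in)
u_s = np.array([1, 0, 1, 1, 0])          # features 2 and 5 switched off
w2  = rng.normal(size=(n_in, n_hid))
w1  = rng.normal(size=(n_hid, 1))
b1  = rng.normal(size=n_hid)
b2  = rng.normal(size=1)
print(ann_forward(u, u_s, w2, b1, w1, b2))
```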
Forward computation of the network during the training phase for a particular feature requires the calculation of both continuous weights and bias values. This, unfortunately, might be an issue for classical back-propagation and stochastic gradient descent algorithms. Automatic differentiation algorithms (Güneş Baydin et al., 2018) can be employed for such computations only for a fixed architecture, since the sensitivities are applicable to continuous variables only. Finally, the optimal feature selection can be represented as a non-convex optimization problem containing both continuous and discrete decision variables.
Accordingly, the training and optimal feature selection problem, which is a mixed-integer nonlinear programming (MINLP) problem, is given as follows:
$$\min_{w_1, w_2, b_1, b_2, u_S} \; \sum_{i=1}^{N} \Big( f_1\big(w_1^T f_2(w_2^T (u_i \odot u_S) + b_1) + b_2\big) - y_i \Big)^2 \qquad (2.4)$$

$$\text{s.t.} \quad \sum_{i=1}^{N} \Big( f_1\big(w_1^T f_2(w_2^T (u_i \odot u_S) + b_1) + b_2\big) - y_i \Big)^2 \le MSE_{desired},$$

$$u_S^{min} \le \sum u_S \le u_S^{max},$$

$$u_S \in \{0,1\}.$$
where MSE_desired is the desired mean square error (MSE), and $u_S^{min}$ and $u_S^{max}$ are the minimum and maximum number of features to be selected, respectively.
The objective of Problem (2.4) is to minimize the MSE for training, while a pre-selected value, MSE_desired, is enforced as an upper limit for the error. This limit can be estimated by training a standard ANN on the same data set and scaling the resulting MSE by a factor of, for instance, 1.1. In general, as more features are employed, the MSE is expected to decrease since the number of parameters increases. However, at the same time, the possibility of overfitting increases, especially for lower MSE_desired values in training. Therefore, the formulation in Problem (2.4) exhibits a trade-off between overfitting and training error while computing the best features to be selected throughout the training. This way, the effect of overfitting can be minimized. Consequently, the trained ANN with the selected features might result in a higher training error than the ANN with all features, while obtaining a lower test error with reduced overfitting. Finally, please note that Problem (2.4) is equivalent to the standard ANN training problem when all features are selected.
The first step in the proposed PWL-ANN method is to linearize the original nonlinear activation function, the hyperbolic tangent. Secondly, the objective function of the training problem for ANNs, which is formulated as the sum of squared errors, is reformulated as the sum of absolute error values through piece-wise linearization. This is required in order to linearize the non-convex objective function, whose non-convexity is caused by the squaring of terms involving the binary variables associated with the feature selection. Finally, the overall training and feature selection problem can be solved simultaneously together with the PWL approximation equations.
Figure 2.1 illustrates ways of approximating the hyperbolic tangent function with different numbers of piece-wise linear segments (D'Ambrosio et al., 2010). A more detailed review of the required number of pieces for different functions can be found in (Frenzen et al., 2010). Please note that the proposed PWL formulation is generic in nature and can be applied to different activation functions, e.g. sigmoid or ReLU. The piece-wise linear approximation of the hyperbolic tangent function, tanh(x), can be implemented using five segments as follows:
$$\tanh(x) \approx \begin{cases} 0.05x - 0.85 & \text{if } -8 \le x < -2,\\ 0.2x - 0.56 & \text{if } -2 \le x < -1,\\ 0.76x & \text{if } -1 \le x < 1,\\ 0.2x + 0.56 & \text{if } 1 \le x < 2,\\ 0.05x + 0.85 & \text{if } 2 \le x \le 8. \end{cases} \qquad (2.5)$$
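As a quick numerical check of Eq. (2.5), the snippet below evaluates the five-segment approximation and reports its maximum deviation from the exact hyperbolic tangent on [-8, 8]; it is an illustrative sketch rather than the authors' code.

```python
import numpy as np

def tanh_pwl(x):
    """Five-segment piecewise-linear approximation of tanh on [-8, 8], Eq. (2.5)."""
    x = np.clip(x, -8.0, 8.0)
    return np.piecewise(
        x,
        [x < -2, (x >= -2) & (x < -1), (x >= -1) & (x < 1), (x >= 1) & (x < 2), x >= 2],
        [lambda t: 0.05 * t - 0.85,
         lambda t: 0.20 * t - 0.56,
         lambda t: 0.76 * t,
         lambda t: 0.20 * t + 0.56,
         lambda t: 0.05 * t + 0.85],
    )

xs = np.linspace(-8, 8, 401)
print("max abs deviation from tanh:", np.max(np.abs(tanh_pwl(xs) - np.tanh(xs))))
```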
After assigning the output layer activation function, f1, to be the identity, the training and feature selection problem using five piece-wise linear segments can be formulated as follows:
$$\min_{w_1, w_2, b_1, b_2, u_S} \; v = \sum_{i=1}^{N} \Big( \big(w_1^T f_2'(q) + b_2\big) - y_i \Big)^2 \qquad (2.6)$$

$$\text{s.t.} \quad q = w_2^T (u_i \odot u_S) + b_1,$$
$$-8 \le q \le 8,$$
$$-8\lambda_1 - 2\lambda_2 - 1\lambda_3 + 1\lambda_4 + 2\lambda_5 + 8\lambda_6 = q,$$
$$-1.29\lambda_1 - 0.96\lambda_2 - 0.76\lambda_3 + 0.76\lambda_4 + 0.96\lambda_5 + 1.29\lambda_6 = f_2'(q),$$
$$\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 + \lambda_6 = 1,$$
$$\beta_1 + \beta_2 + \beta_3 + \beta_4 + \beta_5 = 1,$$
$$\lambda_1 \le \beta_1, \quad \lambda_2 \le \beta_1 + \beta_2, \quad \lambda_3 \le \beta_2 + \beta_3, \quad \lambda_4 \le \beta_3 + \beta_4, \quad \lambda_5 \le \beta_4 + \beta_5, \quad \lambda_6 \le \beta_5,$$
$$v \le MSE_{desired},$$
$$u_S^{min} \le \sum u_S \le u_S^{max},$$
$$u_S \in \{0,1\},$$
$$\lambda_k \ge 0, \quad k = 1,\ldots,6,$$
$$\beta_m \in \{0,1\}, \quad m = 1,\ldots,5,$$
$$-40 \le w_1, w_2, b_1, b_2 \le 40.$$
where v is the objective function, i.e. the sum of squared errors; q is the input to the approximated hyperbolic tangent function, $f_2'$; $\lambda_k$ are the non-negative auxiliary variables; and $\beta_m$ are the binary variables associated with the linear segments of the approximation, indexed by m. For more detailed information about formulating piece-wise linear approximations for non-convex optimization problems, the reader is referred to (Vielma, 2015).
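For illustration, the sketch below encodes the lambda/beta segment-selection constraints of Problem (2.6) for a single scalar pre-activation q using Pyomo; the authors implemented the full model in GAMS, so the modeling-language choice, the toy objective and all variable names here are assumptions.

```python
import pyomo.environ as pyo

BX = [-8, -2, -1, 1, 2, 8]                       # breakpoint abscissae
BY = [-1.29, -0.96, -0.76, 0.76, 0.96, 1.29]     # breakpoint ordinates from Problem (2.6)

m = pyo.ConcreteModel()
m.K = pyo.RangeSet(0, 5)                         # breakpoints
m.S = pyo.RangeSet(0, 4)                         # segments
m.lam = pyo.Var(m.K, within=pyo.NonNegativeReals)
m.beta = pyo.Var(m.S, within=pyo.Binary)
m.q = pyo.Var(bounds=(-8, 8))
m.f2 = pyo.Var()                                 # approximated tanh(q)

m.def_q   = pyo.Constraint(expr=sum(BX[k] * m.lam[k] for k in m.K) == m.q)
m.def_f2  = pyo.Constraint(expr=sum(BY[k] * m.lam[k] for k in m.K) == m.f2)
m.lam_sum = pyo.Constraint(expr=sum(m.lam[k] for k in m.K) == 1)
m.beta_sum = pyo.Constraint(expr=sum(m.beta[s] for s in m.S) == 1)

def adjacency(m, k):
    # each lambda may be positive only if an adjacent segment is selected
    segs = [s for s in (k - 1, k) if 0 <= s <= 4]
    return m.lam[k] <= sum(m.beta[s] for s in segs)
m.adjacency = pyo.Constraint(m.K, rule=adjacency)

# Toy objective: find the q in [-8, 8] whose PWL-tanh value is closest to 0.5
m.t = pyo.Var(within=pyo.NonNegativeReals)
m.up = pyo.Constraint(expr=m.f2 - 0.5 <= m.t)
m.lo = pyo.Constraint(expr=0.5 - m.f2 <= m.t)
m.obj = pyo.Objective(expr=m.t, sense=pyo.minimize)
# pyo.SolverFactory("cplex").solve(m)            # requires an installed MILP solver
```

Because exactly one beta is active, at most two adjacent lambda weights can be non-zero, reproducing the convex-combination structure of Problem (2.6).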
Problem (2.6) still exhibits non-linear behavior because of the bilinear terms in the objective function and in the equality constraint for q. The bilinear equality constraint for q involves the multiplication of a binary and a continuous variable. Accordingly, after determining upper and lower bounds on these variables, either by trial and error or by interval arithmetic, the products can be disaggregated and linearized in an exact manner. Moreover, the weight matrix w1 can be restricted to integer values, which is the case in binarized ANNs (Hubara et al., 2016). This reformulation would enable another exact linearization step for the multiplication of w1 and $f_2'$. Another practical approach would be to train the original ANN and use the optimal result, $w_1^o$, instead of the decision variable weight matrix w1. Note that this is a standard procedure for extreme learning machines (Ding et al., 2015). In this work, for simplicity, we use fixed values for the weight matrix w1 and employ exact linearization for the multiplication of w2 and uS. The resulting formulation is given in Problem (2.7).
$$\min_{w_2, b_1, b_2, u_S} \; v = \sum_{i=1}^{N} \Big( \big(w_1^{o\,T} f_2'(q) + b_2\big) - y_i \Big)^2 \qquad (2.7)$$

$$\text{s.t.} \quad q = w_2^T u_i + b_1,$$
$$-8 \le q \le 8,$$
$$-\left|w_2^{lo}\right| u_S \le w_2 \le \left|w_2^{up}\right| u_S,$$
$$-8\lambda_1 - 2\lambda_2 - 1\lambda_3 + 1\lambda_4 + 2\lambda_5 + 8\lambda_6 = q,$$
$$-1.29\lambda_1 - 0.96\lambda_2 - 0.76\lambda_3 + 0.76\lambda_4 + 0.96\lambda_5 + 1.29\lambda_6 = f_2'(q),$$
$$\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 + \lambda_6 = 1,$$
$$\beta_1 + \beta_2 + \beta_3 + \beta_4 + \beta_5 = 1,$$
$$\lambda_1 \le \beta_1, \quad \lambda_2 \le \beta_1 + \beta_2, \quad \lambda_3 \le \beta_2 + \beta_3, \quad \lambda_4 \le \beta_3 + \beta_4, \quad \lambda_5 \le \beta_4 + \beta_5, \quad \lambda_6 \le \beta_5,$$
$$v \le MSE_{desired},$$
$$u_S^{min} \le \sum u_S \le u_S^{max},$$
$$u_S \in \{0,1\},$$
$$\lambda_k \ge 0, \quad k = 1,\ldots,6,$$
$$\beta_m \in \{0,1\}, \quad m = 1,\ldots,5,$$
$$-40 \le w_2, b_1, b_2 \le 40.$$
where $\left|w_2^{lo}\right|$ and $\left|w_2^{up}\right|$ are the absolute values of the lower and upper bounds of the decision variable matrix w2. These bounds are chosen based on our previous numerical investigations; values above 40 and below -40 did not provide any improvement for the training case using the hyperbolic tangent function.
Nevertheless, interval analysis could also be performed to calculate effective bounds for the ANN. Please note that the new term obtained from the multiplication of w2 and uS after the exact linearization step also acts as a linking constraint, since the corresponding weight value is forced to zero once a particular feature is not selected. This constraint reduces the overall complexity of the original problem, as fewer sensitivity calculations are to be made by the solver. Finally, instead of using the sum of squared errors as the objective function, as given in Problem (2.7), we propose to convert the objective to the minimization of the sum of absolute error values. This way, the non-linearity in the objective function, stemming from squaring terms that involve the binary variables, is avoided. Admittedly, this reformulation brings about a non-smooth objective function, which is a severe issue for derivative-based optimization. On the other hand, the absolute value function can be further linearized exactly, which requires the addition of only a single binary variable to detect whether the target value is positive or not during the optimization iterations (Mangasarian, 2013).
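The two exact reformulations discussed above can be sketched as follows. This is an assumed Pyomo rendering, and the absolute-value treatment shown uses the common non-negative split, which is exact under minimization, rather than the single-binary variant of Mangasarian (2013) cited above.

```python
import pyomo.environ as pyo

W_LO, W_UP = -40.0, 40.0                      # weight bounds used in Problem (2.7)

m = pyo.ConcreteModel()
m.w = pyo.Var(bounds=(W_LO, W_UP))            # continuous weight
m.u = pyo.Var(within=pyo.Binary)              # feature-selection binary

# (i) linking constraints: if u = 0 the weight is forced to zero,
#     otherwise it may range over its original bounds
m.link_lo = pyo.Constraint(expr=m.w >= W_LO * m.u)
m.link_up = pyo.Constraint(expr=m.w <= W_UP * m.u)

# (ii) absolute residual |e| via a non-negative split e = e_pos - e_neg
m.e_pos = pyo.Var(within=pyo.NonNegativeReals)
m.e_neg = pyo.Var(within=pyo.NonNegativeReals)
target, prediction = 1.0, m.w                 # toy residual: prediction - target
m.split = pyo.Constraint(expr=prediction - target == m.e_pos - m.e_neg)
m.obj = pyo.Objective(expr=m.e_pos + m.e_neg, sense=pyo.minimize)  # equals |e| at the optimum
# pyo.SolverFactory("cplex").solve(m)
```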
Alternatively, the segment selection can also be modeled using special ordered set (SOS) variables. Furthermore, the resulting problem can be efficiently decomposed, owing to the convexity of the relaxed subproblems, or parallelized over many computation cores using CPLEX, which is a crucial property for the scalability of the proposed method, especially for larger data sets (Bliek et al., 2014; Shinano and Fujie, 2007).
3.1 Case Study A
The first case study includes 253 snapshot measurements collected from an industrial distillation tower over 2.5 years, 150 of which are used for training (Dunn, 2021). The quality-related variable, vapor pressure, is measured in the laboratory and is the ultimate output to be inferred from 26 online measurements, which include temperatures, pressures and flow rates from the tower. Due to proprietary reasons, the locations of those sensors and further details of the process are not disclosed.
Firstly, we tested the approximation performance of the proposed PWL-ANN against the standard ANN by evaluating the training data set with the parameters obtained from training the standard ANN using all features and the hyperbolic tangent activation function. In other words, the approximation capability of the PWL-ANN on a standard trained ANN is examined. The comparison is provided in Table 1a for three and five neurons in the hidden layer. In addition, the number of pieces in the approximation of the activation function and their coordinates are required for the feedforward evaluation in the PWL-ANN. Table 1a includes the evaluation of training and test data using the PWL-ANN with 3, 5, and 7 pieces for the approximation of the hyperbolic tangent function (see Fig. 2.1).
Typically, a higher number of pieces might be expected to result in better approximations, but both the coordinates of these pieces and the original dataset may have a significant impact. In this particular case, we find that a higher number of pieces in the approximation does not contribute to a significant increase in the validity of the approximation. On the other hand, increasing the number of pieces might be expected to provide a crucial contribution in terms of approximation once the data are collected from regions of significant mismatch between the original activation function and the piecewise representation.
As shown in Table 1a, the PWL-ANN delivers similar training and test performances with both 3 and 5 hidden neurons. Additionally, 3 pieces with the given coordinates (see Fig. 2.1) provided a satisfactory representation of the activation function.
Table 1a also reveals that the PWL-ANN structure with the fixed weight values provides similar training and test performances to the standard ANN with the hyperbolic tangent function. This result shows the acceptable approximation capability of the PWL-ANN for this particular case. On the other hand, we should mention here that even better training and test performances are achievable by training the network using the proposed PWL-ANN architecture, since the PWL-ANN provides a convex structure for training. At the same time, a training solution, regardless of whether it is the global one or not, can also be verified within a rather acceptable error tolerance using the suggested PWL-ANN method.
After verifying the PWL-ANN on the full input space, the optimal feature selection algorithm is employed. The optimal weight values obtained for the standard ANN were set as initial guesses for the PWL-ANN training algorithm, which accounts for the feature selection as well, and the problem is solved using CPLEX in the GAMS language to an absolute and relative tolerance level of 10^-3 to reduce the input space with the convex optimal feature selection. Results are given in Table 1b and include the PWL-ANN and ANN performances using the selected features: TempC2, Temp8, InvTemp1, InvTemp3 and InvPressure1. Finally, it is worth mentioning here that several runs from different initial guesses can be avoided once the PWL-ANN is used, since the same training results are obtained due to the convexity of the formulation.
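For readers reproducing this setup outside GAMS, a hypothetical configuration of the CPLEX solve through Pyomo is sketched below; the gap tolerances follow the 10^-3 values reported above and the thread count mirrors Table 2, while the interface and option names are assumptions.

```python
import pyomo.environ as pyo

def solve_pwl_ann(model, cores=40):
    """Solve a Pyomo model with CPLEX using the tolerances reported above.
    Option names follow CPLEX; the authors' original solve was done in GAMS."""
    solver = pyo.SolverFactory("cplex")
    solver.options["mip_tolerances_absmipgap"] = 1e-3   # absolute optimality gap
    solver.options["mip_tolerances_mipgap"] = 1e-3      # relative optimality gap
    solver.options["threads"] = cores                   # parallel branch-and-bound (cf. Table 2)
    return solver.solve(model, tee=True)
```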
A major indication of over-fitting is a relatively large difference between training and test performances. Further, higher numbers of inputs and hidden layer neurons are expected to contribute to over-fitting since more parameters and connections are introduced. In such cases, the training error, with a high flexibility for fitting, is reduced significantly but leads to a significant test error, as shown in Table 1a for ANNs with no input selection. Table 1b demonstrates that even though the standard ANN (ANN) results in better training performance, which is typical owing to the fact that the PWL-ANN provides an upper bound on the standard ANN training problem due to the approximation, the PWL-ANN achieves better test performance using the selected optimum features. This result can be expected because the optimum features are computed using the PWL-ANN method, not the standard non-convex ANN method. However, it is also observed that the test performance of the standard ANN improves significantly when the selected features are used, showing the contribution of the proposed method for standard ANNs as well. The results are also presented in Fig. 3.1 and Fig. 3.2.
Furthermore, extra relaxations, e.g. Lagrangian relaxation, can be combined with the current decomposable structure in a nested sense using the suggested approach, which is expected to provide even faster training.
3.2 Case Study B
The proposed optimal training and feature selection method is applied to select 30%, 50%, and 70% of the features using the aforementioned refinery data. These percentages are included in the PWL-ANN method given by Problem 2.8 as lower and upper bounds on the number of features to be selected. The corresponding performances with reduced feature spaces are compared to that of the full feature space case, and the results are presented in Table 3. Please note that the results of this case study are summarized in Table 3.
Table 3 includes the training and test performances of the PWL-ANN and the standard ANN for the selected optimum features and for all features. Here, it is again worth mentioning that the optimum features are selected using the PWL-ANN method. The standard ANN entries (ANN) account only for the training of a standard ANN with the optimally selected features. For a fair comparison, the input weights (w1) are enforced to be the same for the PWL-ANN and the standard ANN. The main aim is to compare the performance of a standard ANN against the PWL-ANN with the optimum feature space.
First of all, for almost all of the training cases, the standard ANN achieves better performance than the PWL-ANN with the selected features. Similar to Case Study A, this is an expected result since the PWL formulations theoretically provide upper bounds on the original nonlinear programming problems solved for training. Nonetheless, for the 50% and 70% feature selections, the PWL-ANN test performance is slightly better than that of the standard ANN. Here, it can again be argued that this is anticipated because the optimum features are selected using the PWL-ANN method. Another possible conclusion is that once the number of features increases, more weight parameters are introduced, which in turn increases the probability of overfitting in the standard ANN. As a result, the PWL-ANN provides a robust framework due to convexity. Conversely, for the 30% case, far fewer features are selected. Accordingly, the number of weight and bias parameters is minimal, and it can be expected that the impact of non-convexity and the possibility of overfitting decrease. Hence, the test performance of the standard ANN is slightly ahead of the PWL-ANN, showing the trade-off between approximation and optimization for the PWL-ANN. On the other hand, for all cases, regardless of PWL or standard ANN, the test performance improves with the selected inputs compared to the all-features case, showing the benefit of the optimum feature selection.
Additionally, note that even small variations in a particular activation function might deliver a better network performance (Koçak and Üstündağ Şiray, 2021). Thus, an exact or very close representation of an activation function does not ensure better network performance. On the other hand, for this particular case, the selection of 30% and 50% of the features yields a similar ANN training performance. In contrast, 70% feature selection increases the test error significantly.
Fig. 3.3 includes the training and test performances using all inputs for the ANN. Since a high number of parameters is available, the training data seem to be represented almost perfectly. Nevertheless, the test data show significant prediction errors.
Fig. 3.4 includes the results for 30% of the inputs. Note that a significant test performance boost is obtained compared to Fig. 3.3. In addition, the training and test errors are relatively close, unlike Fig. 3.3, where a poor test performance is displayed.
Thus, the suggested procedure might be preferable over NLPs for specific needs such as avoiding multi-start phases for training and selecting the best features to be used for learning. In addition, linking constraints are present in the suggested formulation to prune all the connections once a particular feature is eliminated, which in turn decreases the computational complexity stemming from the larger number of binary variables. In theory, once a particular column of the input weights is zero, there is no information flow from that particular input. This is achieved by limiting the value of the corresponding weights to zero through the linking constraint, narrowing the feasible region for computational efficiency, unlike regularization methods, where large weight values are penalized (Nusrat and Jang, 2018). Such traditional regularization methods often yield low weight values while the connection still exists and might have a significant sensitivity.
Note that the optimum solutions obtained through the PWL-ANN, in general, deliver a higher training error because of (i) the approximation and (ii) the reduced feature space and the smaller number of parameters in the network. On the other hand, the same formulation results in reduced overfitting at the same time, increasing the generalization capability of both the PWL-ANN and the standard ANN. Therefore, a better test performance is obtained once the optimum features are selected.
4 Conclusion
This study proposes a mixed-integer linear programming based method for optimal and global feature selection and training of artificial neural networks on regression problems. The main idea is to introduce piece-wise linear approximations to convert the original non-convex activation (hyperbolic tangent) and objective functions in traditional feed-forward artificial neural networks into linear forms, in addition to introducing binary variables to represent the existence of particular features. Furthermore, the suggested reformulations and relaxations are also applicable to other activation functions, e.g. sigmoid or ReLU, and to classification problems, to handle both non-convexities and discontinuities. The optimal feature selection and training problem for the ANNs is formulated as a simultaneous and convex optimization problem whose optimal solution is unique. This way, the generalization issue and multi-step trainings can be avoided. Typically, regardless of the initial guesses, the corresponding optimization problems always terminate at the unique optimum. Furthermore, the convex nature of the overall method enables effective and reliable decomposition and parallelization of the corresponding optimization problems, which is a crucial advantage for the scalability of the proposed approach to larger data sets and feature spaces.
The proposed methodology is applicable to many machine learning training algorithms to address superstructure related issues, since the architecture of machine learning models, and ANNs in particular, is mostly determined by a trial-and-error procedure until a satisfactory performance is obtained. On the other hand, the number of combinations increases significantly once the number of features is high. Furthermore, with standard methods, significant test errors and practical problems occur due to overfitting, which in turn calls for the removal of unrelated or statistically less contributing features. Finally, the overall convexity of the ANN formulation enables us to obtain unique solutions for the training. Although not covered in this study, the formulations are easily extendable to activation function selection and neural network pruning under global optimality.
The suggested formulation is implemented on two industrial case studies. We demonstrate that efficient and reliable approximations are obtained even when only a few breakpoints are used for the piece-wise linear approximations. The proposed method results in a significant amount of feature space reduction, which in turn yields increased test accuracy for the ANNs. Our future work includes the implementation on larger data sets and classification problems.
Acknowledgements
This work was supported by the Outstanding Researchers Program of TUBITAK (Project No: 118C245). However, the entire responsibility for the publication belongs to the owner of the publication. The authors thank the SOCAR refinery for providing the real process data for the second case study.
Author Contributions
References
Atakulreka, A., Sutivong, D., 2007. Avoiding …, 109. https://doi.org/10.1007/978-3-540-76928-6_12
Bliek, C., Bonami, P., Lodi, A., 2014. Solving Mixed-Integer Quadratic Programming problems with IBM-CPLEX: a progress report, in: Proceedings …
Bunel, R., Lu, J., Turkaslan, I., Torr, P.H.S., Kohli, P., Pawan Kumar, M., 2019. Branch and bound for piecewise linear neural network verification. arXiv 21, 1–39.
Bunel, R., Turkaslan, I., Torr, P.H.S., Kohli, P., Pawan Kumar, M., 2017. A …
D’Ambrosio, C., Lodi, A., Martello, S., 2010. Piecewise linear approximation of functions of two variables in MILP models. Oper. Res. Lett. 38, 39–46. https://doi.org/10.1016/j.orl.2009.09.005
Ding, S., Zhao, H., Zhang, Y., Xu, X., Nie, R., 2015. Extreme learning machine: algorithm, theory and applications. Artif. Intell. Rev. 44, 103–115. https://doi.org/10.1007/s10462-013-9405-z
Doncevic, D.T., Schweidtmann, A.M., Vaupel, Y., Schäfer, P., Caspari, A., …, 2020. … IFAC-PapersOnLine. https://doi.org/10.1016/j.ifacol.2020.12.1207
Dua, V., 2010. A mixed-integer programming approach for optimal configuration of artificial neural networks. Chem. Eng. Res. Des. https://doi.org/10.1016/j.cherd.2009.06.007
Dunn, K., 2021. OpenMV.net Datasets [WWW Document].
Dutta, S., Jha, S., Sanakaranarayanan, S., Tiwari, A., 2017. Output range …
Frenzen, C.L., Sasao, T., Butler, J.T., 2010. On the number of segments …. J. Comput. Appl. Math., 437–446. https://doi.org/10.1016/j.cam.2009.12.035
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y., 2016. Binarized neural networks.
Kavzoglu, T., Mather, P.M., 1999. Pruning artificial neural networks …
Koçak, Y., Üstündağ Şiray, G., 2021. New activation functions for single layer feedforward neural network. Expert Syst. Appl. 164. https://doi.org/10.1016/j.eswa.2020.113977
Lodi, A., 2010. Mixed integer programming computation, in: 50 Years of Integer Programming 1958–2008.
Matias, T., Souza, F., Araújo, R., Antunes, C.H., 2014. Learning of a single- … Neurocomputing. https://doi.org/10.1016/j.neucom.2013.09.016
Miao, J., Niu, L., 2016. A Survey on Feature Selection. Procedia Comput. Sci.
Nawi, N.M., Atomi, W.H., Rehman, M.Z., 2013. The Effect of Data Pre-processing on Optimized Training of Artificial Neural Networks. Procedia …
Plumb, A.P., Rowe, R.C., York, P., Brown, M., 2005. Optimisation of the … three ANN programs and four classes of training algorithm. Eur. J. Pharm. …
Rister, B., Rubin, D.L., 2017. Piecewise convexity of artificial neural networks. Neural Netw. https://doi.org/10.1016/j.neunet.2017.06.009
Rosa, J.P.S., Guerra, D.J.D., Horta, N.C.G., Martins, R.M.F., 2020. …
Sattari, M.A., Roshani, G.H., Hanus, R., 2020. Improving the structure of two-phase flow meter using feature extraction and GMDH neural network. Radiat. …
Schweidtmann, A.M., Mitsos, A., 2019. Deterministic global optimization with artificial neural networks embedded. J. Optim. Theory Appl. https://doi.org/10.1007/s10957-018-1396-0
Shinano, Y., Fujie, T., 2007. ParaLEX: A parallel extension for the CPLEX mixed integer optimizer, in: Lecture Notes in Computer Science. https://doi.org/10.1007/978-3-540-75416-9_19
Sibi, P., Allwyn Jones, S., Siddarth, P., 2013. Analysis of different activation …
Sildir, H., Aydin, E., Kavzoglu, T., 2020. Design of feedforward neural networks …
Sildir, H., Sarrafi, S., Erdal, A., 2021. Data-driven Modeling of an Industrial Ethylene Oxide Plant: Superstructure-based Optimal Design for Artificial Neural Networks.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., …
Storace, M., De Feo, O., 2004. Piecewise-linear approximation of nonlinear dynamical systems. IEEE Trans. Circuits Syst. I Regul. Pap. 51, 830–842. https://doi.org/10.1109/TCSI.2004.823664
Swirszcz, G., Czarnecki, W.M., Pascanu, R., 2016. Local minima in training of neural networks, 1–12.
Wang, X.G., Tang, Z., Tamura, H., Ishii, M., Sun, W.D., 2004. An improved …
Wen, C., Ma, X., 2008. A max-piecewise-linear neural network for function approximation. Neurocomputing. https://doi.org/10.1016/j.neucom.2007.03.001
Yang, L., Liu, S., Tsoka, S., Papageorgiou, L.G., 2016. Mathematical programming for piecewise linear regression analysis. Expert Syst. Appl. 44, 156–167. https://doi.org/10.1016/j.eswa.2015.08.034
Figure 3.1. ANN training (a) and test (b) performances using all features.
Figure 3.2. ANN training (a) and test (b) performances using selected features.
Figure 3.3. ANN training (a) and test (b) performances using all features.
Figure 3.4. ANN training (a) and test (b) performances using 30% of features.
Table 1a. Training and test performance of the PWL-ANN (3, 5 and 7 pieces) and the standard ANN with 3 and 5 hidden neurons, using all features.

                       3 neurons                              5 neurons
                       PWL-ANN                     ANN        PWL-ANN                        ANN
                       3 pcs.   5 pcs.   7 pcs.               3 pcs.   5 pcs.   7 pcs.
training   MSE         0.003    0.003    0.001    0.0008      0.0023   0.0015   0.0002      0.00005
           MAE         0.046    0.046    0.026    0.022       0.0380   0.0300   0.0123      0.00500
test       MSE         0.088    0.093    0.093    0.121       0.0710   0.0722   0.0722      0.080
           MAE         0.186    0.189    0.189    0.219       0.2153   0.2212   0.2212      0.231
Table 1b. Results of Case Study A with feature selection using three hidden neurons
with three-piece approximation.
                 PWL-ANN   ANN
training   MSE   0.006     0.003
           MAE   0.050     0.041
test       MSE   0.005     0.009
           MAE   0.049     0.057
Table 2. Computational times for various number of utilized cores using CPLEX
solver.
Number of cores   CPU time [min]
15                5.55
20                5.10
40                2.50