Journal Pre-Proofs: Chemical Engineering Science
PII: S0009-2509(21)00838-1
DOI: https://doi.org/10.1016/j.ces.2021.117273
Reference: CES 117273
Please cite this article as: H. Sildir, E. Aydin, A Mixed-Integer Linear Programming based Training and Feature
Selection Method for Artificial Neural Networks using Piece-wise Linear Approximations, Chemical
Engineering Science (2021), doi: https://doi.org/10.1016/j.ces.2021.117273
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover
page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version
will undergo additional copyediting, typesetting and review before it is published in its final form, but we are
providing this version to give early visibility of the article. Please note that, during the production process, errors
may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
d Koc University TUPRAS Energy Center (KUTEM), Koc University, Sariyer, Istanbul, 34450, Turkey
Highlights
Abstract
Artificial Neural Networks (ANNs) may suffer from suboptimal training and test performance not only because of the presence of a high number of features with low statistical contributions but also due to their non-convex nature. This study develops novel reformulations of the activation and objective functions in artificial neural networks for optimal, global and simultaneous training and feature selection in regression problems. Such formulations include binary variables to account for the existence of the features and piecewise-linear approximations, which in turn, after one exact linearization step, call for solving a mixed-integer linear programming (MILP) problem. The proposed formulation is implemented on two industrial case studies. Results show that efficient approximations are obtained through the usage of the method with only a few breakpoints.
Keywords
machine learning
feature selection
mixed-integer programming
1 Introduction
The feed-forward artificial neural network is a special type of ANN among many (Rosa et al., 2020). In this architecture, information propagates only in the forward direction, and our study adopts such a formulation throughout this paper. These ANNs include a single hidden layer in addition to the input and output layers, all of which might be activated using different activation functions. The flexibility and representative capability of ANNs are highly dependent on the inputs, the number of hidden layer neurons (Vujičić and Matijević, 2016), the selection of activation functions (Sibi et al., 2013), the training algorithm (Plumb et al., 2005), and data pre-processing (Nawi et al., 2013). Such a high number of hyper-parameters may render training cumbersome, which in turn usually leads to test performance issues. In parallel, an extensive amount of manual trial-and-error is required until a satisfactory network architecture is obtained. Furthermore, since training feed-forward and fully connected ANNs calls for the solution of highly non-convex optimization problems, the results of the training problems might represent local suboptimal solutions. In addition, an ANN also exhibits non-convex behavior with respect to the features, besides the activation function. Thus, a rigorous method addressing the aforementioned issues is required, yet theoretically challenging to obtain.
One possible remedy for suboptimal training is to use global solution methods and solvers for training. However, one vital drawback of global non-convex optimization methods is the required computational power and time. In their pioneering work, Schweidtmann and Mitsos introduced relaxation and reduction methods for global optimization problems with embedded trained, non-convex artificial neural networks (Schweidtmann and Mitsos, 2019).
Moreover, it is also emphasized in the literature that there exist only limited cases where the global minimum can be ensured with such non-smooth objective functions, owing to non-linear interactions among network variables and a high number of suboptimal solutions (Swirszcz et al., 2016). Such suboptimal solutions usually hinder the utilization of existing hidden layer neurons. In (Atakulreka and Sutivong, 2007), a removal criterion is introduced to favor more promising ANN candidates (Wang et al., 2004). However, the non-convex nature of the problem is still a major issue in training and further design considerations, including feature selection (Sildir et al., 2020).
Piecewise approximations have been widely used in the literature in the search for the global optimization of nonlinear functions, including ANN related problems (Storace and De Feo, 2004; Yang et al., 2016). Such an approach converts the traditional and non-convex nonlinear programming (NLP) problems into mixed integer programming problems (Polisetty and Gatzke, 2005; Vielma, 2015). Wen and Ma used piecewise
linear activation functions for ANNs in the hidden layer and obtained a balance
between the computational simplicity and accuracy ( Wen and Ma, 2008 ). Rister and
Rubin (Rister and Rubin, 2017) applied piecewise linearization to the nonlinear functions within the ANN architecture to obtain piecewise convexity and multi-convexity in each layer. They also show that, under squared error objective functions, such biconvex and piece-wise affine formulations may result in poor local optimal solutions. Recently, a similar approach was proposed in (Bunel et al., 2019), where big-M formulations are included for the verification problem using trained ANNs with ReLU (Rectified Linear Unit) activation functions. On the other hand, please note that the aforementioned formulations are limited to the ReLU activation function, which is piecewise linear by nature with a non-differentiable kink at the origin. Furthermore, note that relying on such a piecewise linear activation function might contribute to poor training performance and also introduces non-smoothness to the training problem. Therefore, using and approximating highly nonlinear activation functions, e.g. the hyperbolic tangent, in a piece-wise linear fashion would be more beneficial for many nonlinear processes in chemical engineering related problems.
Piece-wise linear formulations require the definition of breakpoints for each piece-wise linear segment. Afterwards, binary and auxiliary variables are introduced into the formulation in order to enforce the selection of the proper segment. This way, proper calculation of the convex (linear) combinations among the adjacent segments can be realized during the optimization iterations. Accordingly, the original non-convex NLP problem is approximated by a mixed integer linear programming (MILP) problem, whose optimal solution, if it exists, is unique due to its convexity (D'Ambrosio et al., 2010). The associated issue with this method is the discontinuity of the gradient at the breakpoints.
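To make the convex-combination idea concrete, the short sketch below evaluates a piecewise-linear function at a query point by activating only the two lambda weights of the enclosing segment; it is a hypothetical illustration with placeholder breakpoints, not part of the authors' implementation.

```python
import numpy as np

def pwl_eval(x, bp_x, bp_y):
    """Evaluate a piecewise-linear function given by breakpoints (bp_x, bp_y)
    as a convex combination of the two breakpoints enclosing x."""
    bp_x = np.asarray(bp_x, dtype=float)
    bp_y = np.asarray(bp_y, dtype=float)
    x = float(np.clip(x, bp_x[0], bp_x[-1]))          # stay inside the approximated range
    k = int(np.searchsorted(bp_x, x, side="right")) - 1
    k = min(k, len(bp_x) - 2)                          # the last breakpoint closes the last segment
    lam_right = (x - bp_x[k]) / (bp_x[k + 1] - bp_x[k])
    lam = np.zeros(len(bp_x))                          # lambda weights; only two are non-zero
    lam[k], lam[k + 1] = 1.0 - lam_right, lam_right
    return float(lam @ bp_y)

# Illustrative breakpoint coordinates only (not the paper's exact values)
print(pwl_eval(0.5, [-8, -2, -1, 1, 2, 8], [-1.0, -0.96, -0.76, 0.76, 0.96, 1.0]))
```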
Binary variables have also been used for the architectural design of ANNs, besides the piecewise representation of the activation functions. Dua (Dua, 2010) presented a mixed-integer formulation to account for the existence of connections and hidden layer neurons in the objective function. Although the selection of the features is not implemented explicitly, some solutions might yield the elimination of particular features once all of their connections to the hidden layer are removed by the algorithm. In (Matias et al., 2014), a binary variable is introduced to account for the existence of a particular feature, enabling simultaneous training and feature selection. Similarly, the authors recently proposed a promising way for optimal feature selection and regularization for ANNs using mixed-integer nonlinear programming (Sildir et al., 2021, 2020). However, one bottleneck of the proposed algorithm is its non-convex nature.
The present study addresses this issue and aims at more reliable data handling (Sattari et al., 2020) through simultaneous feature selection and training.
The standard single-hidden-layer feed-forward ANN computes the predicted output as

$$\hat{y}_i = f_1\!\left(w_1^T f_2\!\left(w_2^T u_i + b_1\right) + b_2\right),$$

where $u_i$ and $y_i$ are the $i$th input and output data samples, respectively, and $N$ is the number of samples used for training. The dimensions of the weights $w_1$, $w_2$ and biases $b_1$, $b_2$ depend on the number of inputs, outputs and neurons; these counts are called hyper-parameters and are usually determined by a trial-and-error procedure.
Statistically irrelevant or unnecessary features can be eliminated from the ANN architecture by introducing a binary selection vector $u_S$ into the ANN calculations,

$$\hat{y}_i = f_1\!\left(w_1^T f_2\!\left(w_2^T (u_i \odot u_S) + b_1\right) + b_2\right), \qquad u_S \in \{0,1\},$$

which enables the selection of the best features in the optimization. When an entry of this decision variable is equal to zero, the contribution of the corresponding input is eliminated; conversely, when it is equal to one, the corresponding feature exists. Here, it should be mentioned that the selection of the best features may be reminiscent of the Lasso formulation. Unlike the Lasso, however, the main idea in this work is to decide on the existence of the connections using binary variables instead of continuous ones, which in turn increases the strength of the formulation.
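As an illustration of the masking operation $u_i \odot u_S$, the hypothetical snippet below performs a single forward pass of a one-hidden-layer network with a binary feature mask; all names, shapes and values are placeholders rather than the authors' data.

```python
import numpy as np

def ann_forward(u, u_s, w2, b1, w1, b2):
    """Single-hidden-layer forward pass with a binary feature mask u_s.
    Features whose mask entry is 0 contribute nothing to the hidden layer."""
    h = np.tanh(w2.T @ (u * u_s) + b1)   # hidden layer, f2 = tanh
    return w1.T @ h + b2                 # output layer, f1 = identity

rng = np.random.default_rng(0)
n_in, n_hid = 5, 3
u   = rng.normal(size=n_in)
u_s = np.array([1, 0, 1, 1, 0])          # features 2 and 5 switched off
w2  = rng.normal(size=(n_in, n_hid))
w1  = rng.normal(size=(n_hid, 1))
b1  = rng.normal(size=n_hid)
b2  = rng.normal(size=1)
print(ann_forward(u, u_s, w2, b1, w1, b2))
```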
Forward computation of the network during the training phase for a particular feature requires the calculation of both continuous weights and bias values. This, unfortunately, might be an issue for classical back-propagation and stochastic gradient descent algorithms. Automatic differentiation algorithms (Güneş Baydin et al., 2018) can be employed for such computations only for a fixed architecture, since the sensitivities are applicable to continuous variables only. Finally, the optimal feature selection can be represented as a non-convex optimization problem containing both continuous and discrete decision variables.
Accordingly, the training and optimal feature selection problem, which is a mixed-integer nonlinear programming (MINLP) problem, is given as follows:
$$\min_{w_1, w_2, b_1, b_2, u_S} \; \sum_{i=1}^{N} \Big( f_1\big(w_1^T f_2(w_2^T (u_i \odot u_S) + b_1) + b_2\big) - y_i \Big)^2 \qquad (2.4)$$

$$\text{s.t.} \quad \sum_{i=1}^{N} \Big( f_1\big(w_1^T f_2(w_2^T (u_i \odot u_S) + b_1) + b_2\big) - y_i \Big)^2 \le MSE_{desired},$$

$$u_S^{min} \le \sum u_S \le u_S^{max},$$

$$u_S \in \{0,1\}.$$
where MSE_desired is the desired mean square error (MSE), and $u_S^{min}$ and $u_S^{max}$ are the minimum and maximum number of features to be selected, respectively.
The objective of Problem (2.4) is to minimize the MSE for training, while a pre-selected value, MSE_desired, is enforced as an upper limit for the error. This limit can be estimated by training a standard ANN on the same data set and scaling the resulting MSE by a factor of, for instance, 1.1. In general, as more features are employed, the MSE is expected to decrease since the number of parameters increases. However, at the same time, the possibility of overfitting increases, especially for lower MSE_desired values in training. Therefore, the formulation in Problem (2.4) exhibits a trade-off between overfitting and training error while computing the best features to be selected throughout the training. This way, the effect of overfitting can be minimized. Consequently, the trained ANN with the selected features might result in a higher training error than the ANN with all features, while obtaining a lower test error with reduced overfitting. Finally, please note that Problem (2.4) is equivalent to the standard ANN training problem when all features are selected.
The first step in the proposed PWL-ANN method is to linearize the original nonlinear activation function, the hyperbolic tangent. Secondly, the objective function of the training problem for ANNs, which is formulated as the sum of squared errors, is reformulated as the sum of absolute error values through piece-wise linearization. This is required in order to linearize the non-convex objective function, whose non-convexity is caused by the squaring of terms involving the binary variables associated with the feature selection. Finally, the overall training and feature selection problem can be solved simultaneously together with the PWL approximation equations.
Figure 2.1 illustrates ways of approximating the hyperbolic tangent function with different numbers of piece-wise linear segments (D'Ambrosio et al., 2010). A more detailed review of the required number of pieces for different functions can be found in (Frenzen et al., 2010). Please note that the proposed PWL formulation is generic in nature and can be applied to different activation functions, e.g. sigmoid or ReLU. The piece-wise linear approximation of the hyperbolic tangent function, tanh(x), can be implemented using five segments as follows:
$$\tanh(x) \approx \begin{cases} 0.05x - 0.85 & \text{if } -8 \le x < -2,\\ 0.2x - 0.56 & \text{if } -2 \le x < -1,\\ 0.76x & \text{if } -1 \le x < 1,\\ 0.2x + 0.56 & \text{if } 1 \le x < 2,\\ 0.05x + 0.85 & \text{if } 2 \le x \le 8. \end{cases} \qquad (2.5)$$
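As a quick numerical check of Eq. (2.5), the snippet below evaluates the five-segment approximation and reports its maximum deviation from the exact hyperbolic tangent on [-8, 8]; it is an illustrative sketch rather than the authors' code.

```python
import numpy as np

def tanh_pwl(x):
    """Five-segment piecewise-linear approximation of tanh on [-8, 8], Eq. (2.5)."""
    x = np.clip(x, -8.0, 8.0)
    return np.piecewise(
        x,
        [x < -2, (x >= -2) & (x < -1), (x >= -1) & (x < 1), (x >= 1) & (x < 2), x >= 2],
        [lambda t: 0.05 * t - 0.85,
         lambda t: 0.20 * t - 0.56,
         lambda t: 0.76 * t,
         lambda t: 0.20 * t + 0.56,
         lambda t: 0.05 * t + 0.85],
    )

xs = np.linspace(-8, 8, 401)
print("max abs deviation from tanh:", np.max(np.abs(tanh_pwl(xs) - np.tanh(xs))))
```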
After assigning the output layer activation function, f1, to be the identity, the training and feature selection problem using five piece-wise linear segments can be formulated as follows:
$$\min_{w_1, w_2, b_1, b_2, u_S} \; v = \sum_{i=1}^{N} \Big( \big(w_1^T f_2'(q) + b_2\big) - y_i \Big)^2 \qquad (2.6)$$

$$\text{s.t.} \quad q = w_2^T (u_i \odot u_S) + b_1,$$
$$-8 \le q \le 8,$$
$$-8\lambda_1 - 2\lambda_2 - 1\lambda_3 + 1\lambda_4 + 2\lambda_5 + 8\lambda_6 = q,$$
$$-1.29\lambda_1 - 0.96\lambda_2 - 0.76\lambda_3 + 0.76\lambda_4 + 0.96\lambda_5 + 1.29\lambda_6 = f_2'(q),$$
$$\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 + \lambda_6 = 1,$$
$$\beta_1 + \beta_2 + \beta_3 + \beta_4 + \beta_5 = 1,$$
$$\lambda_1 \le \beta_1, \quad \lambda_2 \le \beta_1 + \beta_2, \quad \lambda_3 \le \beta_2 + \beta_3, \quad \lambda_4 \le \beta_3 + \beta_4, \quad \lambda_5 \le \beta_4 + \beta_5, \quad \lambda_6 \le \beta_5,$$
$$v \le MSE_{desired},$$
$$u_S^{min} \le \sum u_S \le u_S^{max},$$
$$u_S \in \{0,1\},$$
$$\lambda_k \ge 0, \quad k = 1,\ldots,6,$$
$$\beta_m \in \{0,1\}, \quad m = 1,\ldots,5,$$
$$-40 \le w_1, w_2, b_1, b_2 \le 40.$$
where v is the objective function, i.e. the sum of squared errors; q is the input to the approximated hyperbolic tangent function, $f_2'$; $\lambda_k$ are the non-negative auxiliary variables; and $\beta_m$ are the binary variables associated with the linear segments of the approximation, indexed by m. For more detailed information about formulating piece-wise linear approximations for non-convex optimization problems, the reader is referred to (Vielma, 2015).
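For illustration, the sketch below encodes the lambda/beta segment-selection constraints of Problem (2.6) for a single scalar pre-activation q using Pyomo; the authors implemented the full model in GAMS, so the modeling-language choice, the toy objective and all variable names here are assumptions.

```python
import pyomo.environ as pyo

BX = [-8, -2, -1, 1, 2, 8]                       # breakpoint abscissae
BY = [-1.29, -0.96, -0.76, 0.76, 0.96, 1.29]     # breakpoint ordinates from Problem (2.6)

m = pyo.ConcreteModel()
m.K = pyo.RangeSet(0, 5)                         # breakpoints
m.S = pyo.RangeSet(0, 4)                         # segments
m.lam = pyo.Var(m.K, within=pyo.NonNegativeReals)
m.beta = pyo.Var(m.S, within=pyo.Binary)
m.q = pyo.Var(bounds=(-8, 8))
m.f2 = pyo.Var()                                 # approximated tanh(q)

m.def_q   = pyo.Constraint(expr=sum(BX[k] * m.lam[k] for k in m.K) == m.q)
m.def_f2  = pyo.Constraint(expr=sum(BY[k] * m.lam[k] for k in m.K) == m.f2)
m.lam_sum = pyo.Constraint(expr=sum(m.lam[k] for k in m.K) == 1)
m.beta_sum = pyo.Constraint(expr=sum(m.beta[s] for s in m.S) == 1)

def adjacency(m, k):
    # each lambda may be positive only if an adjacent segment is selected
    segs = [s for s in (k - 1, k) if 0 <= s <= 4]
    return m.lam[k] <= sum(m.beta[s] for s in segs)
m.adjacency = pyo.Constraint(m.K, rule=adjacency)

# Toy objective: find the q in [-8, 8] whose PWL-tanh value is closest to 0.5
m.t = pyo.Var(within=pyo.NonNegativeReals)
m.up = pyo.Constraint(expr=m.f2 - 0.5 <= m.t)
m.lo = pyo.Constraint(expr=0.5 - m.f2 <= m.t)
m.obj = pyo.Objective(expr=m.t, sense=pyo.minimize)
# pyo.SolverFactory("cplex").solve(m)            # requires an installed MILP solver
```

Because exactly one beta is active, at most two adjacent lambda weights can be non-zero, reproducing the convex-combination structure of Problem (2.6).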
Problem (2.6) still exhibits non-linear behavior because of the bilinear terms in the objective function and in the equality constraint for q. The bilinear equality constraint for q involves the multiplication of a binary and a continuous variable. Accordingly, after determining upper and lower bounds on these variables, either by trial and error or by interval arithmetic, the products can be disaggregated and linearized in an exact manner. Moreover, the weight matrix w1 can be restricted to integer values, which is the case in binarized ANNs (Hubara et al., 2016). This reformulation would enable another exact linearization step for the multiplication of w1 and $f_2'$. Another practical approach would be to train the original ANN and use the optimal result, $w_1^o$, instead of the decision variable weight matrix w1. Note that this is a standard procedure for extreme learning machines (Ding et al., 2015). In this work, for simplicity, we use fixed values for the weight matrix w1 and employ exact linearization for the multiplication of w2 and uS. The resulting formulation is given in Problem (2.7).
$$\min_{w_2, b_1, b_2, u_S} \; v = \sum_{i=1}^{N} \Big( \big(w_1^{o\,T} f_2'(q) + b_2\big) - y_i \Big)^2 \qquad (2.7)$$

$$\text{s.t.} \quad q = w_2^T u_i + b_1,$$
$$-8 \le q \le 8,$$
$$-\left|w_2^{lo}\right| u_S \le w_2 \le \left|w_2^{up}\right| u_S,$$
$$-8\lambda_1 - 2\lambda_2 - 1\lambda_3 + 1\lambda_4 + 2\lambda_5 + 8\lambda_6 = q,$$
$$-1.29\lambda_1 - 0.96\lambda_2 - 0.76\lambda_3 + 0.76\lambda_4 + 0.96\lambda_5 + 1.29\lambda_6 = f_2'(q),$$
$$\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 + \lambda_6 = 1,$$
$$\beta_1 + \beta_2 + \beta_3 + \beta_4 + \beta_5 = 1,$$
$$\lambda_1 \le \beta_1, \quad \lambda_2 \le \beta_1 + \beta_2, \quad \lambda_3 \le \beta_2 + \beta_3, \quad \lambda_4 \le \beta_3 + \beta_4, \quad \lambda_5 \le \beta_4 + \beta_5, \quad \lambda_6 \le \beta_5,$$
$$v \le MSE_{desired},$$
$$u_S^{min} \le \sum u_S \le u_S^{max},$$
$$u_S \in \{0,1\},$$
$$\lambda_k \ge 0, \quad k = 1,\ldots,6,$$
$$\beta_m \in \{0,1\}, \quad m = 1,\ldots,5,$$
$$-40 \le w_2, b_1, b_2 \le 40.$$
where $\left|w_2^{lo}\right|$ and $\left|w_2^{up}\right|$ are the absolute values of the lower and upper bounds of the decision variable matrix w2. These bounds are chosen based on our previous numerical investigations; values above 40 and below -40 did not provide any improvement for the training case using the hyperbolic tangent function.
Nevertheless, interval analysis could also be performed to calculate effective bounds for the ANN. Please note that the new term obtained from the multiplication of w2 and uS after the exact linearization step also acts as a linking constraint, since the corresponding weight value is forced to zero once a particular feature is not selected. This constraint reduces the overall complexity of the original problem, as fewer sensitivity calculations are to be made by the solver. Finally, instead of using the sum of squared errors as the objective function, as given in Problem (2.7), we propose to convert the objective to the minimization of the sum of absolute error values. This way, the non-linearity in the objective function, stemming from squaring terms that involve the binary variables, is avoided. Admittedly, this reformulation brings about a non-smooth objective function, which is a severe issue for derivative-based optimization. On the other hand, the absolute value function can be further linearized exactly, which requires the addition of only a single binary variable to detect whether the target value is positive or not during the optimization iterations (Mangasarian, 2013).
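The two exact reformulations discussed above can be sketched as follows. This is an assumed Pyomo rendering, and the absolute-value treatment shown uses the common non-negative split, which is exact under minimization, rather than the single-binary variant of Mangasarian (2013) cited above.

```python
import pyomo.environ as pyo

W_LO, W_UP = -40.0, 40.0                      # weight bounds used in Problem (2.7)

m = pyo.ConcreteModel()
m.w = pyo.Var(bounds=(W_LO, W_UP))            # continuous weight
m.u = pyo.Var(within=pyo.Binary)              # feature-selection binary

# (i) linking constraints: if u = 0 the weight is forced to zero,
#     otherwise it may range over its original bounds
m.link_lo = pyo.Constraint(expr=m.w >= W_LO * m.u)
m.link_up = pyo.Constraint(expr=m.w <= W_UP * m.u)

# (ii) absolute residual |e| via a non-negative split e = e_pos - e_neg
m.e_pos = pyo.Var(within=pyo.NonNegativeReals)
m.e_neg = pyo.Var(within=pyo.NonNegativeReals)
target, prediction = 1.0, m.w                 # toy residual: prediction - target
m.split = pyo.Constraint(expr=prediction - target == m.e_pos - m.e_neg)
m.obj = pyo.Objective(expr=m.e_pos + m.e_neg, sense=pyo.minimize)  # equals |e| at the optimum
# pyo.SolverFactory("cplex").solve(m)
```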
Alternatively, the segment selection can also be modeled using special ordered set (SOS) variables. Furthermore, the resulting problem can be efficiently decomposed, owing to the convexity of the relaxed subproblems, or parallelized over many computation cores using CPLEX, which is a crucial property for the scalability of the proposed method, especially for larger data sets (Bliek et al., 2014; Shinano and Fujie, 2007).
3.1 Case Study A
The first case study includes 253 snapshot measurements collected from an industrial distillation tower over 2.5 years, 150 of which are used for training (Dunn, 2021). The quality-related variable, vapor pressure, is measured in the laboratory and is the ultimate output to be inferred from 26 online measurements, which include temperatures, pressures and flow rates from the tower. Due to proprietary reasons, the locations of those sensors and further details of the process are not disclosed.
Firstly, we tested the approximation performance of the proposed PWL-ANN against the standard ANN by evaluating the training data set with the parameters obtained from training the standard ANN using all features and the hyperbolic tangent activation function. In other words, the approximation capability of the PWL-ANN on a standard trained ANN is examined. The comparison is provided in Table 1a for three and five neurons in the hidden layer. In addition, the number of pieces in the approximation of the activation function and their coordinates are required for the feedforward evaluation in the PWL-ANN. Table 1a includes the evaluation of training and test data using the PWL-ANN with 3, 5, and 7 pieces for the approximation of the hyperbolic tangent function (see Fig. 2.1).
Typically, a higher number of pieces might be expected to result in better approximations, but both the coordinates of these pieces and the original dataset may have a significant impact. In this particular case, we find that a higher number of pieces in the approximation does not contribute to a significant increase in the validity of the approximation. On the other hand, increasing the number of pieces might be expected to provide a crucial contribution in terms of approximation once the data are collected from regions of significant mismatch between the original activation function and the piecewise representation.
As shown in Table 1a, the PWL-ANN delivers similar training and test performances with both 3 and 5 hidden neurons. Additionally, 3 pieces with the given coordinates (see Fig. 2.1) provided a satisfactory representation of the activation function.
Table 1a also reveals that the PWL-ANN structure with the fixed weight values provides similar training and test performances to the standard ANN with the hyperbolic tangent function. This result shows the acceptable approximation capability of the PWL-ANN for this particular case. On the other hand, we should mention here that even better training and test performances are achievable by training the network using the proposed PWL-ANN architecture, since the PWL-ANN provides a convex structure for training. At the same time, a training solution, regardless of whether it is the global one or not, can also be verified within a rather acceptable error tolerance using the suggested PWL-ANN method.
After verifying the PWL-ANN on the full input space, the optimal feature selection algorithm is employed. The optimal weight values obtained for the standard ANN were set as initial guesses for the PWL-ANN training algorithm, which accounts for the feature selection as well, and the problem is solved using CPLEX in the GAMS language to an absolute and relative tolerance level of 10^-3 to reduce the input space with the convex optimal feature selection. Results are given in Table 1b and include the PWL-ANN and ANN performances using the selected features: TempC2, Temp8, InvTemp1, InvTemp3 and InvPressure1. Finally, it is worth mentioning here that several runs from different initial guesses can be avoided once the PWL-ANN is used, since the same training results are obtained due to the convexity of the formulation.
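For readers reproducing this setup outside GAMS, a hypothetical configuration of the CPLEX solve through Pyomo is sketched below; the gap tolerances follow the 10^-3 values reported above and the thread count mirrors Table 2, while the interface and option names are assumptions.

```python
import pyomo.environ as pyo

def solve_pwl_ann(model, cores=40):
    """Solve a Pyomo model with CPLEX using the tolerances reported above.
    Option names follow CPLEX; the authors' original solve was done in GAMS."""
    solver = pyo.SolverFactory("cplex")
    solver.options["mip_tolerances_absmipgap"] = 1e-3   # absolute optimality gap
    solver.options["mip_tolerances_mipgap"] = 1e-3      # relative optimality gap
    solver.options["threads"] = cores                   # parallel branch-and-bound (cf. Table 2)
    return solver.solve(model, tee=True)
```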
A major indication of over-fitting is a relatively large difference between training and test performances. Further, higher numbers of inputs and hidden layer neurons are expected to contribute to over-fitting since more parameters and connections are introduced. In such cases, the training error, with a high flexibility for fitting, is reduced significantly but leads to a significant test error, as shown in Table 1a for ANNs with no input selection. Table 1b demonstrates that even though the standard ANN (ANN) results in better training performance, which is typical owing to the fact that the PWL-ANN provides an upper bound on the standard ANN training problem due to the approximation, the PWL-ANN achieves better test performance using the selected optimum features. This result can be expected because the optimum features are computed using the PWL-ANN method, not the standard non-convex ANN method. However, it is also observed that the test performance of the standard ANN improves significantly when the selected features are used, showing the contribution of the proposed method for standard ANNs as well. The results are also presented in Fig. 3.1 and Fig. 3.2.
Furthermore, extra relaxations, e.g. Lagrangian relaxation, can be combined with the current decomposable structure in a nested sense using the suggested approach, which is expected to provide even faster training.
3.2 Case Study B
The proposed optimal training and feature selection method is applied to select 30%, 50%, and 70% of the features using the aforementioned refinery data. These percentages are included in the PWL-ANN method given by Problem 2.8 as lower and upper bounds on the number of features to be selected. The corresponding performances with reduced feature spaces are compared to that of the full feature space case, and the results are presented in Table 3. Please note that the results of this case study are summarized in Table 3.
Table 3 includes the training and test performances of the PWL-ANN and the standard ANN for the selected optimum features and for all features. Here, it is again worth mentioning that the optimum features are selected using the PWL-ANN method. The standard ANN entries (ANN) account only for the training of a standard ANN with the optimally selected features. For a fair comparison, the input weights (w1) are enforced to be the same for the PWL-ANN and the standard ANN. The main aim is to compare the performance of a standard ANN against the PWL-ANN with the optimum feature space.
First of all, for almost all of the training cases, the standard ANN achieves better performance than the PWL-ANN with the selected features. Similar to Case Study A, this is an expected result since the PWL formulations theoretically provide upper bounds on the original nonlinear programming problems solved for training. Nonetheless, for the 50% and 70% feature selections, the PWL-ANN test performance is slightly better than that of the standard ANN. Here, it can again be argued that this is anticipated because the optimum features are selected using the PWL-ANN method. Another possible conclusion is that once the number of features increases, more weight parameters are introduced, which in turn increases the probability of overfitting in the standard ANN. As a result, the PWL-ANN provides a robust framework due to convexity. Conversely, for the 30% case, far fewer features are selected. Accordingly, the number of weight and bias parameters is minimal, and it can be expected that the impact of non-convexity and the possibility of overfitting decrease. Hence, the test performance of the standard ANN is slightly ahead of the PWL-ANN, showing the trade-off between approximation and optimization for the PWL-ANN. On the other hand, for all cases, regardless of PWL or standard ANN, the test performance improves with the selected inputs compared to the all-features case, showing the benefit of the optimum feature selection.
Additionally, note that even small variations in a particular activation function might deliver a better network performance (Koçak and Üstündağ Şiray, 2021). Thus, an exact or very close representation of an activation function does not ensure better network performance. On the other hand, for this particular case, the selection of 30% and 50% of the features yields a similar ANN training performance. In contrast, 70% feature selection increases the test error significantly.
Fig. 3.3 includes the training and test performances using all inputs for the ANN. Since a high number of parameters is available, the training data seem to be represented almost perfectly. Nevertheless, the test data show significant prediction errors.
Fig. 3.4 includes the results for 30% of the inputs. Note that a significant test performance boost is obtained compared to Fig. 3.3. In addition, the training and test errors are relatively close, unlike Fig. 3.3, where a poor test performance is displayed.
Thus, the suggested procedure might be preferable over NLPs for specific needs such as avoiding multi-start phases for training and selecting the best features to be used for learning. In addition, linking constraints are present in the suggested formulation to prune all the connections once a particular feature is eliminated, which in turn decreases the computational complexity stemming from the larger number of binary variables. In theory, once a particular column of the input weights is zero, there is no information flow from that particular input. This is achieved by limiting the value of the corresponding weights to zero through the linking constraint, narrowing the feasible region for computational efficiency, unlike regularization methods, where large weight values are penalized (Nusrat and Jang, 2018). Such traditional regularization methods often yield low weight values while the connection still exists and might have a significant sensitivity.
Note that the optimum solutions obtained through the PWL-ANN, in general, deliver a higher training error because of (i) the approximation and (ii) the reduced feature space and the smaller number of parameters in the network. On the other hand, the same formulation results in reduced overfitting at the same time, increasing the generalization capability of both the PWL-ANN and the standard ANN. Therefore, a better test performance is obtained once the optimum features are selected.
4 Conclusion
This study proposes a mixed-integer linear programming based method for optimal and global feature selection and training of artificial neural networks on regression problems. The main idea is to introduce piece-wise linear approximations to convert the original non-convex activation (hyperbolic tangent) and objective functions in traditional feed-forward artificial neural networks into linear forms, in addition to introducing binary variables to represent the existence of particular features. Furthermore, the suggested reformulations and relaxations are also applicable to other activation functions, e.g. sigmoid or ReLU, and to classification problems, to handle both non-convexities and discontinuities. The optimal feature selection and training problem for the ANNs is formulated as a simultaneous and convex optimization problem whose optimal solution is unique. This way, the generalization issue and multi-step trainings can be avoided. Typically, regardless of the initial guesses, the corresponding optimization problems always terminate at the unique optimum. Furthermore, the convex nature of the overall method enables effective and reliable decomposition and parallelization of the corresponding optimization problems, which is a crucial advantage for the scalability of the proposed approach to larger data sets and feature spaces.
The proposed methodology is applicable to many machine learning training algorithms to address superstructure related issues, since the architecture of machine learning models, and ANNs in particular, is mostly determined by a trial-and-error procedure until a satisfactory performance is obtained. On the other hand, the number of combinations increases significantly once the number of features is high. Furthermore, with standard methods, significant test errors and practical problems occur due to overfitting, which in turn calls for the removal of unrelated or statistically less contributing features. Finally, the overall convexity of the ANN formulation enables us to obtain unique solutions for the training. Although not covered in this study, the formulations are easily extendable to activation function selection and neural network pruning under global optimality.
The suggested formulation is implemented on two industrial case studies. We demonstrate that efficient and reliable approximations are obtained even when only a few breakpoints are used for the piece-wise linear approximations. The proposed method results in a significant amount of feature space reduction, which in turn yields increased test accuracy for the ANNs. Our future work includes the implementation on larger data sets and classification problems.
Acknowledgements
This work was supported by the Outstanding Researchers Program of TUBITAK (Project No: 118C245). However, the entire responsibility for the publication belongs to the owner of the publication. The authors thank the SOCAR refinery for providing the real process data for the second case study.
Author Contributions
References
Atakulreka, A., Sutivong, D., 2007. Avoiding …, 109. https://doi.org/10.1007/978-3-540-76928-6_12
Bliek, C., Bonami, P., Lodi, A., 2014. Solving Mixed-Integer Quadratic Programming problems with IBM-CPLEX: a progress report, in: Proceedings …
Bunel, R., Lu, J., Turkaslan, I., Torr, P.H.S., Kohli, P., Pawan Kumar, M., 2019. Branch and bound for piecewise linear neural network verification. arXiv 21, 1–39.
Bunel, R., Turkaslan, I., Torr, P.H.S., Kohli, P., Pawan Kumar, M., 2017. A …
D’Ambrosio, C., Lodi, A., Martello, S., 2010. Piecewise linear approximation of functions of two variables in MILP models. Oper. Res. Lett. 38, 39–46. https://doi.org/10.1016/j.orl.2009.09.005
Ding, S., Zhao, H., Zhang, Y., Xu, X., Nie, R., 2015. Extreme learning machine: algorithm, theory and applications. Artif. Intell. Rev. 44, 103–115. https://doi.org/10.1007/s10462-013-9405-z
Doncevic, D.T., Schweidtmann, A.M., Vaupel, Y., Schäfer, P., Caspari, A., …, 2020. … IFAC-PapersOnLine. https://doi.org/10.1016/j.ifacol.2020.12.1207
Dua, V., 2010. A mixed-integer programming approach for optimal configuration of artificial neural networks. Chem. Eng. Res. Des. https://doi.org/10.1016/j.cherd.2009.06.007
Dunn, K., 2021. OpenMV.net Datasets [WWW Document].
Dutta, S., Jha, S., Sanakaranarayanan, S., Tiwari, A., 2017. Output range …
Frenzen, C.L., Sasao, T., Butler, J.T., 2010. On the number of segments …. J. Comput. Appl. Math., 437–446. https://doi.org/10.1016/j.cam.2009.12.035
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y., 2016. Binarized neural networks.
Kavzoglu, T., Mather, P.M., 1999. Pruning artificial neural networks …
Koçak, Y., Üstündağ Şiray, G., 2021. New activation functions for single layer feedforward neural network. Expert Syst. Appl. 164. https://doi.org/10.1016/j.eswa.2020.113977
Lodi, A., 2010. Mixed integer programming computation, in: 50 Years of Integer Programming 1958–2008.
Matias, T., Souza, F., Araújo, R., Antunes, C.H., 2014. Learning of a single- … Neurocomputing. https://doi.org/10.1016/j.neucom.2013.09.016
Miao, J., Niu, L., 2016. A Survey on Feature Selection. Procedia Comput. Sci.
Nawi, N.M., Atomi, W.H., Rehman, M.Z., 2013. The Effect of Data Pre-processing on Optimized Training of Artificial Neural Networks. Procedia …
Plumb, A.P., Rowe, R.C., York, P., Brown, M., 2005. Optimisation of the … three ANN programs and four classes of training algorithm. Eur. J. Pharm. …
Rister, B., Rubin, D.L., 2017. Piecewise convexity of artificial neural networks. Neural Netw. https://doi.org/10.1016/j.neunet.2017.06.009
Rosa, J.P.S., Guerra, D.J.D., Horta, N.C.G., Martins, R.M.F., 2020. …
Sattari, M.A., Roshani, G.H., Hanus, R., 2020. Improving the structure of two-phase flow meter using feature extraction and GMDH neural network. Radiat. …
Schweidtmann, A.M., Mitsos, A., 2019. Deterministic global optimization with artificial neural networks embedded. J. Optim. Theory Appl. https://doi.org/10.1007/s10957-018-1396-0
Shinano, Y., Fujie, T., 2007. ParaLEX: A parallel extension for the CPLEX mixed integer optimizer, in: Lecture Notes in Computer Science. https://doi.org/10.1007/978-3-540-75416-9_19
Sibi, P., Allwyn Jones, S., Siddarth, P., 2013. Analysis of different activation …
Sildir, H., Aydin, E., Kavzoglu, T., 2020. Design of feedforward neural networks …
Sildir, H., Sarrafi, S., Erdal, A., 2021. Data-driven Modeling of an Industrial Ethylene Oxide Plant: Superstructure-based Optimal Design for Artificial Neural Networks.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., …
Storace, M., De Feo, O., 2004. Piecewise-linear approximation of nonlinear dynamical systems. IEEE Trans. Circuits Syst. I Regul. Pap. 51, 830–842. https://doi.org/10.1109/TCSI.2004.823664
Swirszcz, G., Czarnecki, W.M., Pascanu, R., 2016. Local minima in training of neural networks, 1–12.
Wang, X.G., Tang, Z., Tamura, H., Ishii, M., Sun, W.D., 2004. An improved …
Wen, C., Ma, X., 2008. A max-piecewise-linear neural network for function approximation. Neurocomputing. https://doi.org/10.1016/j.neucom.2007.03.001
Yang, L., Liu, S., Tsoka, S., Papageorgiou, L.G., 2016. Mathematical programming for piecewise linear regression analysis. Expert Syst. Appl. 44, 156–167. https://doi.org/10.1016/j.eswa.2015.08.034
Figure 3.1. ANN training (a) and test (b) performances using all features.
Figure 3.2. ANN training (a) and test (b) performances using selected features.
Figure 3.3. ANN training (a) and test (b) performances using all features.
Figure 3.4. ANN training (a) and test (b) performances using 30% of features.
Table 1a. Training and test performance of the PWL-ANN (3, 5 and 7 pieces) and the standard ANN with 3 and 5 hidden neurons, using all features.

                       3 neurons                              5 neurons
                       PWL-ANN                     ANN        PWL-ANN                        ANN
                       3 pcs.   5 pcs.   7 pcs.               3 pcs.   5 pcs.   7 pcs.
training   MSE         0.003    0.003    0.001    0.0008      0.0023   0.0015   0.0002      0.00005
           MAE         0.046    0.046    0.026    0.022       0.0380   0.0300   0.0123      0.00500
test       MSE         0.088    0.093    0.093    0.121       0.0710   0.0722   0.0722      0.080
           MAE         0.186    0.189    0.189    0.219       0.2153   0.2212   0.2212      0.231
Table 1b. Results of Case Study A with feature selection using three hidden neurons
with three-piece approximation.
                 PWL-ANN   ANN
training   MSE   0.006     0.003
           MAE   0.050     0.041
test       MSE   0.005     0.009
           MAE   0.049     0.057
Table 2. Computational times for various number of utilized cores using CPLEX
solver.
Number of cores   CPU time [min]
15                5.55
20                5.10
40                2.50