
INTEGRATION OF MACHINE LEARNING AND DOMAIN

KNOWLEDGE FOR ENGINEERING APPLICATIONS

A THESIS

Submitted by

CHINTA SIVADURGAPRASAD

for the award of the degree

of

DOCTOR OF PHILOSOPHY

DEPARTMENT OF CHEMICAL ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY MADRAS

CHENNAI-600036, INDIA

MAY 2019
THESIS CERTIFICATE

This is to certify that the thesis titled INTEGRATION OF MACHINE LEARNING AND

DOMAIN KNOWLEDGE FOR ENGINEERING APPLICATIONS submitted by Chinta

Sivadurgaprasad, to the Indian Institute of Technology Madras, for the award of the degree

of Doctor of Philosophy, is a bona fide record of the research work done by him under my

supervision. The contents of this thesis, in full or in parts, have not been submitted to any other

Institute or University for the award of any degree or diploma.

Prof. Raghunathan Rengaswamy Prof. Sridharakumar Narasimhan


Research Guide Research Guide
Professor Professor
Dept. of Chemical Engineering Dept. of Chemical Engineering
IIT-Madras, 600 036 IIT-Madras, 600 036

Place: Chennai

Date:
This thesis is dedicated to

My parents (Chinta Vijayalakshmi and Chinta Thavitinaidu), my family members, my Ph.D.

advisor Prof. Raghunathan Rengaswamy, almighty Lord Shiva, my best friend Krishnaveni

HM and the people who triggered and encouraged my interest in numbers and math

throughout my life.

ACKNOWLEDGEMENTS

I am extremely thankful to have worked under the guidance of Prof. Raghunathan Rengaswamy. I wish to know the secrets behind his time management and his blissful smile even in hard situations. Every time I knocked on his cabin door with a problem, be it technical or personal, I came out with a solution without fail. If God gives me a chance, I would like to do another Ph.D. under his guidance. The crisp and critical suggestions that he provided were very useful for the completion of this work. I am indebted to him for the moral support and technical guidance he provided throughout my tenure. I also express my sincere thanks to my co-guide Prof. Sridharakumar Narasimhan, and my doctoral committee members, Prof. M. V. Sangaranarayanan, Prof. Preeti Aghalayam, Prof. S. Pushpavanam, Prof. R. Nagarajan and Prof. A. Kannan, for their valuable suggestions.

I am blessed to have Dr Hemanth, Dr Danny, Dr Srinivasan, Dr Reshmi and Mr Maikandan as my seniors, whose critical comments have sharpened my problem-solving skills in research. Especially, the moral and technical support from Dr Hemanth is unforgettable. I am grateful to Mr Suseendiran and Dr Deepa for the long technical and personal discussions we had together, which helped me shape my thesis and my attitude towards life. I also extend my gratitude to Mr Abhishek, Mr Deepak, Mr Arun, Mr Faheem, Mr Sathish, Mr Venkataraman, Dr Amit and other members of the SENAI research group for the fun we had in group meetings. I am very thankful to my friends Mr Yerrayya, Mr Sridhar, Mr Vinayakram, Mr Santhan, Mr Ravi, Mr Prasanth, Mr Eswar, Mr Siva, Mr Raju, Mr Moulish, Mrs Neha Aravind, Miss Madhu, Miss Priyanka, Miss Snehal, Dr Surya and Dr Sathyam Naidu, whose presence made campus life cherished and memorable. I am indebted to my best friend Krishnaveni HM for her constant motivation and support during this journey.

I also want to thank my M.Tech supervisor Dr Prakash Kotecha for the enthusiasm he created in me towards research and optimization. I also want to thank Dr Lakshmi, Mr Ganesh, Mr Sam Mathew and the training team of Gyan Data Pvt Ltd for providing a great corporate experience during my internship. I thank Mrs. Shashikala, Mrs. Saraswathi and the other office staff of the Department of Chemical Engineering, IIT Madras for their help in all the office-related work. I also thank the management of the Robert Bosch Centre for Data Science and Artificial Intelligence for providing computational facilities, and its members for sharing their experiences of machine learning applications in their respective fields.

Last but not least, I want to thank my family members for their moral support and their belief that I can do something great in life. I also want to thank my cousins Mr Ajay, Mr Hari and Mr Narendra for being there with me through thick and thin.

ABSTRACT

Model identification is crucial in chemical process industries for various applications such as process monitoring, control, etc. Over the past few decades, machine learning algorithms have been of interest for modeling due to their ability to identify complex behavior and their computational tractability. Most of these algorithms are purely data-driven, thus raising questions about their physical interpretability. Though knowledge-based or first principles models provide good interpretability of the process, formulating and solving such models is time-consuming. In this thesis, we propose different ways of integrating machine learning techniques and domain knowledge to harness the advantages of both modeling approaches, such as ease of modeling and good physical interpretability. One such framework incorporates sparsity information about the underlying functional relationships in a principal component analysis framework for linear model identification. The use of existing first principles models with machine learning approaches for structure-property predictions is also explored. The proposed frameworks demonstrate that the performance of existing approaches can be improved and that the applicability of existing models can be extended to a broad range of systems. Two new machine learning algorithms are also proposed in this thesis: one for regression and one for classification, both using a prediction error based fuzzy clustering approach to identify non-linear functional behavior or the boundary between classes.

KEYWORDS: Machine learning, domain knowledge, hybrid modeling, CSPCA, multiple

model learning, piecewise SVM, drug solubility

TABLE OF CONTENTS

ACKNOWLEDGEMENTS ........................................................................................................................ ii
ABSTRACT ................................................................................................................................................ iv
LIST OF TABLES .................................................................................................................................... viii
LIST OF FIGURES .................................................................................................................................... ix

1. Introduction ...................................................................................................................... 1

1.1 Motivation .............................................................................................................. 2


1.2 Thesis contents ....................................................................................................... 4
1.3 Organization of the dissertation ............................................................................. 5

2. Integration of process information in the PCA framework ......................................... 6

2.1 Literature survey .................................................................................................... 6


2.2 Mathematical foundations ...................................................................................... 8
2.3 PCA for Model Identification ................................................................................ 9
2.4 Model Identification with partially known constraint matrix (cPCA) ................. 11
2.4.1 Model Identification when a subset of linear relations is known ...................... 16
2.5 Model Identification with known model structure (sPCA).................................. 18
2.5.1 Case study 1 ......................................................................................................... 26
2.5.2 Case study 2 ......................................................................................................... 27
2.5.3 Case study 3 ......................................................................................................... 29
2.6 Constraint Structural PCA ................................................................................... 31
2.6.1 ECC Case study ................................................................................................... 33
2.7 Conclusion ........................................................................................................... 36

3. Generalization of first principles derived model using machine learning approaches


to predict drug solubility in binary systems ................................................................ 38

3.1 Literature survey .................................................................................................. 38


3.2 Data preparation and processing .......................................................................... 42
3.3 Feature selection .................................................................................................. 45
3.4 Single model approximations .............................................................................. 48
3.5 Multiple model approximation ............................................................................ 50
3.6 Conclusion ........................................................................................................... 64

4. Prediction error based fuzzy clustering approach using statistical analysis for
piecewise linear model identification............................................................................ 66

4.1 Literature review: ................................................................................................. 68


4.2 PE based fuzzy clustering with statistical significance testing ............................ 71
4.3 Efficacy of proposed approach to estimate static multiple linear regression
(SMLR) models ............................................................................................................... 76
4.3.1 SMLR example 1 ................................................................................................. 77
4.3.2 SMLR example 2 ................................................................................................. 78
4.3.3 SMLR example 3 ................................................................................................. 78
4.3.4 SMLR example 4 ................................................................................................. 83
4.4 Efficacy of proposed approach to identify PWARX models ............................... 84
4.4.1 PWARX example 1.............................................................................................. 85
4.4.2 PWARX example 2.............................................................................................. 85
4.4.3 PWARX example with non-linear dynamics ....................................................... 86
4.5 Efficacy of the proposed approach on two real-life case studies ......................... 88
4.5.1 Identification of energy performance of residential buildings ............................. 88
4.5.2 Identification of non-isothermal CSTR model dynamics for control .................. 89
4.6 Conclusion ........................................................................................................... 96

5. Prediction of solvation free energy of Quinone derivatives using machine learning


approaches in a QSPR framework ............................................................................... 97

5.1 Literature survey .................................................................................................. 97


5.2 Group contribution approach ............................................................................. 101
5.3 QSPR based approaches .................................................................................... 104
5.3.1 Single linear model based QSPR ....................................................................... 106
5.3.2 Neural network based QSPR ............................................................................. 107
5.3.3 Multiple model based QSPR.............................................................................. 108
5.4 Conclusion ......................................................................................................... 115

6. An adaptive prediction error based multiple model SVM classifier for binary
classification problems ................................................................................................. 116

6.1 Literature survey ................................................................................................ 116


6.2 Support vector machines .................................................................................... 121
6.3 Multiple model based SVM ............................................................................... 122

6.4 Evaluation of proposed binary classifier............................................................ 125
6.4.1 Synthetic case studies ........................................................................................ 125
6.4.1.1 Case study 1 (second-order polynomial) ................................................... 126
6.4.1.2 Case study 2 (third-order polynomial) ....................................................... 128
6.4.2 Real-life case studies.......................................................................................... 128
6.5 Conclusions ........................................................................................................ 129

7. Conclusions ................................................................................................................... 131

7.1 Incorporation of process information in the PCA framework ........................... 131


7.2 Prediction of drug solubility in binary solvent systems ..................................... 131
7.3 Prediction error based clustering approach with statistical analysis .................. 133
7.4 Prediction of solvation free energy of Quinone derivatives .............................. 133
7.5 Piecewise linear SVM: ....................................................................................... 134
7.6 Future scope ....................................................................................................... 136
REFERENCES ........................................................................................................................................ 137

LIST OF TABLES

Table 2.1 Understanding steps 1-4 of the CSPCA algorithm for case study 2.5.3 .................. 32

Table 3.1 Details of feature selection process using GA ......................................................... 47

Table 3.2 Various efficacy metrics of obtained models using both single model approaches 59

Table 3.3 Various efficacy metrics of multiple models obtained using the modified PE
approach ................................................................................................................................... 59

Table 3.4 MPD metrics of various water + cosolvent systems using several approaches ....... 62

Table 4.1 Original and converged model details of SMLR example 1 ................................... 79

Table 4.2 Original and converged model details of SMLR example 2 ................................... 80

Table 4.3 Model information of SMLR example 3 ................................................................. 80

Table 4.4 Converged model details of SMLR example 3 without statistical analysis ............ 81

Table 4.5 Converged model details of SMLR example 3 ........................................................ 82

Table 4.6 Original and converged models of SMLR example 4 using both FMC algorithms 84

Table 4.7 Original and converged models of PWARX example 1 and 2 using both FMC
algorithms ................................................................................................................................ 87

Table 4.8 Information of original and converged non-linear PWARX models for training data
set ............................................................................................................................................. 91

Table 4.9 Information of converged models and corresponding metrics for prediction accuracy
.................................................................................................................................................. 91

Table 4.10 Details of prediction accuracy of different model identification approaches ........ 95

Table 5.1 All 41 groups that are considered for the case study along with contributions ..... 103

Table 5.2 Features ranking obtained using a stepwise approach for NN-QSPR ................... 108

Table 5.3 Details of the final set of multiple models converged .......................................... 113

Table 5.4 Performance metrics of several approaches for solvation free energy estimation 114

Table 6.1 Accuracy of various approaches on test data set of real-life data sets ................... 129

LIST OF FIGURES

Figure 1.1 Various ways of integrating machine learning and domain knowledge................... 3

Figure 2.1 Flow mixing case study .......................................................................................... 11

Figure 2.2 Euclidean norm of residuals using both approaches .............................................. 19

Figure 2.3 Comparison of model estimates by sPCA and PCA at different SNRs ................. 27

Figure 2.4 Flow network of steam metering process for methanol synthesis plant ................... 28

Figure 2.5 Comparison of model estimates of steam metering process at different SNRs ....... 28

Figure 2.6 Comparison of model estimates at different SNRs for case study 3 ...................... 30

Figure 2.7 Comparison of PCA variants performance at different SNRs for case study 3 ..... 33

Figure 2.8 Flow network of simplified ECC benchmark case study ....................................... 34

Figure 2.9 Comparison of PCA variants performance at different SNRs for ECC case study 35

Figure 2.10 Comparison of PCA variants performance for fault detection ............................. 36

Figure 3.1 Generation-wise best and average fitness values for all the folds .......................... 47

Figure 3.2 Parity plots for general and log solubility predictions using both single model
approaches................................................................................................................................ 49

Figure 3.3 Multiple linear models underlying in different input partitions of data ................. 50

Figure 3.4 Prediction error based Knn strategy to identify a suitable model for a test molecule
.................................................................................................................................................. 55

Figure 3.5 Modified PE based clustering algorithm for drug solubility predictions ............... 60

Figure 3.6 MPD values of all 63 binary systems obtained using multiple models approach .. 61

Figure 3.7 Solubility profiles of two distinct binary systems at various temperatures ............ 63

Figure 4.1 Multiple Model Learning Problem Classification .................................................. 67

Figure 4.2 Flow chart of PE based clustering using variable significance testing for MML .. 75

Figure 4.3 Partition of input data for SMLR example 3 (* - C1, o - C2, - C3, + - C4) ......... 82

Figure 4.4 Simulated data - (a) 1000 data samples (training and testing) (b) Global test set .. 93

Figure 4.5 Plots of $y_k$ vs $y_{k-1}$ signifying 3 inherent clusters in simulated data for both outputs ......................................................................................................................

Figure 4.6 Residual error of the global test set using different approaches .............................. 94

Figure 5.1 Group contribution approach framework ............................................................. 102

Figure 5.2 QSPR framework to estimate the property of interest of organic molecules ....... 105

Figure 5.3 Solvation free energy original vs predicted using several approaches ................. 114

Figure 6.1 Some of the widely used machine learning techniques for classification problems
................................................................................................................................................ 120

Figure 6.2 Data distribution and averaged accuracy versus number of models for case study 1
................................................................................................................................................ 127

Figure 6.3 Data distribution and averaged accuracy versus number of models for case study 2
................................................................................................................................................ 130

Figure 7.1 Generalization of first principles model using machine learning approaches to
predict drug solubility ............................................................................................................ 132

Figure 7.2 Various structure-property relationship based frameworks to estimate the properties
of chemical compounds ......................................................................................................... 135

CHAPTER 1

Introduction

In this era of big data, it is anticipated that the performance of process industries can be

considerably improved using advanced machine learning (ML) algorithms. Machine learning

algorithms are also gaining interest in a wide variety of fields such as data compression,

scheduling of operations [1], energy management [2], material informatics [3], manufacturing

industry [4], financial forecasting [5], drug discovery [6], genetics and genomics [7] etc. The

growth of machine learning usage in various fields can be attributed to the amount of data that

is available in the respective fields and the ability of machine learning techniques to identify

complex behavior within a reasonable time. Machine learning techniques can be broadly

categorized into solving two types of problems, i.e. regression and classification. Machine

learning techniques for regression identify mathematical relationships between input features

and outputs, which are, in general, continuous variables such as the temperature of a system or the price of a commodity. Multivariate linear regression, polynomial regression, piecewise

linear regression, and neural networks are some of the machine learning techniques for

regression. Machine learning techniques for classification identify mathematical relationships

between input features and outputs, which are, in general, categorical or ordinal variables such as the gender of a person, a color, or the type of fault identified in a system. Logistic regression,

support vector machines, neural networks, and random forests are some of the machine learning

techniques for classification.

1.1 Motivation

While the use of ML is increasing, it is argued in scientific communities that ML approaches such as neural networks might not provide an interpretable physical representation of the process. It is also highlighted that data-driven models are valid only in the range of the input feature space of the collected data; extrapolation using these models may result in inaccurate predictions [8], [9]. At the same time, though first principles (FP) models have good physical interpretation, developing such models is challenging and time-consuming, and the required process knowledge may not be available a priori for most complex processes [8], [9]. Thus,

the integration of both first principles and data-based modeling can yield benefits such as ease

of modeling and improved predictions [8], [9]. Frameworks that combine both first principles

and data-driven models are referred to as hybrid modeling in process modeling communities.

One of the earliest frameworks to incorporate first principles knowledge in the form of

constraints into machine learning algorithms was proposed by Joerding and Meador [10].

Psichogios and Ungar [11] proposed a hybrid framework to obtain model parameters of a first

principles derived model using neural networks for dynamic modeling of a fed-batch reactor.

Su et al. [12] proposed a modeling framework to integrate first principles and machine learning

approaches in which the residual errors of the first principles model are predicted using neural

networks. They [12] highlighted that such integration will be beneficial provided the performance of the first principles model alone is not adequate and the residuals contain some information about the process that can be captured using machine learning techniques. Milanic et al. [13]

proposed an approach to incorporate domain knowledge in artificial neural networks for

optimizing the quality of a hydrolysis batch process, where first principles models are used to

obtain enormous augmented data and the model structure whereas neural networks are used to

obtain the model parameters. Kahrs and Marquardt [14] proposed two complementary criteria

to validate the applicability domain of hybrid models for process optimization. Any hybrid

model that satisfies these criteria is assumed to be robust over the whole operating regime. von Stosch et al. [9] reviewed the most commonly used hybrid modeling frameworks and their applications in process industries for several objectives such as monitoring, optimization, and control.

Figure 1.1 Various ways of integrating machine learning and domain knowledge

In this thesis, we categorize integration of machine learning techniques and first principles

models as depicted in Figure 1.1. The integration can be broadly classified into three major sub-categories, i.e., a major contribution from ML and a minor contribution from FP, a major contribution from FP and a minor contribution from ML, and equal contributions from both.

The first category exists in two forms, i.e., obtaining ML models from FP-derived inputs [13] and obtaining ML models with domain constraints from FP [10]. The second category also exists in two forms, i.e., obtaining model parameters using ML in FP-derived models [11] and modeling the residuals of FP models using ML [12]. The final category of approaches formulates hybrid models specific to domain applications, where the integration is a lot more

coupled and application specific. In this thesis, we use some of these integration concepts for

solving real-life engineering problems by combining domain knowledge and machine learning

techniques. It should be noted that care must be taken while building such integrated models so that the disadvantages of the individual techniques are minimized in the hybrid model. We use principal component analysis (PCA) and multiple model learning

(MML) as ML tools.
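As a concrete illustration of the second category, the Python sketch below (assuming scikit-learn is available; fp_model and all other names are hypothetical) trains a neural network on the residuals of a first principles model, in the spirit of the residual-modeling scheme of [12]:

    from sklearn.neural_network import MLPRegressor

    def hybrid_predict(fp_model, X_train, y_train, X_new):
        """Serial FP + ML hybrid (a sketch, not the scheme of [12] verbatim):
        a neural network is trained on the residuals of a first principles (FP)
        model and used as a data-driven corrector. `fp_model` is any callable
        implementing the FP prediction (a hypothetical stand-in)."""
        residuals = y_train - fp_model(X_train)            # part of y the FP model misses
        corrector = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
        corrector.fit(X_train, residuals)                  # NN learns the residual structure
        return fp_model(X_new) + corrector.predict(X_new)  # FP prediction + NN correction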

1.2 Thesis contents

• Incorporation of process information, such as a subset of the true constraints or sparsity information of the constraint matrix, in the PCA framework for model identification.

• Identification of model parameters in a first principles derived model to estimate drug solubility in binary solvent systems using machine learning approaches.

• A modified clustering approach that can perform feature selection and model identification together in a piecewise linear modeling framework.

• Machine learning models using first principles derived descriptors, i.e., inputs, to predict the solvation free energy of Quinone derivatives for flow battery applications.

• A piecewise SVM approach for binary classification.

1.3 Organization of the dissertation

In this dissertation, initially, a brief introduction to various frameworks to integrate domain

knowledge with machine learning techniques is provided in chapter 1. In chapter 2, the

structural information i.e. active variables in each linear relation are incorporated in the PCA

framework to get better model estimates. In chapter 3, the efficacy of multiple model learning

approach is examined in a quantitative structure property relationship (QSPR) framework to

predict drug solubility in binary solvent systems. In this work, a modified version of Jouyban-

Acree model [15] is generalized to predict the solubility of a drug in a given binary solvent

system at a given temperature, in which model parameters are estimated as functions of

structural descriptors/features. In this case study, input features and the model form are

obtained from domain knowledge, and a genetic algorithm is used to identify the significant input features. It is observed in the feature selection phase that some of the input features

are insignificant in particular regions of feature space. In chapter 4, to obtain significant input

features and corresponding model parameters in each partition together, without any

assumptions, a statistical testing based prediction error (PE) clustering approach is proposed.

In chapter 5, the proposed approach is used to predict the solvation free energy of Quinone

molecules as a function of first principles derived input features. In chapter 6, inspired by the

performance of PE based approaches for regression problems, the idea of prediction error and

fuzzy membership is further extended to identify the non-linear boundary between the classes

using piecewise linear SVM models. Finally, in chapter 7, conclusions and possible future

directions are provided. A detailed problem specific literature review is provided in chapters 2

to 6 followed by the respective problem statements.

CHAPTER 2

Integration of process information in the PCA framework

Model identification is crucial in process industries for various applications such as process

automation, controller implementation, etc. Neural networks [16], [17], multiple models [18],

[19], principal component analysis [20] are some of the widely used techniques for model

identification in process industries. In most chemical processes, linear models suffice due to

the linearity of the process around steady-state operating conditions and ease of

implementation. Principal component analysis (PCA) is a popular machine learning approach

widely used for dimensionality reduction and data reconciliation in scientific communities.

PCA is also used in chemical industries for process monitoring [21], [22], and fault detection and

diagnosis [20], [23]. In chemical industries, it is possible to obtain partial information about

the process states. Information about a subset of model equations or sparsity of the model

structure can be obtained in the form of process flow-sheets and heuristics. In order to get better

estimates of the process model, it is desired to incorporate this useful knowledge in the model

identification exercise.

2.1 Literature survey

Common model identification techniques lack the freedom to incorporate partial process

knowledge. Principal component analysis (PCA), one of the most widely used methods for

linear steady-state model identification, in its vanilla form does not provide the freedom to

incorporate information about model sparsity. Sparse PCA [24], though it provides a sparse representation of the data, does not inherently incorporate such prior information. It is primarily used

to find sparse representations of high dimensional datasets [25], [26]. Similarly, conventional methods do not offer a formulation that incorporates knowledge in the form of a subset of the model equations governing the system dynamics.

PCA projects a dataset onto a lower dimensional subspace, preserving the maximum variations in the dataset [27] and excluding the minimal variations, which are characterized as noise. The directions of maximum variability, called principal components (PCs), are used to obtain "useful" variations in the data, making PCA a popular denoising technique [28], [29]. The

directions of minimum variability can be used as directions orthogonal to the dataset, and thus

can be used to obtain a set of model equations for a linear process generating the dataset [27],

[30], [31]. Another approach working along similar lines is network component analysis

(NCA) [32]. NCA tries to utilize information pertaining to the network structure for model identification. Similar approaches that utilize prior knowledge about the system can be seen in various domains of engineering. A few of the closely related approaches are robust PCA and its variants [33]–[36], and sparse PCA and its variants [37], [38]. Most of these approaches have to sacrifice the simplicity of the PCA formulation to incorporate the essential system information.

In this chapter, we propose algorithms for estimating the entire model, given partial process knowledge in two particular forms: (i) cases where a subset of the model equations is known, and (ii) cases where the sparse elements of the model equations are known. As an exemplar, we use

the novel PCA formulation with minimal changes to incorporate the partial information

available for the system. For this purpose, PCA is coupled with variable sub-selection

procedures and is reported to give better estimates of the process model. The rest of the chapter

is organized as follows. Sections 2.2 and 2.3 cover the mathematical foundations and

formulation of PCA required to follow the proposed approaches. Section 2.4 discusses the case of utilizing the information about a known subset of model equations. The proposed algorithm is termed constrained PCA (cPCA). We discuss the algorithm to incorporate the sparsity

information in section 2.5, which is termed structural PCA (sPCA). In section 2.6, we

combine the proposed approaches for better estimates in a case of similar structural information

as the previous sections. Finally, we conclude the chapter by highlighting major insights from

the performance of proposed algorithms.

2.2 Mathematical foundations

We start the discussion on the model identification problem for noise-free data. PCA is one

of the most widely accepted approaches for this purpose [27], [39] but our intention lies in

presenting a novel perspective of PCA which is understated in the literature. It will also help

to develop motivation for the proposed method in the next section. Let 𝑥(𝑡) be a 𝑛 × 1 vector

consisting measurements of 𝑛 variables at time instant 𝑡. It is assumed that these 𝑛 variables

are related by 𝑚 linear equations at all time instants. This may be formally stated as

$A_0\, x(t) = 0_{m \times 1} \quad \forall t$    (1.1)

where $A_0 \in \mathbb{R}^{m \times n}$ is a time-invariant constraint matrix. In this chapter, A (the constraint matrix) is interchangeably referred to as the model. At each time instant, the measurement y(t) of all n variables is assumed to be corrupted by noise.

y t   x t   e t  t (1.2)

The following assumptions are made on the random errors:

1. $e(t) \sim \mathcal{N}(0, \sigma^2 I)$
2. $E\left[e(j)\, e^T(k)\right] = \sigma^2 \delta_{j,k}\, I_{n \times n}$
3. $E\left[x(j)\, e^T(k)\right] = 0 \quad \forall j, k$

Where E . is the usual expectation operator and e  t  is a vector of white-noise errors, with all

elements having identical variance  2 as stated above. We introduce the collection of 𝑁 such

noisy measurements as follows

$Y = \begin{bmatrix} y(0) & y(1) & \cdots & y(N-1) \end{bmatrix}^T$    (1.3)

$X = \begin{bmatrix} x(0) & x(1) & \cdots & x(N-1) \end{bmatrix}^T$    (1.4)

Given 𝑁 noisy measurements of 𝑛 variables, the objective of the PCA algorithm is to estimate

the constraint model 𝐴0 in Equation(1.1). In the next section, we formally describe theoretically

relevant aspects of PCA and subsequently pursue our problem of interest.

2.3 PCA for Model Identification

PCA or total least squares method can be formulated as an optimization problem described

below to obtain model parameters.

$\min_{A,\, x(i)} \; \sum_{i=1}^{N} \left(y(i) - x(i)\right)^T \left(y(i) - x(i)\right)$    (1.5)

subject to: $\; A A^T = I_{m \times m}; \quad A\, x(i) = 0_{m \times 1}, \; i = 1, \ldots, N$    (1.6)

where A is referred to as the model. It is well known that the PCA algorithm utilizes eigenvalue analysis, or equivalently the singular value decomposition (SVD), to solve the above optimization problem [27], [39]. So, we briefly discuss the use of the eigenvalue decomposition for deriving the model parameters.

1 nn
The sample covariance matrix of 𝑌 is defined as Sy  YY T Sy  (1.7)
N

The eigenvalue decomposition of the sample covariance matrix 𝑆𝑦 is stated as follows:


nn n n
S yU  U  U ,  (1.8)

Where  is a diagonal matrix containing the eigenvalues and 𝑈 consists of the eigenvectors

corresponding to those eigenvalues. If the noise-free measurements (𝑋 in Equation(1.4)) are

accessible, the constraint model can be derived from the eigenvectors corresponding to zero

eigenvalues. This can be intuitively seen by eigenvalue analysis for the covariance matrix of

9
noise-free measurements[39].

$S_x U^* = U^* \Lambda^*, \quad S_x = \frac{1}{N} X X^T$    (1.9)

$S_x U_0^* = U_0^*\, 0_{m \times m} = 0_{n \times m}; \quad U_0^* \in \mathbb{R}^{n \times m}$    (1.10)

$A_0 = \left(U_0^*\right)^T$    (1.11)

where the columns of $U_0^*$ contain the eigenvectors corresponding to zero eigenvalues. For the noisy measurements in Equation (1.7), the eigenvectors corresponding to "small" eigenvalues are chosen. For the homoscedastic case, it can be proved that a few of the "small" eigenvalues are asymptotically equal to each other and provide an estimate of the noise variance in each of the n variables. It should be noted that PCA provides a set of orthogonal eigenvectors which form a basis for the constraint matrix.
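As a minimal NumPy sketch of this recipe (synthetic data; all variable names are our own), one can generate data satisfying a random constraint matrix, corrupt it with noise, and recover an estimate of A_0 from the eigenvectors of S_y associated with the m smallest eigenvalues, per Equations (1.7), (1.8) and (1.11):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, N = 5, 3, 1000
    A0 = rng.standard_normal((m, n))             # true constraint matrix (m relations)
    basis = np.linalg.svd(A0)[2][m:].T           # n x (n-m) orthonormal basis of null(A0)
    X = basis @ rng.standard_normal((n - m, N))  # noise-free data: A0 @ X = 0 exactly
    Y = X + 0.1 * rng.standard_normal((n, N))    # noisy measurements, Equation (1.2)

    Sy = (Y @ Y.T) / N                           # sample covariance, Equation (1.7)
    eigvals, U = np.linalg.eigh(Sy)              # eigenvalues returned in ascending order
    A_hat = U[:, :m].T                           # m smallest-eigenvalue eigenvectors
    print(np.linalg.norm(A_hat @ X))             # near zero: estimated relations annihilate X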

It can be easily proved that PCA provides the total least squares (TLS) solution [27], but it does not grant the freedom to include any available knowledge of the process in its formulation. PCA derives the optimal decomposition based on statistical assumptions without incorporating any process information. Ignoring the underlying network structure leads to the minimum cost function value of PCA in Equation (1.5) but may drive us away from the true process. On the other hand, reformulating the optimization problem with the inclusion of a priori knowledge as constraints will lead us to a solution closer to the true process. A similar approach is adopted in sparse PCA [24], dictionary learning [37], and regularization approaches [40], [41] to derive estimates of improved quality.

In this section, we briefly discussed PCA and acquired the required background to

understand the proposed algorithms in later sections. In the next section, we discuss the

approach to utilize the information about a set of linear relations to derive the full constraint

matrix/model.

2.4 Model Identification with partially known constraint matrix (cPCA)

In this section, we assume the availability of a few linear relationships among 𝑛 variables.

Basically, it is presumed that few rows of the constraint matrix, 𝐴0 in Equation (1.1) are

available. It should be noted that all the linear relationships are not assumed to be known but

instead, only a few of them are available. We propose an algorithm termed as constrained

principal component analysis (cPCA) to utilize the partially known information of constraint

matrix. A simple case-study is considered to illustrate the key idea and assumptions. The

optimization problem for the partially known constraint matrix can be formally stated below.

$\min_{A,\, x(i)} \; \sum_{i=1}^{N} \left(y(i) - x(i)\right)^T \left(y(i) - x(i)\right)$    (1.12)

subject to: $\; A A^T = I_{l \times l}; \quad A_f\, x(i) = 0_{m \times 1}, \; i = 1, \ldots, N$    (1.13)

where $A_f = \begin{bmatrix} A_{kn} \\ A \end{bmatrix}; \quad A_f \in \mathbb{R}^{m \times n}, \; A_{kn} \in \mathbb{R}^{(m-l) \times n}, \; A \in \mathbb{R}^{l \times n}$    (1.14)

It is assumed that the (m − l) linear equations are known to the user and the remaining l are to be estimated. $A_f$ and $A_{kn}$ represent the full and known constraint matrices, respectively. It should be noted that the first constraint in (1.13) is imposed only on the unknown segment of the full constraint matrix, to obtain a unique subspace up to a rotation. Consider the simple flow mixing network example shown in Figure 2.1.

[Figure: a three-node flow network in series. Stream x1 enters node 1; x2 connects nodes 1 and 2; x3 connects nodes 2 and 3; x4 leaves node 3; and x5 runs from node 1 to node 3.]

Figure 2.1 Flow mixing case study

This network could be easily seen in various engineering disciplines like electrical circuits or

water distribution in pipelines. The flow balance at each node, at any time instant 𝑡 can be

stated as:

$x_1(t) - x_2(t) - x_5(t) = 0$  (Node 1)
$x_2(t) - x_3(t) = 0$  (Node 2)    (1.15)
$x_3(t) - x_4(t) + x_5(t) = 0$  (Node 3)

The model equations of this flow network corresponding to noise-free measurements at the three nodes can be stated as $A_0 X(t) = 0$, where,

1 1 0 0 1 
A0  0 1 1 0 0  (1.16)
0 0 1 1 1

X  t    x1  t  x2  t  x3  t  x4  t  x5  t   (1.17)

As stated earlier, the noise-free measurements X(t) are not accessible. Instead, we are supplied the noisy measurements of X(t), denoted by Y(t) in Equation (1.2). It is assumed that a collection of N such noisy measurements is available, as stated in Equation (1.4). The noise used to corrupt the true measurements is given by $e(t) \sim \mathcal{N}(0, \sigma^2 I_{5 \times 5})$, where $\sigma = 0.0909$ and $I_{5 \times 5}$ is the $5 \times 5$ identity matrix. For this case study, we assume a priori knowledge of the linear relation generated by the flow balance at node 1. Therefore,

$A_{kn} = \begin{bmatrix} 1 & -1 & 0 & 0 & -1 \end{bmatrix}$    (1.18)

One of the naive approaches would be applying PCA without utilizing the knowledge about

the known linear relation. Eigenvalue decomposition of the sample covariance matrix defined in

Equation (1.7) is adopted to obtain the constraint matrix estimate by PCA, denoted by 𝐴̂𝑝𝑐𝑎 .

The eigenvectors corresponding to the three smallest eigenvalues provide $\hat{A}_{pca}$.

 0.23 0.49 0.02 0.70 0.46 
ˆA   0.12 0.49 0.79 0.20 0.30  (1.19)
pca  
 0.74 0.39 0.05 0.32 0.44

It may be argued intuitively that applying PCA directly in the above case by ignoring the

available information will drive the user away from the true system configuration. This will be

later used for comparison to the proposed method. We proceed to discuss the proposed

algorithm, termed constrained principal component analysis (cPCA). The objective of this

algorithm is to utilize the available information and estimate only the unknown part of

the constraint matrix, as formulated in Equations (1.12) and (1.13). For any general known part of the constraint matrix $A_{kn} \in \mathbb{R}^{(m-l) \times n}$,

$A_{kn}\, y(t) = A_{kn}\, x(t) + A_{kn}\, e(t) = A_{kn}\, e(t) \quad \forall t$    (1.20)

For a collection of N measurements defined in Equation (1.4), the above may be restated as,

$A_{kn} Y = A_{kn} X + A_{kn} E = A_{kn} E$    (1.21)

To estimate a basis for the rest of the linear relations, we attempt to work with the data projected onto the null space of $A_{kn}$. This can be mathematically stated as,

$A_{kn}^{\perp} X_p = X; \quad A_{kn}^{\perp} \in \mathbb{R}^{n \times (n-m+l)}, \; X_p \in \mathbb{R}^{(n-m+l) \times N}$    (1.22)

Where Akn denotes the null space of 𝐴𝑘𝑛 . As the noise-free measurements are not available,

Equation (1.22) is restated as,

Akn X p  Y  E (1.23)


It should be noted that estimating $X_p$ given $A_{kn}^{\perp}$ and Y leads to an overdetermined set of equations, as there are n equations for each set of the (n − m + l) variables in the columns of $X_p$. This leads to a total of N × n equations in N(n − m + l) variables. An estimate of the data projected onto the null space of $A_{kn}$ can thus be obtained in a least squares sense.

$\hat{Y}_p = \left(A_{kn}^{\perp}\right)^{\dagger} Y = \left(\left(A_{kn}^{\perp}\right)^T A_{kn}^{\perp}\right)^{-1} \left(A_{kn}^{\perp}\right)^T Y$    (1.24)

where $\hat{Y}_p$ denotes an estimate of $X_p$ and $\left(A_{kn}^{\perp}\right)^{\dagger}$ denotes the pseudo-inverse of $A_{kn}^{\perp}$. The unknown part of the constraint matrix estimate, denoted by A in the full constraint matrix $A_f$ presented in Equation (1.14), can be estimated by applying PCA on the projected data $\hat{Y}_p$ shown in Equation (1.24). The sample covariance matrix of the projected data can be defined similarly to Equation (1.7),

$S_{yp} = \frac{1}{N} \hat{Y}_p \hat{Y}_p^T; \quad S_{yp} \in \mathbb{R}^{(n-m+l) \times (n-m+l)}$    (1.25)

The eigenvalue decomposition of $S_{yp}$, as defined earlier, can be written as,

$S_{yp} U_p = U_p \Lambda_p$    (1.26)

The eigenvectors corresponding to the l smallest eigenvalues in $\Lambda_p$, denoted $\hat{A}_p$, provide a basis for the constraint matrix of the data in the projected space. It should be noted that the original data in the n-dimensional space were projected into the lower (n − m + l)-dimensional space to estimate the l linear relations.

$\hat{A}_p = \left(U_p\left[:,\; (n-m+1):(n-m+l)\right]\right)^T; \quad \hat{A}_p \in \mathbb{R}^{l \times (n-m+l)}$    (1.27)

Using the above with Equations (1.22) and (1.24), the following can be stated:

$\hat{A}_p X_p = 0_{l \times N}$    (1.28)

$\hat{A}_p \left(A_{kn}^{\perp}\right)^{\dagger} X = 0_{l \times N} \;\Rightarrow\; \hat{A} X = 0_{l \times N}$    (1.29)

So, the constraint for the original n-dimensional space can be obtained from the reduced-dimensional space by using

$\hat{A} = \hat{A}_p \left(A_{kn}^{\perp}\right)^{\dagger} = \hat{A}_p \left(\left(A_{kn}^{\perp}\right)^T A_{kn}^{\perp}\right)^{-1} \left(A_{kn}^{\perp}\right)^T; \quad \hat{A} \in \mathbb{R}^{l \times n}$    (1.30)

The full constraint matrix can be obtained as stated in Equation (1.14). Revisiting the flow mixing case study of 5 variables, the full constraint matrix obtained is stated below. Please note that $A_{kn}$ is specified in Equation (1.18).

 1 1 0 0 1 
 Akn  
Aˆ pca      0.53 0.34 0.14 0.74 0.19  (1.31)
ˆ
 A   0.07 0.42 0.77 0.30 0.36
 

To investigate the goodness of the estimates from both methods, we utilize the subspace-dependence based metric stated in Narasimhan and Shah [31], briefly mentioned here. The subspace-dependence metric can be viewed as the distance between the row spaces of the true ($A_0$) and estimated ($\hat{A}$) constraint matrices. The minimum distance of each row of $A_0$ from the row space of $\hat{A}$ in a least squares sense is given by

 
m 1
   A0i  A0i Aˆ T AA
ˆ ˆT Aˆ (1.32)
i 1

The true constraint matrix specified in Equation (1.16) is used to evaluate the accuracy of the estimates obtained by PCA and cPCA, specified in Equations (1.19) and (1.31). The subspace metric defined in Equation (1.32) is used to compare the estimates.
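A direct NumPy transcription of Equation (1.32) could look as follows (a sketch; the function name is our own):

    import numpy as np

    def subspace_metric(A0, A_hat):
        """Subspace-dependence metric of Equation (1.32): summed least squares
        distance of each row of the true A0 from the row space of A_hat."""
        P = A_hat.T @ np.linalg.inv(A_hat @ A_hat.T) @ A_hat   # projector onto row space of A_hat
        return sum(np.linalg.norm(a - a @ P) for a in A0)      # residual of each true row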

PCA  0.0295, cPCA  0.0197 (1.33)

It may be easily inferred from the subspace dependence metric values that the proposed

algorithm cPCA outperforms PCA. This simple case-study with synthetic data was presented

for the ease of understanding the notations and demonstrating the key idea of cPCA. In this

section, the discussion started from the problem of estimating the constraint matrix when a

subset of linear relations is already specified. Ideally, one could easily formulate this problem as a squared error cost function with appropriate constraints, as specified in Equations (1.12) and (1.13). But, unfortunately, the inclusion of the a priori available linear relations deviates the cPCA optimization problem specified in Equations (1.12) and (1.13) from the standard PCA optimization problem stated in Equations (1.5) and (1.6).

The novel contribution of this work is to wisely utilize the available information about a subset of linear relations and transform the original problem stated in Equations (1.12) and (1.13) into a PCA-friendly framework. This rewarding step provides us the freedom to include the prior available information and also the ease of implementation through the analytical solution by PCA. Basically, this is performed in two steps. The first is projecting the data onto the null space of the known linear relations, and the second is applying PCA in the reduced space. Finally, the obtained solution is transformed back from the reduced to the original space. The pseudo code of the proposed cPCA algorithm is given below. We show the efficacy of the proposed algorithm over PCA on another multivariable case-study in the next subsection.


1. Obtain the null space $A_{kn}^{\perp}$ for the given set of (m − l) linear relations among n variables.
2. Obtain the projection of the data, $\hat{Y}_p$, onto the null space $A_{kn}^{\perp}$ using Equation (1.24).
3. Apply PCA on the lower-dimensional projected data $\hat{Y}_p$ to obtain $\hat{A}_p$.
4. Transform the estimated $\hat{A}_p$ back to the original space using Equation (1.30).
5. Construct the full constraint matrix using Equation (1.14).
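These five steps translate almost line for line into NumPy/SciPy. The sketch below is one possible implementation (function and variable names are our own; the data matrix is assumed to be arranged as n x N, variables in rows):

    import numpy as np
    from scipy.linalg import null_space

    def cpca(Y, A_kn, l):
        """Sketch of the cPCA steps above. Y is the n x N noisy data matrix,
        A_kn holds the (m - l) known linear relations as rows, and the l
        unknown relations are estimated. Returns the full m x n constraint
        matrix of Equation (1.14)."""
        Nkn = null_space(A_kn)                  # step 1: n x (n-m+l) null-space basis
        Yp = np.linalg.pinv(Nkn) @ Y            # step 2: projected data, Equation (1.24)
        Syp = (Yp @ Yp.T) / Y.shape[1]          # sample covariance in the projected space
        _, Up = np.linalg.eigh(Syp)             # step 3: PCA in the reduced space
        Ap = Up[:, :l].T                        # l smallest-eigenvalue eigenvectors
        A_unk = Ap @ np.linalg.pinv(Nkn)        # step 4: back to original space, Eq. (1.30)
        return np.vstack([A_kn, A_unk])         # step 5: stack as in Equation (1.14)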

2.4.1 Model Identification when a subset of linear relations is known

In this case study, we intend to study the goodness of the estimates obtained by the constrained

PCA algorithm. For this purpose, we consider a system with 5 linear relations among 10

variables. The constraint matrix $A_0$ of dimension $5 \times 10$, corresponding to 5 linear relations, is chosen randomly. It follows:

$A_0\, x(t) = 0_{5 \times 1}; \quad A_0 \in \mathbb{R}^{5 \times 10}, \; x(t) \in \mathbb{R}^{10 \times 1}$    (1.34)

16
where x(t) is a column vector containing the noise-free measurements of the 10 variables at time instant t. A thousand such noise-free measurements of x(t) are generated using the null space of $A_0$. It can be inferred from Equation (1.34) that X lies in the null space of $A_0$. Hence, the noise-free data are generated as linear combinations of a null space basis, with the coefficients chosen randomly. We use Equation (1.2) for generating the noisy data Y. The noise is characterized by $e(t) \sim \mathcal{N}(0, \sigma^2 I_{10 \times 10})$ with standard deviation $\sigma = 0.0113$. The proposed cPCA algorithm can be applied when a subset of the linear constraints is known a priori. In this case study, we stepwise increase the number of linear constraints known a priori and observe the effect on the quality of the results. For the purpose of comparison, the constraint matrix is estimated via the traditional approach (PCA) and the proposed algorithm (cPCA). For a given $\hat{A}$, the reconciled estimate of the measurements, $\hat{Y}$, can be computed.
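One standard way to compute this reconciled estimate, requiring the reconciled values to satisfy $\hat{A}\hat{Y} = 0$, is the least squares projection of each measurement onto the null space of $\hat{A}$ (a sketch of the usual orthogonal projection):

$\hat{Y} = \left(I_n - \hat{A}^T \left(\hat{A}\hat{A}^T\right)^{-1} \hat{A}\right) Y$

which reduces to $\hat{Y} = \left(I_n - \hat{A}^T \hat{A}\right) Y$ when the rows of $\hat{A}$ are orthonormal, as is the case for the PCA estimate.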

The accuracy of the estimated model is characterized by the 2-norm of the error. The error is usually calculated with respect to the noisy measurements, since only noisy measurements are available in practice, but in this case we also calculate the error with respect to the true measurements for the purpose of comparison.

$Err_{meas} = \left\lVert Y - \hat{Y} \right\rVert_2, \qquad Err_{true} = \left\lVert X - \hat{Y} \right\rVert_2$    (1.35)

where $\hat{Y}$ is estimated by the PCA and cPCA algorithms. $Err_{meas}$ can also be visualized as the

cost function in the PCA algorithm as stated in (1.5). The results by both the algorithms are

presented in Figure 2.2.

It can be inferred from Figure 2.2:

1. The PCA algorithm gives a lower cost function value compared to cPCA when the error is calculated with respect to the measurements. This is not surprising, as the cPCA algorithm has the same objective function with additional constraints. It should be noticed that as the number of linear relations known a priori is increased, the difference in the cost function between cPCA and PCA increases as additional constraints are included.

2. It can be inferred from the plot of $Err_{true}$ that the constraint matrix estimate by the cPCA algorithm is much closer to the true process compared to PCA. It should be noted that as more linear relations are supplied, the estimate by cPCA is driven towards the true values.

In this case study, we demonstrated the efficacy of the estimated constraint matrix by the

proposed cPCA algorithm for a complex network. In the next section, we consider a tougher

problem of estimating the model when the structure of the constraint matrix is known instead

of a subset of linear relations as seen in this section.

2.5 Model Identification with known model structure (sPCA)

In this section, we address a more challenging and practical problem of incorporating the

knowledge about the structure of the entire constraint matrix. This essentially means we assume

to have a priori knowledge of the set of variables that combine to satisfy each linear relationship. For example, the structure of the constraint matrix for the flow mixing case study presented in Figure 2.1 would be

* * 0 0 *
structure  A0   * * * 0 0 (1.36)
0 0 * * * 

The above structure provides us the essential information about the set of variables combining

linearly at each node of the flow network. This information about which variables are related

by a linear relation may be easily available in flow distribution networks [31]. Utilizing this

valuable information in the formulation of the optimization problem (one optimization problem

for each constraint) for estimation of the constraint matrix will lead us to a better solution as

discussed earlier.

Figure 2.2 Euclidean norm of residuals using both approaches

In this section, we present a novel approach to estimate the constraint matrix of a given

structure without getting mired in explicit sparsity constraints. The key difference between the methodology of the proposed algorithm and the existing frameworks is that each row of the constraint matrix, i.e., each linear relation, is estimated separately rather than the whole

constraint matrix. The linear relations estimated sequentially are stacked together at the end to

construct the entire constraint matrix. This idea of estimating linear relations separately equips

us with considerable freedom to incorporate the structural constraints without diving into

sparsity constraints. Of course, it brings in some new challenges which are addressed in a

detailed manner. In order to demonstrate a wide range of challenges and the proposed remedies,

simple constraint matrices are considered. The first step of the proposed algorithm is rearranging the rows of the constraint matrix structure in ascending order of the cardinality of non-zero elements in each row. We chose an example where this step can be skipped to illustrate the key idea, but it is explained later in the section.

Linear relations corresponding to each row of the constraint matrix structure are separately estimated via sub-selection of variables. For example, in the flow mixing case study, the constraint matrix given in Equation (1.16) can be estimated by applying PCA to the subset of variables participating at each node separately. For instance, at node 1 in Figure 2.1, variables $y_1$, $y_2$ and $y_5$ will be considered.

Ysub1  t    y1  t  y2  t  y5  t   (1.37)

Applying PCA on a collection of N measurements of $Y_{sub1}(t)$ will deliver a row vector $A_{sub1}$ of dimension $1 \times 3$ such that $A_{sub1} X_{sub1}(t) = 0$, where $X_{sub1}(t)$ contains the noise-free measurements of the sub-selected set of variables corresponding to $Y_{sub1}(t)$ in Equation

(1.37). It should be noted that the estimated constraint row vector will only contain the non-zero entries corresponding to the sub-selected variables. Basically, we mean that the structure will be,

$\hat{A}_{sub1} = \begin{bmatrix} \hat{a}_{11} & \hat{a}_{21} & \hat{a}_{51} \end{bmatrix}$    (1.38)

Where 𝑎̂𝑖1 correspond to the coefficient of 𝑖 𝑡ℎ variable. The desired structure for the first row

of the constraint matrix could be constructed by appending zeros at the desired locations as

shown below

$\hat{A}_1 = \begin{bmatrix} \hat{a}_{11} & \hat{a}_{21} & 0 & 0 & \hat{a}_{51} \end{bmatrix}$    (1.39)

This procedure can be similarly applied at nodes 2 and 3 in Figure 2.1 to estimate the row constraint vectors $\hat{A}_2$ and $\hat{A}_3$, respectively. The entire constraint matrix can be constructed by stacking the estimated linear relations.
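For the flow mixing structure in Equation (1.36), this node-wise estimation can be sketched in a few lines of NumPy (names are our own; the data matrix Y is assumed to be n x N, variables in rows):

    import numpy as np

    def spca_row(Y, support, n):
        """Sketch of one sPCA step: estimate a single linear relation whose
        non-zero entries are restricted to `support` (e.g. [0, 1, 4] for node 1
        of Figure 2.1). PCA is applied to the sub-selected variables only, and
        the coefficients are embedded back at the desired locations, as in
        Equation (1.39)."""
        Ysub = Y[support, :]                     # measurements of participating variables
        Ssub = (Ysub @ Ysub.T) / Y.shape[1]      # covariance of the sub-selected set
        _, U = np.linalg.eigh(Ssub)
        row = np.zeros(n)
        row[support] = U[:, 0]                   # eigenvector of the smallest eigenvalue
        return row

    # Stacking the rows estimated node by node gives the full constraint matrix:
    # A_hat = np.vstack([spca_row(Y, s, 5) for s in ([0, 1, 4], [1, 2], [2, 3, 4])])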

The true constraint matrix specified in Equation (1.16) and the subspace dependence metric of Equation (1.32) are used to evaluate the efficacy of the constraint matrix estimated by the proposed algorithm, which we term structural principal component analysis (sPCA). The proposed algorithm is tested over 1000 Monte Carlo (MC) runs with SNR = 10, and the averaged subspace dependence metric is reported in Equation (1.40). It can be easily inferred from Equation (1.40) that the sPCA estimate is much closer to the true constraint matrix compared to PCA.

$\eta_{PCA} = 0.1293, \quad \eta_{sPCA} = 0.1188$    (1.40)

It is interesting to note from Figure 2.1 that nodes 1, 2 and 3 can be considered as a single node to derive a linear relation among the variables $x_1$ and $x_4$. So applying traditional PCA may reveal the linear relation among the variables $x_1$ and $x_4$. Unfortunately, this phenomenon creates a challenging issue, which can be dealt with by an appropriate modification of the sPCA approach discussed previously. To illustrate this phenomenon, let us consider another simple example of a desired constraint matrix structure, stated below:

* * * * 0 *
* * * * 0 0 
structure  A0    (1.41)
* 0 * 0 0 0
 
* * * * 0 0

We intend to estimate each linear relation separately, starting from the first row of $structure(A_0)$ specified in Equation (1.41). The sub-selected variables would be

Ysub1  t    y1  t  y2  t  y3  t  y4  t  y6  t   (1.42)

Applying PCA on N measurements of $Y_{sub1}(t)$ may not deliver the desired structure specified in the first row of $structure(A_0)$ in Equation (1.41). This may occur because the complementary sets of zero locations in rows 2, 3 and 4 of $structure(A_0)$ in Equation (1.41) are subsets of the complementary set of zero locations in row 1. It basically means that the idea of applying PCA on sub-selected variables does not guarantee a non-zero coefficient for the selected variables. Sub-selection only guarantees a zero coefficient for the discarded variables. Ignoring this fact could lead us to estimate a linear relation corresponding to the structure specified in row 2 of Equation (1.41) when we intend to estimate the relation corresponding to the structure of row 1. If we ignore the above scenario and proceed to estimate the 2nd row of the constraint matrix with the desired structure by sub-selection of variables, we may end up estimating the same previously estimated linear relation. This may also lead us to miss the first constraint, as the variable $x_6(t)$ will not be sub-selected in any of the consecutive iterations.

We propose a novel approach to deal with such a scenario. The primary concern is the ambiguity in whether the estimated relationship has the intended structure; this ambiguity arises mainly because constraints with more zero entries are estimated afterward. Such a situation can be avoided by re-configuring the structure of the given constraint matrix: since we want constraints with fewer zeros to be estimated later, the corresponding rows are pushed down. That is, the constraint matrix is re-structured in ascending order of the number of non-zero locations in each row. The objective of this step is to avoid re-estimating individual constraints that have already been estimated. The constraint matrix in Equation (1.41) can be re-structured as

* 0 * 0 0 0
* * * * 0 0 
structure  A0    (1.43)
* * * * 0 0
 
* * * * 0 *

This step ensures that constraints with a lower cardinality of non-zero elements are obtained before constraints with a higher cardinality, but it still does not resolve the ambiguity of obtaining the same linear relation (constraints with a similar structure) when constraints involving more variables are to be estimated. We propose a two-step remedy, illustrated as follows:

1. Detection: Such cases can be identified by a rank check on the linear relation obtained at each step. Let the constraint matrix estimated up to the $i^{th}$ row be $\hat{A}_i$ and the linear relation obtained for the $(i+1)^{th}$ row be $\hat{a}_{i+1}$. If the constraint obtained at the $(i+1)^{th}$ step is just a linear combination of previously estimated constraints, then the rank of $\begin{bmatrix} \hat{A}_i \\ \hat{a}_{i+1} \end{bmatrix}$ will be the same as the rank of $\hat{A}_i$. This idea is used to detect a previously estimated constraint.

2. Identification: It should be noted that the cause for detecting a previously estimated constraint is the existence of multiple constraints over the same variable subset. In order to filter the right constraint from this set, the rank check is utilized again. Let the full row rank constraint matrix estimated up to the $i^{th}$ row be $\hat{A}_i$. For the $(i+1)^{th}$ row, we propose to consider all the eigenvectors instead of only the eigenvector corresponding to the minimum eigenvalue, because the set of all eigenvectors contains all the constraints identified up to the $(i+1)^{th}$ iteration. For example, in the 2nd iteration for the structure provided in Equation (1.43), the subset of variables would be

$$Y_{sub2}(t) = \begin{bmatrix} y_1(t) & y_2(t) & y_3(t) & y_4(t) \end{bmatrix} \quad (1.44)$$

Applying PCA on $N$ measurements of $Y_{sub2}(t)$ should ideally reveal 3 linear relations, but it is known from the given structure that only 2 linear constraints exist for this particular row structure. Those 2 linear relations can be filtered from the 3 constraints using a rank check. The above procedure is formally stated below.

We define the matrix $\hat{B}_{i+1}$, which contains the eigenvectors along its rows in the $(i+1)^{th}$ iteration. It should be noted that these eigenvectors are arranged along the rows such that the eigenvalues increase with increasing row number. Let the dimension of $\hat{B}_{i+1}$ be $n_{i+1} \times n_{i+1}$ and its $j^{th}$ row be denoted by $\hat{b}_{i+1,j}$. First, we make the hypothesis that the $j^{th}$ row of $\hat{B}_{i+1}$, namely $\hat{b}_{i+1,j}$, contains an independent constraint. We define

$$\hat{A}_{i,j} = \begin{bmatrix} \hat{A}_i \\ \hat{b}_{i+1,j} \end{bmatrix} \quad (1.45)$$

To test the hypothesis, we compare the ranks of $\hat{A}_{i,j}$ and $\hat{A}_i$. If the ranks of both matrices are equal, then $\hat{b}_{i+1,j}$ is rejected; otherwise, $\hat{A}_i$ is updated using Equation (1.46) because it contains a new relation:

$$\hat{A}_i = \begin{bmatrix} \hat{A}_i \\ \hat{b}_{i+1,j} \end{bmatrix} \quad (1.46)$$

The number of constraints to be chosen in this $(i+1)^{th}$ iteration is known from the given structure; let it be $m_{i+1}$. This process of detecting and filtering the right constraints is carried out until $m_{i+1}$ constraints are identified, as sketched below.
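The following is a minimal MATLAB sketch of this detection-and-identification loop (illustrative only, not the thesis code; A_hat, B and m_next are hypothetical names for the accepted constraint matrix, the zero-padded eigenvector matrix of the current iteration, and the expected number of constraints for this row structure):

```matlab
% Minimal sketch: rank-based detection (reject dependent candidates) and
% identification (accept new relations) in iteration i+1. Rows of B are the
% eigenvectors, ordered by increasing eigenvalue and zero-padded to full width.
% With noisy data, rank() would in practice need an explicit tolerance.
accepted = 0;
for j = 1:size(B, 1)
    candidate = [A_hat; B(j, :)];       % hypothesis: row j is an independent constraint
    if rank(candidate) > rank(A_hat)    % rank increases: new relation, Equation (1.46)
        A_hat    = candidate;
        accepted = accepted + 1;
    end                                 % ranks equal: previously estimated, reject
    if accepted == m_next               % all expected constraints found
        break;
    end
end
```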

The estimated constraint matrix can easily be reconfigured according to the originally specified structure once all the constraints are estimated for the re-structured $A_0$. In this section, we discussed the main theme of sub-selecting variables in the proposed algorithm with the help of the flow-mixing case study, which demonstrated the efficacy of the results obtained via the proposed algorithm. The various challenges and their remedies were then illustrated with the help of another constraint matrix. The pseudo code of the proposed sPCA algorithm is provided below. Three diverse case studies are presented in the next sub-section to show the utility and performance of the proposed algorithm.

1. Given the structure of the constraint matrix 𝐴𝑠𝑡𝑟𝑢𝑐𝑡 of dimension (𝑚 × 𝑙) configure it such

that 𝑓(𝑖 + 1) ≥ 𝑓(𝑖); ∀𝑖 ∈ {1, … , 𝑚 − 1} where 𝑓(𝑖) is the number of non-zero elements

in row 𝑖 of 𝐴𝑠𝑡𝑟𝑢𝑐𝑡 . Let the re-configured matrix be 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 . Let 𝐺(𝑗) be the count of the

number of rows in 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 having a similar structure with 𝑗 𝑡ℎ row of 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 . Initialize

𝐴̂𝑒𝑠𝑡,𝑖 = [ ] for iteration 𝑖 = 1.

2. For iteration 𝑖 ≥ 2, perform the structure similarity test of 𝑖 𝑡ℎ and 𝑗 𝑡ℎ rows of 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 ,

where 𝑗 ∈ {1, … , (𝑖 − 1)}. If there is any match, discard the 𝑖 𝑡ℎ row of 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 and revisit

step 2 with 𝑖 = 𝑖 + 1, else proceed to next step.

3. For iteration 𝑖, apply PCA on the sub-selected set of variables from 𝑌 corresponding to the

structure of 𝑖 𝑡ℎ row of 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 . Let the number of sub-selected variables and

measurements matrix be 𝑛𝑠𝑢𝑏,𝑖 and 𝑌𝑠𝑢𝑏,𝑖 respectively. Collect all eigenvectors of the sample

covariance matrix of 𝑌𝑠𝑢𝑏,𝑖 to obtain 𝐴̂𝑠𝑢𝑏,𝑖 .

4. Include zeros in 𝐴̂𝑠𝑢𝑏,𝑖 corresponding to the structure of 𝑖 𝑡ℎ row of 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 to obtain 𝐴̂𝑖 .

Note that the dimension of 𝐴̂𝑖 is 𝑛𝑠𝑢𝑏,𝑖 × 𝑛.

5. Filter the correct linear relations by performing a rank test on constraints identified in

iteration 𝑖. For 𝑘 = {1, … , 𝑛𝑠𝑢𝑏,𝑖 }.

$$\hat{A}_{est,i} = \begin{cases} \hat{A}_{est,i}, & \mathrm{rank}\left(\hat{A}_{est,i}\right) = \mathrm{rank}\left(\hat{A}_{est,i,k}\right) \\[4pt] \begin{bmatrix} \hat{A}_{est,i} \\ \hat{A}_i(k,:) \end{bmatrix}, & \mathrm{rank}\left(\hat{A}_{est,i}\right) \neq \mathrm{rank}\left(\hat{A}_{est,i,k}\right) \text{ and } nrow\left(\hat{A}_{est,i}\right) < nrow\left(\hat{A}_{est,i-1}\right) + G(i) \end{cases} \quad (1.47)$$

where $\hat{A}_{est,i,k} = \begin{bmatrix} \hat{A}_{est,i} \\ \hat{A}_i(k,:) \end{bmatrix}$, $\hat{A}_i(k,:)$ denotes the $k^{th}$ row of $\hat{A}_i$, $nrow(\hat{A}_{est,i})$ denotes the number of rows in $\hat{A}_{est,i}$, and $G(i)$ is defined in step 1. Terminate this step once $nrow(\hat{A}_{est,i}) = G(i) + nrow(\hat{A}_{est,i-1})$ to improve computational efficiency.

6. Repeat the entire procedure from step 2 until all $m$ constraints have been estimated, i.e., until $nrow(\hat{A}_{est}) = m$.

7. Map the estimated constraint matrix to the original form supplied by the user in step 1.

2.5.1 Case study 1

This is a synthesised case study to show the efficacy of the proposed approach when the structure of the constraints is known. The original constraints and the corresponding structural information are given below. The constraint matrix involves six variables, two of which do not participate in (i.e., are absent from) the constraints considered in this case study.

$$A_0 = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 2 & 3 & 0 & 0 & 0 \\ 3 & 1 & 1 & 2 & 0 & 0 \end{bmatrix} \qquad structure(A_0) = \begin{bmatrix} * & * & 0 & 0 & 0 & 0 \\ * & * & * & 0 & 0 & 0 \\ * & * & * & * & 0 & 0 \end{bmatrix} \quad (1.48)$$

To compare the proposed sPCA approach with traditional PCA, 500 MC simulations were run for SNR values of 10, 20, 50, 100, 200, 500, 1000 and 5000. For each MC simulation at each SNR value, data is generated for 1000 random samples. The subspace dependence metric is evaluated for each estimated constraint matrix and averaged at each SNR value. These metric values are reported in Figure 2.3; it can be observed from the figure that including the available process information improves the estimates.

Figure 2.3 Comparison of model estimates by sPCA and PCA at different SNRs

2.5.2 Case study 2

The system considered in this case study is the steam metering process, which has been considered by many researchers for testing data reconciliation and gross error detection approaches [31], [42], [43]. The network contains 28 flow variables and 11 flow constraints. The data is generated by varying 17 flows (F4, F6, F10, F11, F13, F14, F16 - F22, F24, F26 - F28) independently using a first order ARX model for 1000 time samples; the flow rates of the remaining flows are obtained by using the flow constraints at each time sample. The flowsheet of the steam metering process can be seen in Figure 2.4.

Assuming the structure of the plant is known, the flow constraint matrix is estimated using both PCA and sPCA for 1000 runs at each SNR value. The mean closeness measure of the constructed constraint matrices to the original matrix for different SNR values is provided in Figure 2.5. It is interesting to note that, except at SNR 10, sPCA delivers better estimates than PCA in all 1000 runs.

Figure 2.4 Flow network of the steam metering process for a methanol synthesis plant

Figure 2.5 Comparison of model estimates of the steam metering process at different SNRs

2.5.3 Case study 3

We intend to show the superiority of the model estimates obtained by the sPCA algorithm in this simulation study. We consider the system with the constraint model mentioned in Equation (1.41). The model is assumed to be $A_0 X = 0$, where

$$A_0 = \begin{bmatrix} 3 & 1 & -1 & 2 & 0 & -6 \\ 2 & 1 & -2 & 1 & 0 & 0 \\ 1 & 1 & -1 & 0 & 0 & 0 \\ 1 & -3 & 1 & 1 & 0 & 0 \end{bmatrix} \qquad structure(A_0) = \begin{bmatrix} * & * & * & * & 0 & * \\ * & * & * & * & 0 & 0 \\ * & * & * & 0 & 0 & 0 \\ * & * & * & * & 0 & 0 \end{bmatrix} \quad (1.49)$$

It should be noted that the structure of $A_0$ in Equation (1.49) matches the structure specified in Equation (1.41). Data is generated with the same procedure followed in section 2.4.1. We perform MC simulations of 100 runs at various signal-to-noise ratios (SNRs) to demonstrate the goodness of the estimates obtained by the proposed algorithm, sPCA. For the purpose of comparison, the model was also estimated with the PCA algorithm, and the subspace dependence metric defined in Equation (1.32) is used to evaluate the quality of the obtained estimates. The averaged subspace dependence metric for both algorithms at each SNR value can be seen in Figure 2.6.

From the plot, it can easily be noticed that sPCA outperforms PCA at SNRs above 50. In order to improve the performance at the other noise levels, we propose a combination of cPCA and sPCA in the next section. The key idea is to account for constraints with similar structure during model estimation. For example, rows 2 and 4 in Equation (1.49) have the same structure. This information is used to modify the proposed algorithm slightly.

Figure 2.6 Comparison of model estimates at different SNRs for case study 3
2.6 Constraint Structural PCA

Structural PCA performed better than PCA when the structural information of the network is known. cPCA also showed better performance than PCA when one or more true equations are known (or obtained). In this section, we propose a combination of the cPCA and sPCA algorithms, termed CSPCA, which shows an improvement over sPCA. We discussed the approach of estimating each linear relation corresponding to a structure separately in section 2.5; all those linear relations were estimated independently, in a sequential manner. The key idea in this section is to utilize the information derived up to the $(i-1)^{th}$ row of the model when estimating the $i^{th}$ row.

This combined algorithm can be utilized in the presence of repeated constraints (i.e., two or more equations involving the same set of variables) or sub-structured constraints (i.e., the variable set involved in one equation is a subset of the variable set involved in another equation) in the available structural information. It is interesting to note that, in the absence of repeated or sub-structured constraints in the provided structural information, this algorithm reduces to sPCA. The pseudo code of the algorithm is as follows:

1. Arrange the constraints in ascending order of the number of variables involved in the individual equations.

2. For all constraints 1 to 𝑚, identify the variables set 𝜑𝑖 that are active in each constraint i.e.

𝜑𝑖 = {𝑗 | 𝐴(𝑖, 𝑗) ≠ 0}.

3. Now for each constraint 𝑖, identify the constraints (𝑗 from 1 to 𝑖 − 1) such that 𝜑𝑗 is a subset

of 𝜑𝑖 and store the sub-structured constraints indices set 𝜓𝑖 i.e. 𝜓𝑖 = {𝑗 | 𝜑𝑗 ⊆ 𝜑𝑖 ; ∀𝑗 =

{1, … , (𝑖 − 1)}}

4. Now for each constraint 𝑖, if the sub-structured constraints indices set 𝜓𝑖 is empty then label

the equation as 𝑆 else 𝐶 i.e. 𝐿𝑎𝑏𝑒𝑙𝑖 = {𝑆: |𝜓𝑖 | = 0 𝑒𝑙𝑠𝑒 𝐶}

5. Now, for all the constraints that are labeled as 𝑆 estimate the constraints using sPCA by

using structural information of individual constraints.

6. Now, for all the constraints that are labeled as 𝐶 estimate the constraints using cPCA,

assuming the estimated constraints set in 𝜓𝑖 as known.

7. Rearrange the equations in the given order and report the final estimated $\hat{A}$. (A minimal sketch of the labeling logic in steps 2-4 is given below.)
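As referenced in step 7, the following is a minimal MATLAB sketch of steps 2-4 (illustrative only, not the thesis code; Astruct is a hypothetical name for the re-arranged structure matrix, with rows assumed already sorted in ascending order of active variables, and non-zero entries marking active positions):

```matlab
% Minimal sketch of steps 2-4: active-variable sets phi_i, sub-structured
% index sets psi_i, and the S/C labels.
m     = size(Astruct, 1);
phi   = cell(m, 1);                        % variables active in each constraint
psi   = cell(m, 1);                        % indices of sub-structured constraints
label = repmat('S', m, 1);                 % default label: estimate with sPCA
for i = 1:m
    phi{i} = find(Astruct(i, :) ~= 0);     % step 2: active variable set
    for j = 1:i-1
        if all(ismember(phi{j}, phi{i}))   % step 3: phi_j is a subset of phi_i
            psi{i} = [psi{i} j];
        end
    end
    if ~isempty(psi{i})
        label(i) = 'C';                    % step 4: estimate with cPCA, psi_i known
    end
end
```

Note that equality of variable sets (repeated constraints) also satisfies the subset test, so repeated structures are labeled C as well, consistent with Table 2.1.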

Steps 1-4 of the above algorithm detect which constraints can be identified using sPCA and which using cPCA. For the case study described in section 2.5.3, steps 1-4 are performed and tabulated in Table 2.1 for a better understanding of the proposed CSPCA algorithm. The efficacy of the proposed algorithm on the case study described in section 2.5.3 is provided in Figure 2.7; it can be observed from Figure 2.7 that CSPCA outperforms sPCA and PCA at various noise levels.

Table 2.1 Understanding steps 1-4 of the CSPCA algorithm for case study 2.5.3

| Rearranged index | Constraint | Variables set $\varphi_i$ | Sub-structured constraints set $\psi_i$ | Label |
|---|---|---|---|---|
| 1 | [1, 1, -1, 0, 0, 0] | {1,2,3} | {} | S |
| 2 | [2, 1, -2, 1, 0, 0] | {1,2,3,4} | {1} | C |
| 3 | [1, -3, 1, 1, 0, 0] | {1,2,3,4} | {1,2} | C |
| 4 | [3, 1, -1, 2, 0, -6] | {1,2,3,4,6} | {1,2,3} | C |

Figure 2.7 Comparison of PCA variants performance at different SNRs for case study 3

2.6.1 ECC Case study

This system is a simplified version of the Eastman Chemical Company benchmark case study for testing process control and monitoring methods [44]. It involves 10 flows and 6 flow constraints; hence, the data is generated by varying 4 flows (F1, F5, F7, and F8) for 1000 time samples. F1 and F2 are mixed streams of reactants A and B with different compositions. F3 is a product stream carrying excess reactants A and B, which are separated using a separator. F4 is a pure product stream, whereas F9 and F10 are pure recycle streams of reactants A and B respectively. The flow network along with the flow constraints can be observed in Figure 2.8.

Figure 2.8 Flow network of simplified ECC benchmark case study

The last flow constraint is a material balance on component A at J1. Assuming the structure of the process is known, the flow constraint matrix is estimated over 1000 runs of MC simulations using PCA, sPCA, and CSPCA for different SNRs. The subspace dependence metric of the constructed constraint matrices with respect to the original matrix for different SNR values is provided in Figure 2.9, along with the number of runs in which each algorithm achieved the better comparison metric.

The flow constraint matrices constructed using the different algorithms are then tested for identifying faults in the flow rates. If the flow rates at a particular time violate the constraint matrix (the sum of the residuals exceeds a tolerance limit), the sample is considered faulty. For each SNR value (10, 20, 50, 100, 200, 500, 1000 and 5000), 50 noise-added data samples are selected at random and, in each sample, one of the variables is randomly modified to make the sample faulty. The flow constraint matrices obtained over the 1000 MC runs at each SNR value are averaged and taken as the final set of flow constraints. This final set, obtained using the proposed approaches, is tested for fault identification with a tolerance limit of 1. The number of original faults, obtained using the original constraint matrix at the same tolerance, is reported in Figure 2.10 along with the number of faults identified using the proposed approaches. It can be observed from the figure that CSPCA performs better than sPCA, which in turn is superior to PCA.

Figure 2.9 Comparison of PCA variants performance at different SNRs for ECC case study

Figure 2.10 Comparison of PCA variants performance for fault detection

2.7 Conclusion

In this study, we have formulated model identification schemes for process models with known structure. To the best of our knowledge, this is the first time such a scheme has been proposed. Implementation of the techniques in a few cases suggests an improvement over conventional PCA. Any additional process knowledge can be incorporated either by sub-selection of variables or by reducing the dimensionality of the variable set using the said process knowledge. We also proposed a model identification algorithm for the case when a few of the linear relations are known a priori; this was termed constrained PCA. We then proposed the combination of cPCA and sPCA, which provided a further improvement in performance compared to vanilla PCA and sPCA. The key idea in the integration of the two algorithms was to use the information provided by previously estimated linear relations for estimating further relations. We have also provided general guidelines about the applicability of the combined algorithm.

In this chapter, we proposed different ways of incorporating process information, such as a subset of constraints or the sparsity structure of the whole constraint matrix, into the PCA framework for better model estimates. Sparsity information for a chemical process can be obtained using first principles models, i.e., mass and energy balances. We also demonstrated that the proposed approaches can identify faults more effectively than traditional PCA. In the next chapter, we generalize a first principles derived model to estimate drug solubility in binary systems using machine learning approaches.

CHAPTER 3

Generalization of first principles derived model using machine

learning approaches to predict drug solubility in binary systems

Drug solubility is a major concern in pharmaceutics, for both drug delivery and discovery.

In drug delivery, it is important to achieve the desired concentration of drug in circulation for

achieving a required pharmacological response. Since the drug can reach the receptors through

aqueous media, aqueous soluble drugs are preferred for clinical purposes due to the concern of

oral bioavailability. Solubility also plays a crucial role in discovery and development

investigations. In the last decade, the number of drugs that fail to reach commercialization has been increasing due to low aqueous solubility [45]. Owing to this fact, several approaches have been proposed to increase drug solubility, such as the use of cosolvents [46], Deep Eutectic Solvents (DES) [47], solubilizing agents [48], pharmaceutical salts and co-crystals [49], and various other techniques [50], [51]. Di et al. [52] and Williams et al. [53] highlighted the major challenges faced by low-solubility drugs, such as in vitro and in vivo assessments during the drug discovery and development phases, and suggested some possible remedies.

3.1 Literature survey

Jorgensen and Duffy [54] made an attempt to model the aqueous solubility of drugs as a

linear function of features such as solvent-accessible surface area, number of hydrogen bonds,

etc., which are obtained using Monte Carlo simulations. In a follow-up paper [55], they outlined three different ideas to predict the aqueous solubility of drugs. The first was a group contribution method, where linear regression is performed on the available data to obtain the contribution of each fragment. The second was a linear regression based QSPR approach, where the features are obtained from the chemical structure. The third was a neural network (NN) based approach, which accounts for the non-linear behavior of the features in the QSPR approach. Though the NN-based approach allows identification of non-linear behavior, its drawback is that the internal processing of the NN is not lucid. Ran and Yalkowsky [56] verified the effectiveness of the general solubility equation (GSE), which merely requires the melting point and octanol-water partition coefficient of a drug to predict its aqueous solubility. Delaney

[57] reviewed several approaches to predict aqueous solubility using structural information of

the solute and highlighted the challenges in their applicability. Lusci et al. [58] demonstrated

the benefits of machine learning techniques such as deep learning networks over other state-

of-the-art methods used to predict the aqueous solubility of drugs.

Cosolvency is one of the most feasible solutions for increasing aqueous solubility of drugs

[46]. Cosolvency of non-aqueous solvents is also important during the drug development phase

[59]. The major advantage of mixed solvent systems is that the solvent-solvent interactions in

some compositions may allow more solute to dissolve than the single solvent systems [60].

One of the earliest models to predict drug solubility in water-cosolvent systems was proposed

by Yalkowsky [61]. Drawing inspiration from the thermodynamic mixing model, Acree [62]

derived a mathematical model to predict solute solubility in binary solvent systems at a fixed

temperature when the solubility values in pure solvents are available. Jouyban and Hanaee [63]

provided a better strategy for regressing the variables specified in the model proposed by Acree

[62] with a no intercept linear model. This mixed solvent model is further extended by Jouyban

and Acree [15] for varying temperatures. Chen and Song [64] proposed a Nonrandom Two-

Liquid Segment Activity Coefficient (NRTL-SAC) model to estimate drug solubility in pure

and mixed solvents using the molecular descriptors such as hydrophobicity, polarity, and

hydrophilicity in a thermodynamic framework. Mullins et al. [65] proposed COSMO-based

(Conductor-like Screening Models) thermodynamic models to predict solubility in both pure

and mixed solvent systems. Sheikholeslamzadeh and Rohani [66] estimated the solubility of

three different solutes in different mixed solvents systems using both experimental and

computational studies and concluded that the NRTL-SAC model performs better than the other

thermodynamic frameworks. Kokitkar and Plocharczyk [67] applied the NRTL-SAC model to

identify optimal solvents to support the crystallization process. Shu and Lin [68] improved the

efficacy of COSMO-SAC (segment activity coefficient) model by minimizing the error in the

prediction of solute-solvent interactions using the solubility data in pure solvents. Valavi et al.

[69] extended the NRTL-SAC model by introducing temperature dependent binary interaction

parameters, which results in better prediction of solubility values than the original NRTL-SAC

model.

Jouyban et al. [70] proposed a cosolvency model by consolidating different cosolvency

models proposed as a power series of the volume fraction of the cosolvent but with different

assumptions. Jouyban [59] compared several cosolvency models based on multiple prediction

accuracy criteria such as root mean square error (RMSE), mean percentage deviation (MPD),

etc., and concluded the Jouyban-Acree model [15] to be the most suitable approach for

pharmaceutical purposes. Jouyban et al. [71] made an attempt to generalize the Jouyban-Acree

model with Abraham’s solvent and solute parameters to predict drug solubility in some water-

cosolvent binary systems. This solute generalization approach considers solvents to be fixed,

i.e., one model for each binary solvent system regardless of the solute used. A detailed review of state-of-the-art methods for estimating drug solubility using both experimental and computational studies has been consolidated by Jouyban and Fakhree [72].

QSPR is a mathematical relationship identified between the physical response of a molecule

and its structural information. Structural information is denoted in the form of

descriptors/features which are numerical values associated with the chemical constitution of a

molecule structure ranging from atom counts to topological surface area. QSPR approaches

proved their ability to predict various physical properties of a molecule from its structural

information [73]. Identifying QSPR involves three major steps, i.e., data preparation, data

processing, and model interpretation. Data preparation involves the conversion of chemical

structures into a suitable form to calculate structural feature values. Structural features can be

obtained from derived mathematical models, experimental analysis and various platforms

designed for this purpose such as PaDEL-Descriptor, DRAGON, OpenBabel [74], etc. Data

processing is used for the removal of intercorrelated features and identifying optimal feature

set using any feature selection algorithm such as genetic algorithm, stepwise algorithm, etc.

Feature selection algorithms are intelligent ways of exploring the possible combinations of the overall feature set to obtain the most suitable subset of features [75]. The identified feature subset is then passed to a linear or non-linear modeling tool to obtain the relationship between the features and the response. Interpretation of the model requires knowledge about the behavior of the features and the response [74]. Roy et al. [76] provided an overview of QSPR/QSAR modeling

by consolidating the details of various structural descriptors along with the QSAR applications.

Yousefinejad and Hemmateenejad [77] reviewed various chemometric approaches used for

feature selection and model development for QSPR studies.

Selecting a suitable mathematical tool for identifying linear or non-linear behavior is not a

trivial task. Multiple linear regression (MLR) and partial least squares are efficient methods for capturing linear behavior, whereas neural networks and support vector machines are efficient for non-linear behavior. The disadvantage of neural networks is that the model will not

be available explicitly. Multiple model identification is one of the alternatives to identify non-

linear behavior, using piecewise linear models [78]. A detailed review of the state-of-the-art of

multiple model framework can be obtained in the recent two-part review [79], [80]. Among

several cosolvency models that have been proposed to predict drug solubility, the Jouyban-Acree model [15] draws the greatest attention due to its efficacy in predicting drug solubility in numerous binary solvent systems at different temperatures. Although the model form is unique, the model constants vary for each combination of solute and binary solvent. Correlating drug solubility

to structural features of the solute and both solvents through the Jouyban-Acree model can lead

to a universal model to predict drug solubility in any binary solvent system at any temperature.

In the current work, the QSPR approach is used to correlate drug solubility in binary solvent systems to structural features such as molar refractivity, molecular weight, McGowan volume, etc., of the solute and both solvents using a modified version of the Jouyban-Acree model. A brief

review of various drug solubility prediction methods is provided in section 1, while data

preparation and processing steps are discussed in section 2. For feature selection, a genetic algorithm is used on the selected features to obtain an independent feature set. In section 3, a linear dependency between drug solubility and the identified feature set is assumed and model coefficients are obtained using MLR as well as a weight-based optimization. In section 4, a

piecewise linear dependency of drug solubility on identified features is assumed and model

coefficients are obtained using a modified prediction error based clustering approach. Finally,

this chapter concludes with comments on the efficiency of the proposed models on drug solubility data collected from the literature.

3.2 Data preparation and processing

Experimental drug solubility data of 63 diverse binary systems with varying solutes and solvents is consolidated from various resources. In twenty-five out of sixty-three systems, the

solubility data is obtained from water-cosolvent systems for different solutes, whereas it is

obtained from non-aqueous solvent systems for the remaining. The data contains twenty-seven

different solutes, a majority of which are based on anthracene (twelve binary systems). For

twelve out of these sixty-three systems, solubility data is obtained at two different temperatures,

whereas the data is obtained at a single temperature for the rest of the systems. Temperatures in the collected data vary from 293 K to 308 K. Based on the above observations, the collected data is well distributed over different combinations of solute, solvents and temperature, and hence is suitable for obtaining an acceptable model to predict drug solubility. The experimental data consists of 766 samples, of which 150 belong to pure-solvent solubility estimations. The Jouyban-Acree model is designed in such a way that pure solubility values are always predicted at their experimental values irrespective of the model parameters used; hence, pure solubility samples are not considered for the mixed solubility prediction case studies. The data is further screened such that no data sample considered in this case study has a solute mole fraction less than 0.0001, reducing the sample count to 585.

PaDEL-Descriptor [81] is an open-source software package for calculating descriptors of different categories, ranging from constitutional to electrostatic descriptors. Structure files of all the 49 compounds involved in the data are generated in SMILES (.smi) format using MarvinSketch [82]. The generated structure files are processed using PaDEL-Descriptor to obtain

the structural features. Molar refractivity (AMR), McGowan characteristic volume

(McG_Vol), van der Waals volume (VABC), Molecular weight (MW), sum of atomic

polarizabilities (Apol) and first ionization potentials (Si) of both solvents and solute along with

topological polar surface area (TopoPSA), solvent accessible surface area (SolvAccSA),

excessive molar refraction (MLFER_E), combined polarizability (MLFER_S), overall solute

hydrogen bond acidity (MLFER_A) and basicity (MLFER_BH ) values of solute are selected

to account for the solvent-solvent and solute-solvent interactions [72]. All the 24 descriptors

are of different magnitudes; hence, the descriptors are scaled using their individual mean and standard deviation values (mean-centered scaling). The Jouyban-Acree model is modified by normalizing the temperature term with the room temperature, and this modified form is the basis for further investigation.

$$\ln X_s^T = f_1 \ln X_1^T + f_2 \ln X_2^T + f_1 f_2 \sum_{i=0}^{2} Q_i \,(f_1 - f_2)^i \left(\frac{T}{298}\right) \quad (2.1)$$

where $X_s^T$ is the solubility (in mole fraction) of the solute in the mixed solvent system at temperature $T$, $f_1$ and $f_2$ are the mole fractions of the solvents in the solvent mixture, $X_1^T$ and $X_2^T$ are the solute solubility values (in mole fraction) in the pure solvents at temperature $T$, arranged in increasing order (i.e., the solvent in which the solute solubility is lower is considered solvent 1), and $Q_0$, $Q_1$ and $Q_2$ are the Jouyban-Acree model constants, which depend on the solute and solvents involved in the system. In this work, these model constants are assumed to be linearly dependent on the selected structural features. Assuming $Q_i$ to be a linear function of the selected features, it can be expressed as

$$Q_i = \sum_{j=1}^{N} c_{i,j} F_j; \qquad i = 0, 1, 2 \quad (2.2)$$

where $F_j$ is the $j^{th}$ structural feature value and $N$ is the number of structural features. This makes it possible to generalize the model so that it can predict solubilities for novel systems, i.e., systems that were not used in obtaining the model. We now perform some algebraic manipulations to render the equations in a standard multivariate linear model form. Substituting Equation (2.2) in Equation (2.1):

 2  N  i  T 
ln X sT   f1 ln X 1T  f 2 ln X 2T   f1 f 2     ci , j Fj   f1  f 2    (2.3)
 i 0 j 1   298 
   

Rearranging Equation (2.3), and expand the right-hand side terms as follows:

$$\ln X_s^T - \left(f_1 \ln X_1^T + f_2 \ln X_2^T\right) = f_1 f_2 \left(\frac{T}{298}\right)\left[\sum_{j=1}^{N} c_{0,j} F_j + \sum_{j=1}^{N} c_{1,j} F_j\,(f_1 - f_2) + \sum_{j=1}^{N} c_{2,j} F_j\,(f_1 - f_2)^2\right] \quad (2.4)$$

Converting Equation (2.4) into the form of a multivariate linear model:

$$Y = \ln X_s^T - \left(f_1 \ln X_1^T + f_2 \ln X_2^T\right) = \sum_{j=1}^{N} c_{0,j}\,\alpha_j + \sum_{j=1}^{N} c_{1,j}\,\beta_j + \sum_{j=1}^{N} c_{2,j}\,\gamma_j \quad (2.5)$$

where $\alpha_j = f_1 f_2 \left(\frac{T}{298}\right) F_j$; $\;\beta_j = \alpha_j\,(f_1 - f_2)$; $\;\gamma_j = \alpha_j\,(f_1 - f_2)^2$.

$Y$ is the difference between the actual log solute solubility (mole fraction) in the solvent mixture and the sum of the products of the pure log solubility values with their respective solvent mole fractions. For the 585 data samples considered in this study, once the values of $Y$, $\alpha_j$, $\beta_j$ and $\gamma_j$ are known, any regression technique can be applied to obtain the regression coefficients $c_{i,j}$.
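As an illustration of this regression setup, the following is a minimal MATLAB sketch (not the thesis code) of forming the regressors of Equation (2.5) for a single data sample; f1, f2, T, F, x1, x2 and xs are hypothetical names for the solvent mole fractions, temperature, scaled feature row vector, pure solubilities and mixed solubility:

```matlab
% Minimal sketch: regressors of Equation (2.5) for one sample; stacking the
% rows phi over all samples gives a design matrix for any regression routine.
alpha = (f1 * f2 * T / 298) .* F;                  % alpha_j, with F a 1 x N feature row
beta  = alpha .* (f1 - f2);                        % beta_j
gamma = alpha .* (f1 - f2)^2;                      % gamma_j
phi   = [alpha, beta, gamma];                      % one 1 x 3N row of the design matrix
y     = log(xs) - (f1 * log(x1) + f2 * log(x2));   % response Y of Equation (2.5)
% With Phi (M x 3N) and Yv (M x 1) stacked over all M samples, ordinary
% least squares would give the coefficients c_{i,j}:
% c = Phi \ Yv;
```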

3.3 Feature selection

Feature selection is the process of identifying the best possible combination of features using a predefined metric. The data is divided into 5 equal-size partitions such that each partition contains a minimum of 10% of the data points from all 63 binary systems. Values of $Y$, $\alpha_j$, $\beta_j$ and $\gamma_j$ are calculated using the scaled structural features and the solubility data based on Equation (2.5). A genetic algorithm (GA) is used to select the optimal feature set; features are selected by combining K-fold (K = 5) validation with GA. The feature selection procedure is executed for five folds such that, each time, the data in four different partitions is used to obtain the regression coefficients $c_{i,j}$. The GA variables are binary ($b_j$), i.e., whether a particular feature is selected or not, and the objective is the weight-based RMSE over the whole data. Since log solubility predictions are biased towards low-solubility data samples and general solubility predictions favor high-solubility samples, a weighted objective of both predictions is considered more effective. The weighted objective for the optimization problem can be expressed as follows:

$$\min_{b_j} \left[\, 10\left(\frac{\sum_{n=1}^{N_D}\left(X_n^{orig} - X_n^{pred}\right)^2}{N_D}\right)^{0.5} + \left(\frac{\sum_{n=1}^{N_D}\left(\ln X_n^{orig} - \ln X_n^{pred}\right)^2}{N_D}\right)^{0.5} \right] \quad (2.6)$$

where $N_D$ is the number of data points and $\ln(X_n^{pred})$ can be estimated as follows:

 2  N  i  T 
ln X sT   f1 ln X 1T  f 2 ln X 2T   f1 f 2     ci , j b j Fj   f1  f 2    (2.7)
 i 0 j 1   298 
   

where $b_j$ is a binary decision variable representing whether the $j^{th}$ structural feature is selected or not. The regression coefficients $c_{i,j}$ are obtained using MLR on the training data based on Equation (2.5), using only the features for which $b_j$ is one (a minimal sketch of the weighted objective is given below). After obtaining the optimal binary variable set for each fold using GA, the final set of features is selected by a 60% majority over the folds, i.e., if a particular feature is active in at least 3 out of the 5 folds, it is selected into the final feature set. The feature selection approach explained above is employed on the consolidated drug solubility data (585 data samples) using the inbuilt ga function in MATLAB. The average and best fitness values at each generation for all the folds can be seen in Figure 3.1. The details of the optimal features obtained in each fold, along with different prediction efficacy metrics, are provided in Table 3.1.
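The following is a minimal MATLAB sketch (not the thesis code) of the weighted fitness used in the GA, assuming, per the reconstructed Equation (2.6), that the factor of 10 weights the solubility RMSE against the log-solubility RMSE; x_orig and x_pred are hypothetical vectors of actual and predicted solute mole fractions:

```matlab
% Minimal sketch of the weighted fitness: 10 x RMSE of the solubilities plus
% RMSE of the log solubilities, balancing high- and low-solubility samples.
function f = weightedObjective(x_orig, x_pred)
    rmse_lin = sqrt(mean((x_orig - x_pred).^2));            % favours high-solubility fit
    rmse_log = sqrt(mean((log(x_orig) - log(x_pred)).^2));  % favours low-solubility fit
    f = 10 * rmse_lin + rmse_log;                           % weighted combination
end
```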

In fold 1 and fold 5, the optimal solutions indicate that all features are essential for drug solubility prediction. Fold 2 has the fewest active variables, whereas variable 22 and variable 6 are inactive in fold 3 and fold 4 respectively. It is interesting to note that, in the case of fold 3 and fold 4 (bolded values), leaving out these variables does not change the optimal objective significantly. It can be observed from Figure 3.1 that for fold 1 and fold 5 the best objective value did not change throughout the generations.

Figure 3.1 Generation-wise best and average fitness values for all the folds

Table 3.1 Details of feature selection process using GA

| Fold | Features not active in optimal solution | Weighted objective (Opt. / All) | MPD, solubility (Opt. / All) | R2, solubility (Opt. / All) | R2, log solubility (Opt. / All) |
|---|---|---|---|---|---|
| 1 | - | 1.4963 / 1.4963 | 49.028 / 49.028 | 0.311 / 0.311 | 0.601 / 0.601 |
| 2 | 1, 6, 9, 12, 19, 22, 23, 24 | 1.5549 / 1.8546 | 51.166 / 49.613 | 0.345 / 0.171 | 0.545 / 0.322 |
| 3 | 22 | 1.5272 / 1.5281 | 51.610 / 51.483 | 0.289 / 0.289 | 0.582 / 0.581 |
| 4 | 6 | 1.5094 / 1.5096 | 50.223 / 50.773 | 0.326 / 0.297 | 0.586 / 0.594 |
| 5 | - | 1.4883 / 1.4883 | 50.899 / 50.899 | 0.336 / 0.336 | 0.600 / 0.600 |

* Tuned parameters: ConstraintTolerance and FunctionTolerance set to 0, and an initial guess with all variables set to 1 (i.e., active) was provided. The remaining parameters were left at the MATLAB defaults.

Since a solution with all variables active (the solution coding corresponding to all variables equal to 1) is supplied as the initial guess at the zeroth generation, this means that all variables are important for these two folds. Though GA terminated at different generations for different folds, the reason for termination in all folds is that the optimal solution did not change for 50 consecutive generations. The R2 values of the general and log solubility predictions show that the optimal solutions are more efficient than the models consisting of all features. On the other hand, the MPD values for the models obtained by considering all features are better than the MPD values obtained for the optimal solutions. This is because the MPD metric is sensitive to the predictions of data samples with low solubility values. From the results obtained, it is evident that all features can be retained for further investigation of solubility prediction using the modified Jouyban-Acree model.

3.4 Single model approximations

In this section, a single linear model, i.e., the modified version of the Jouyban-Acree model given in Equation (2.3), is investigated for predicting drug solubility in binary solvent systems through two approaches. The first approach is to obtain model coefficients using ordinary least

squares (OLS) and the second approach is to use a weight-based optimization approach (WBO)

to obtain model coefficients. For this case study, the data is divided into five equal size

partitions such that all partitions contain a minimum of 10% data points from all 63 binary

systems. The efficacy of single model assumptions is validated using K-fold validation, for a

K value of 5. This validation procedure repeats for five (K) times, each time the data in four

(K-1) different partitions are used to obtain model coefficients whereas the data in the

remaining partition is used to test the efficacy of obtained coefficients. Once this procedure is

completed, the average of coefficients over K folds are obtained and reported as final model

coefficients. The weight-based optimization is carried out for the objective specified in

48
Equation (2.8) using quasi-newton algorithm (fminunc solver in MATLAB) with an initial

guess of zero for all variables. The objective of weighted optimization is as follows:

  
 
0.5

  NTrD  X n  X n   n   n 
 NTrD ln X orig  ln X pred 
0.5

2
pred 2

orig

min  10     
 (2.8)
cij  NTrD   n 1 NTrD 
  n 1    
 

where $c_{i,j}$ are the variables of the optimization problem and $N_{TrD}$ is the number of data points in the training dataset. The logarithmic solubility of the solute is estimated using Equation (2.3). Various metrics for testing the efficacy of the model coefficients obtained using both approaches are provided in Table 3.2. The drug solubility values obtained using the model coefficients averaged over all folds in both approaches are plotted along with the log solubility predictions; parity plots for both types of predictions can be seen in Figure 3.2.

Figure 3.2 Parity plots for general and log solubility predictions using both single model
approaches

3.5 Multiple model approximation

The reason for the poor predictions of the linear models might be that a single model cannot capture the behavior of a wide variety of systems. There might be characteristics that group systems together, in which case it might not be possible to identify a single global model for all systems. To test this hypothesis and develop a model of higher fidelity, in this

section, logarithmic solubility of solute is assumed to be piecewise linearly dependent on

structural features i.e. the operating model will be different but linear in different regions of

feature space. Identification of models of this form is popularly referred to as multiple model

learning (MML). Consider the example depicted in Figure 3.3. From the figure, it can be seen

that the output can be characterized using linear relationships that are different in different

regions of input space. Identifying a single linear model throughout the input space will result

in poor approximation. MML has the possibility of improving the prediction accuracy if the

approach automatically identifies the different linearly operating models (four in this case) and

their operating regimes.

Figure 3.3 Multiple linear models underlying different input partitions of the data

A recently proposed MML approach based on prediction error based clustering [78], [83] is explored in this work. To examine the robustness of the piecewise linear models, a

two-layer testing approach is used. The data is divided into five equal size partitions such that

all partitions contain a minimum of 10% data points from all 63 binary systems. The data in

four partitions are used to identify linear models using K-fold validation approach. The leftover

data is set aside as the global test set and used in the second layer testing. In each fold, data in

three (K-1) different partitions are used to obtain multiple linear models and data in the

remaining partition (K-fold test set) is used to test the obtained models in the first layer testing.

Unlike in a single model scenario, predicting the output of a new data point using multiple

models is not trivial. K nearest neighbors (KNN) is the most frequently used testing approach

to find which model should be used to predict the output of a new data point. Since a prediction

error (PE) based clustering approach is used to identify multiple models, a new strategy is

proposed to identify a suitable model to predict the output of a new data sample. The details of

the clustering approach along with the proposed testing strategy are provided below.

Kuppuraj and Rengaswamy [78] proposed a prediction error based fuzzy clustering approach to identify the underlying multiple linear models in any given data. The advantage of prediction error based approaches over traditional Euclidean distance based clustering approaches is that samples are grouped based on their response in the output variable, thus reducing misclassifications at the boundaries. For the sake of brevity, only the steps of the algorithm are provided below; more insights into the PE based algorithm can be obtained from our previous work [78]. The objective of the PE based clustering algorithm for grouping M data samples into N models is as follows:

1 N M q 2
f     ij y j  Ci x j 2 
2 i 1  j 1
(2.9)

51
Where y j , x j are the response (output) and features vector (input variables) of sample j

respectively. Ci represents the model parameters vector of cluster i such that yˆ j  Ci x j .

1. Initialize N models with different parameter values. Each cluster represents a model

with different parameter values.

2. Calculate the prediction error of sample j with respect to cluster i as follows:

PEij  y j  Ci x j (2.10)
2

3. Compute membership of sample j to cluster i as follows:

1 i  1,2, N ; j  1,2, M ;
ij  ; (2.11)
N  PEij 
2
q 1 N  clusters, M  samples
 
k 1  PEkj


4. Update cluster centers (model parameters) as follows:

Cir 1  Cir   r g r (2.12)

It is a line search optimization, where search direction can be estimated as follows:

f  M q 
g r     ij  y j  Cir x j  xTj  (2.13)
Ci  j 1 

The step length of the search can be estimated as follows:

   g x   y  Cir x j 
M N T
q r
ij j j
j 1 i 1
r  (2.14)
   g x   g x 
M N T
q r r
ij j j
j 1 i 1

5. Calculate the new prediction errors.

6. Calculate the root mean square error (RMSE), where $b$ denotes the best-fitting model for sample $j$:

$$RMSE = \sqrt{\frac{\sum_{j=1}^{M} PE_{bj}^2}{M}} \quad (2.15)$$

7. If a termination criterion is met (RMSE below a predefined limit, or the number of iterations exceeds the limit), go to the next step; else go to step 3.

8. Merge like models based on a cosine angle metric and obtain the new model parameters using OLS. Finally, report the final models and input partitions. The cosine angle metric between clusters $i$ and $k$ is

$$\theta_{ik} = \cos^{-1}\left(\frac{C_i C_k^T}{\left\|C_i\right\|\left\|C_k\right\|}\right) \quad (2.16)$$
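As referenced in step 3, the following is a minimal MATLAB sketch (not the thesis code) of the prediction-error and membership computations in steps 2-3; X, y, C and q are hypothetical names for the input matrix (M x p), output vector (M x 1), current model parameters (N x p) and fuzzifier:

```matlab
% Minimal sketch of steps 2-3: prediction errors and fuzzy memberships.
M  = size(X, 1);                        % number of samples
N  = size(C, 1);                        % number of candidate models
PE = zeros(N, M);
for i = 1:N
    PE(i, :) = abs(y' - C(i, :) * X');  % PE_ij = |y_j - C_i x_j|, Equation (2.10)
end
U = zeros(N, M);
for j = 1:M
    r = PE(:, j) .^ (2 / (q - 1));      % Equation (2.11)
    U(:, j) = (1 ./ r) / sum(1 ./ r);   % memberships of sample j to all clusters
end
```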

Testing strategy:

Traditional clustering algorithms operate based on Euclidean distance; hence, KNN is a suitable approach to identify an appropriate model for a new (test) data sample. In KNN, first, the Euclidean distances from the test sample to all samples in the training data are evaluated. Then, the K samples nearest to the test sample and their corresponding models are identified, and the model containing the highest number of samples among these K nearest samples is considered the suitable model for the test sample. The motivation behind prediction error based clustering, however, is to accurately classify data samples that are close in variable space but are characterized by different models; in such cases, finding a suitable model for a test sample using traditional KNN may not be effective. In the proposed strategy, we incorporate the prediction error along with KNN. First, the Euclidean distances from the test sample to all samples in the training data are evaluated. Next, the K nearest samples to the test sample are identified. Then, the prediction errors of these K nearest samples with respect to each model are computed and averaged per model over the K samples. The model with the least averaged prediction error (over the K nearest samples) is considered the most suitable model for the test sample, as sketched below.
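The following is a minimal MATLAB sketch (not the thesis code) of this proposed testing strategy; Xtr, ytr, C, xnew and K are hypothetical names, and implicit expansion (MATLAB R2016b or later) is assumed:

```matlab
% Minimal sketch: pick the model with the least prediction error averaged
% over the K training samples nearest to the test sample.
d         = sqrt(sum((Xtr - xnew).^2, 2));            % Euclidean distances (Mtr x 1)
[~, ord]  = sort(d);
nn        = ord(1:K);                                 % K nearest neighbours
avgPE     = mean(abs(ytr(nn) - Xtr(nn, :) * C'), 1);  % mean PE per model (1 x N)
[~, best] = min(avgPE);                               % index of the suitable model
```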

The proposed testing strategy is illustrated with an example of a single-input single-output system, which comprises two different linear models in the operating region. It can be observed from Figure 3.4 that there exist two models in the plotted data. All the blue data points belong to model 1, which is represented by the blue line, whereas the red data points and red line represent model 2. If the traditional KNN approach with K = 3 is used to obtain the suitable model for a new test sample with input value 5.3 (shown as a black diamond), model 2 (red line) will be selected as the suitable model, since two of the three nearest samples (S1, S2 and S3) belong to model 2. If the proposed testing strategy is used, the prediction errors for the three nearest samples are first calculated using both models. The average prediction error for each model is then evaluated, and the model with the least average prediction error is identified as the suitable model. From Figure 3.4, it can be observed that model 1 has the least mean prediction error; hence, it is the suitable model for the new test sample with input value 5.3.

Drug solubility estimation using multiple models:

The prediction error based clustering algorithm works on the principle that piecewise linear models of the form $\hat{y} = C_i x$ characterize the data in different input regions. Hence, once the partitions are identified, the final model coefficients are obtained using OLS. As discussed earlier, in the case of single model approximations, the linear model obtained using OLS favors low-solubility samples; hence, a weighted objective was incorporated into the PE based clustering to obtain models without bias towards any particular range of solubility samples. In the case of single model approximations, it is also observed that replacing 20% of the data for each new fold results in a significant deviation in the model parameters. In the case of multiple models, these deviations can result in ambiguity when the final set of models is reported. To address these two issues, the PE based clustering approach is modified such that the final models are obtained using a weight-based optimization.

Figure 3.4 Prediction error based KNN strategy to identify a suitable model for a test molecule
Step 1 of the PE algorithm is modified such that the models are initialized with the final model parameters obtained in the previous fold. To incorporate the weighted objective, the final models in step 8 are obtained using optimization instead of OLS. This optimization is carried out using a quasi-Newton algorithm (the fminunc solver in MATLAB) for the objective specified in Equation (2.8). The final parameters obtained in the previous fold are used as the initial guess in the current fold so that the models obtained in all the folds are consistent. This clustering procedure is repeated until the models over all folds converge within a predefined similarity metric. The similarity metric is explained in detail with a simple example. Assume that there exist N piecewise linear models in the data, and that model parameters are obtained in all K folds. First, compute the cosine angle metric given in Equation (2.16) for each model with respect to the models in the remaining folds. For model i in the k-th fold, find the minimum angle with respect to the N models of any other fold, and repeat this until the minimum angles with respect to all the other folds are obtained for that model. Then identify the maximum angle $\theta_{\langle i,k \rangle}$ among the $K-1$ minimum angles obtained for model i of fold k. Repeating the above procedure for all the models in all the folds, the maximum angle among the angles $\theta_{\langle i,k \rangle}$ is defined as the similarity metric ($\theta$), denoted as follows:

$$\theta = \max\, \theta_{\langle i,k \rangle}; \qquad i \in \{1,\ldots,N\};\; k \in \{1,\ldots,K\} \quad (2.17)$$

$$\text{where } \theta_{\langle i,k \rangle} = \max_{k'} \left( \min_{j}\, \theta_{k'}^{\langle i,j \rangle} \right); \qquad j \in \{1,\ldots,N\};\; k' \in \{1,\ldots,K\},\; k' \neq k \quad (2.18)$$
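The following is a minimal MATLAB sketch (not the thesis code) of computing the similarity metric of Equations (2.17)-(2.18); Cfold is a hypothetical cell array holding the N x p model parameter matrix of each fold:

```matlab
% Minimal sketch of Equations (2.17)-(2.18): for each model i of fold k, take
% the minimum cosine angle to the models of every other fold, then the maximum
% of these minima over all models, folds and fold pairs.
angleOf = @(u, v) acosd((u * v') / (norm(u) * norm(v)));   % Equation (2.16), degrees
K = numel(Cfold);  N = size(Cfold{1}, 1);
theta = 0;
for k = 1:K
    for i = 1:N
        for kp = [1:k-1, k+1:K]                            % every other fold
            ang   = arrayfun(@(j) angleOf(Cfold{k}(i,:), Cfold{kp}(j,:)), 1:N);
            theta = max(theta, min(ang));
        end
    end
end
% Convergence is declared when theta falls below the preset tolerance (10 degrees).
```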

The flow chart of the modified PE based algorithm is provided in Figure 3.5. For fold 1 in phase 1, since no previous fold exists, the models are initialized with random parameters, and in step 8 the optimization is initiated with all variables set to zero as the initial guess. The drug solubility prediction is carried out assuming that two piecewise linear models exist in the data, and the fuzzifier q is set to 1.5. The maximum number of iterations for the PE based clustering algorithm is set to 1000, whereas the tolerance for the similarity metric ($\theta$) is set to 10°. The models in each fold are tested on two test data sets, i.e., the K-fold test set and the global test set. The suitable model for a data sample in the test set is identified using the proposed testing strategy with K = 5, i.e., the model with the least averaged prediction error over the five samples nearest to the test sample is considered the most suitable model for that test sample. Similar models are identified across all the folds and are averaged for testing on the global test set.

The data separated out for K-fold validation is associated with the averaged models based

on the prediction errors, so that any new sample can use these data samples as neighbors for

selecting a suitable model. This final pair of models can be considered as the best models to

predict drug solubility in binary solvent systems irrespective of the temperature and

components involved in the system. The drug solubility profiles estimated using these model

coefficients for two different systems at different temperatures are plotted in Figure 3.7.

Solubility profiles estimated using the final models obtained from the single model

approximations are also included in this figure to show the efficacy of multiple models

explicitly. The details of prediction accuracy using various metrics for the models obtained in

the final phase are reported in Table 3.3. It can be observed from the table that the R2 values

for both general and log solubility predictions in all the folds are significantly improved using

multiple model approximations. It is interesting to note that the R2 values for both general and log solubility predictions are similar for all data sets in all the folds, thus showing the effectiveness of the weighted objective approach, i.e., there is no bias towards any category of samples, unlike the OLS approach.

The two-layer testing statistics prove the robustness of the obtained models. The solubility prediction MPD values of the K-fold data and the global test data set using the averaged-parameter (AP) models are significantly low, underlining the fact that this final pair of models can be used for drug solubility prediction in any binary solvent system. It is evident from all three efficacy metrics for the K-fold test data (K-test) that the multiple models underperform in fold 3 (bolded values), though they are still much better than any single-model metric. This can be attributed to a few misclassifications of low-solubility samples in the testing phase; removing these outliers improves the performance of the multiple models.

To benchmark the performance of the multiple model approach with other popular

techniques, we trained a neural network (NN) to identify non-linear behavior. The neural

network was trained using a Levenberg-Marquardt backpropagation training algorithm for 3

different configurations, i.e., hidden layer sizes of 1, 2 and 5. The data was

divided into five equal size partitions such that all partitions contain a minimum of 10% data

points from all 63 binary systems. The data in four partitions are used to train the neural

network, whereas the remaining data used for testing. The network is trained using 'trainlm'

function available in MATLAB neural network toolbox. The efficacy metrics of three different

NN configurations are as follows – MPD values are 35.495, 36.296, and 44.001; R2 metric

values for solubility are 0.597, 0.900, and 0.838; R2 metric values for log solubility are 0.888,

0.814, and 0.884 correspondingly for hidden layer size of 1, 2 and 5. The superior performance

of the NN approach compared to single linear model approaches can be attributed to the ability

of neural networks to identify non-linear behavior. Though the performance of the NN approach is better than the single model approaches, it still significantly underperforms compared to the multiple linear model approach. To test the efficacy of the proposed approach, the

MPD values obtained are compared to the MPD values reported in the existing approaches

[61], [84]–[86] in Table 3.4.

Table 3.2 Various efficacy metrics of obtained models using both single model approaches

OLS model:

| Fold | MPD, solubility (Train / Test) | R2, solubility (Train / Test) | R2, log s (Train / Test) |
|---|---|---|---|
| 1 | 48.800 / 49.937 | 0.346 / 0.129 | 0.615 / 0.538 |
| 2 | 48.518 / 53.991 | 0.199 / 0.083 | 0.651 / -1.07 |
| 3 | 50.210 / 56.573 | 0.288 / 0.286 | 0.603 / 0.496 |
| 4 | 49.685 / 55.125 | 0.291 / 0.309 | 0.603 / 0.556 |
| 5 | 50.659 / 51.860 | 0.350 / 0.229 | 0.621 / 0.518 |
| AP | 49.3347 | 0.3291 | 0.5963 |

Optimization model:

| Fold | MPD, solubility (Train / Test) | R2, solubility (Train / Test) | R2, log s (Train / Test) |
|---|---|---|---|
| 1 | 50.721 / 52.115 | 0.590 / 0.372 | 0.584 / 0.522 |
| 2 | 56.886 / 60.708 | 0.754 / 0.628 | 0.575 / -1.17 |
| 3 | 56.714 / 64.440 | 0.721 / 0.471 | 0.533 / 0.380 |
| 4 | 55.331 / 59.399 | 0.700 / 0.680 | 0.517 / 0.444 |
| 5 | 54.164 / 56.417 | 0.656 / 0.442 | 0.568 / 0.455 |
| AP | 54.3691 | 0.6472 | 0.5384 |

Table 3.3 Various efficacy metrics of multiple models obtained using the modified PE approach

| Fold | MPD, solubility (Train / K-test / G-test) | R2, solubility (Train / K-test / G-test) | R2, log s (Train / K-test / G-test) |
|---|---|---|---|
| 1 | 7.426 / 22.255 / 17.327 | 0.996 / 0.985 / 0.969 | 0.993 / 0.904 / 0.927 |
| 2 | 7.440 / 76.034 / 18.354 | 0.995 / 0.960 / 0.978 | 0.990 / 0.908 / 0.940 |
| 3 | 7.590 / 161.670 / 19.086 | 0.998 / 0.905 / 0.987 | 0.991 / 0.872 / 0.924 |
| 4 | 6.966 / 31.395 / 48.825 | 0.997 / 0.953 / 0.958 | 0.993 / 0.941 / 0.890 |
| AP | 8.070 / 18.113 | 0.9947 / 0.980 | 0.991 / 0.938 |

(The AP rows report the metrics of the averaged-parameter models; in Table 3.3 the two AP values per metric correspond to the K-fold data and the global test set respectively.)

Figure 3.5 Modified PE based clustering algorithm for drug solubility predictions

The reported MPD values of the individual systems include the pure solubility samples for a rational comparison with the literature models. In the case of naproxen in ethanol-water at 298 K, 3 out of the 11 samples have solubility values less than 0.0001, and hence these samples are excluded while reporting the MPD values using the proposed approach. A similar policy was used while reporting the MPD values for naproxen in ethanol-water at 303 K and for propyl p-hydroxybenzoate in PG-water at 300 K. The Yalkowsky equation (model a) is a zero-parameter model, which is a linear combination (in terms of compositions) of the pure solubility values of a drug in both solvents. Models b and c correlate solubility parameters to drug solubility, whereas models d and e (the proposed multiple models) are two QSPR based approaches with different structural features. Systems with high MPD values (>1000 in the case of the Yalkowsky equation) indicate either non-linear solute interactions or the presence of low solubility values (0.0001 to 0.001) in the data of that binary system. It is interesting to note that the MPD values of sulfanilamide in ethanol and water corresponding to models b and c are poorer than those of the Yalkowsky equation. We can conclude from Table 3.4 that, except for acetaminophen (paracetamol) in the ethanol-water system, the MPD values obtained using the proposed approach are significantly better than those of the other models.

Figure 3.6 MPD values of all 63 binary systems obtained using multiple models approach

Table 3.4 MPD metrics of various water + cosolvent systems using several approaches

Solute                     Cosolvent   T(K)   Nd   MPDa     MPDb   MPDc    MPDd   MPDe
Acetaminophen              PG          293    11   115.9    66.9   -       -      22.85
Acetaminophen              PG          303    11   98.37    56.3   -       5.4    2.80
Acetaminophen              Ethanol     298    13   44.78    7.2    18.73   26.8   16.90
Caffeine                   Ethanol     298    11   60.43    41.6   -       21.1   8.22
Caffeine                   Ethanol     308    11   64.27    50.7   -       -      6.06
Naproxen                   Ethanol     298    11   1803.2   23.4   -       -      7.34(8)
Naproxen                   Ethanol     303    11   1788.2   20.5   -       -      7.40(9)
Sulfanilamide              Ethanol     298    12   31.24    43.8   31.95   18.2   6.27
Caffeine                   Dioxane     298    16   60.19    -      21.85   -      3.77
Methyl p-hydroxybenzoate   PG          300    11   920.94   -      18.16   14.4   14.07
Methyl p-aminobenzoate     PG          300    11   669.15   -      15.78   16.5   9.02
Ethyl p-hydroxybenzoate    PG          300    11   1991.0   -      18.71   9.4    9.17
Ethyl p-aminobenzoate      PG          300    11   1416.5   -      9.60    18.2   8.71
Propyl p-hydroxybenzoate   PG          300    11   4739.0   -      19.31   14.1   8.04(10)
Propyl p-aminobenzoate     PG          300    11   4021.8   -      9.62    22.6   4.89

(a) Yalkowsky equation [61]; (b) solubility prediction using partial solubility parameters [84]; (c) solubility prediction using an artificial neural network [85]; (d) Jouyban-Acree model, effect of solute structure [86]; (e) multiple models approach.

Figure 3.7 Solubility profiles of two distinct binary systems at various temperatures

The MPD values of all 63 individual systems (if the data for a system is obtained at two different temperatures, the MPD value is calculated over all the data samples together) are reported in Figure 3.6. The MPD values obtained range from 0.72% to 31.03% with a standard deviation of 6.45%. The minimum MPD value corresponds to system 24 (benzoic acid in CCl4 and n-heptane), whereas the maximum corresponds to system 61 (sulfamethazine in water and ethanol). Only 4 out of 63 systems have MPD values greater than 20%, whereas 51 systems have MPD values less than 10%. These observations demonstrate the ability of the proposed approach to obtain a generalized model for various binary solvent systems.

The solubility profiles shown in Figure 3.7 belong to two different systems obtained at various temperatures. The experimental solubility curves for system 1 (Acetanilide – Dioxane – Water) are irregular owing to the noise in the solubility estimations, whereas the predicted solubility curves are smooth, reflecting the theoretical behavior of the solvent interactions in the system. The solubility profiles for system 2 (Caffeine – Ethyl acetate – Ethanol) show a clear dependency of solubility on temperature, and the proposed multiple model (MM) approach captures this temperature dependency effectively. It is interesting to note that the predictions of the MM approach are very accurate even though the magnitudes of the experimental solubility fractions differ by orders of magnitude between the two systems. It can be concluded from the solubility profiles that the multiple models are far superior to the models available in the literature for the prediction of drug solubilities. This can also be seen in the substantial improvement in the R2 values between these approaches (Tables 3.2 and 3.3).

3.6 Conclusion

In this study, a QSPR based approach using multiple linear models is examined to predict

drug solubility. Drug solubility is assumed to behave in a piecewise linear fashion in different partitions of the structural features. The temperature term in the Jouyban-Acree model is normalized with room temperature to avoid the influence of the temperature magnitude on the model coefficients. For this QSPR approach, various structural features of the solute and both solvents are selected and processed through feature selection using GA. The log solubility values are initially assumed to be linearly dependent on the structural features, and the model coefficients are obtained using OLS and a weight-based optimization approach. The weight-based optimization approach is shown to be better than the OLS approach; however, neither model is of high enough fidelity. The log solubility values are then assumed to be piecewise linearly dependent on the structural features, and the model coefficients are identified using a modified PE based clustering algorithm. A new testing strategy is also proposed to identify a suitable model for test samples in the case of PE based clustering approaches. The prediction efficacy of the final pair of models is tested on a global test set. The MPD and R2 values demonstrate that the final set of models can be used to predict the solubility of drugs irrespective of the solute, the solvents involved and the temperature of the system.

In this chapter, we tested the efficacy of multiple model learning in identifying the non-linear behavior between the feature set and the property of interest, i.e. solubility. Initially, GA is used to identify the significant feature set. It is observed in the feature selection phase, carried out using K-fold validation, that not all the features are significant in all the partitions of the data. A single set of features can be selected as important in a single-model case, whereas doing so in a multiple modeling framework might be suboptimal. In the case of multiple models, where each partition contains a different linear model, it is beneficial to identify the appropriate significant features for each partition. To address this issue, we propose, in the next chapter, modifications to the existing prediction error based clustering approach so as to obtain the number of underlying models and their orders in a single framework.

CHAPTER 4

Prediction error based fuzzy clustering approach using statistical

analysis for piecewise linear model identification

Several engineering systems can be modeled using a multiple model framework, where specific models describe the system in different regions that are defined by input partitions. Though this multiple model framework has been applied under different names in different fields, i.e., linear parameter varying systems, operating regime based models, multiple model estimation, piecewise models, local regression models, etc., the working principles are generally the same [87]. Multiple model learning (MML) is a procedure for estimating the input partitions and the corresponding model parameters in each partition. MML has been applied in numerous areas such as image segmentation in computer vision [88], image processing, pattern recognition [89], financial investing and home insurance [90]. In chemical engineering, MML has been applied to the control of distillation columns [91], fermenters [92], solar power plants [93] and fluid catalytic cracking units [94]. MML problems can be either static or dynamic. Further, models can be segregated based on linearity or non-linearity in the parameters. A recent two-part review [79], [80] provides comprehensive coverage of the multiple model approaches for the modeling and identification of complex systems. The review addresses the different approaches used to identify the input partitions, the internal model structure, and the parameter estimation; it also highlights the challenges in the development of multiple model identification approaches and their application in different fields.

There exist three varieties of linear MML problems, as shown in Figure 4.1. Primary level MML problems are those where the number of true models and the input partitions are known. These problems can be solved by applying ordinary least squares (OLS) to the data points belonging to each partition. Secondary level MML problems are those where the number of true models is known but the input partitions are unknown [95], [96]. These problems can be solved by a two-step approach: first obtaining the input partitions by a suitable clustering technique and then applying OLS to compute the model parameters in each partition. Tertiary and advanced level MML problems are particularly challenging as neither the number of true models nor the input partitions are known [97], [98]. There is, however, some work attempting to solve this class of problems, starting with a sufficiently high number of models and merging them subsequently [97], [98].

[Figure: classification of MML problems into Primary (number of models and input partitions known), Secondary (number of models known, input partitions unknown) and Tertiary (number of models and input partitions unknown).]

Figure 4.1 Multiple Model Learning Problem Classification

In the case of static multiple linear models, since the input variables might significantly influence the output variables only in some partitions of the input space, true models of different orders can exist. While solving static MML problems, the orders of the models are, in general, considered to be known and equal. In the case of dynamic MML problems, predicting the true model orders (ny, nu) is not a trivial task. Hence, to solve such problems, both the number of models and their orders are assumed to be known [97], [98]. In the absence of this information, as would generally be the case, the estimation of both the number of models and their orders becomes a challenging task. In this chapter, we focus on tertiary problems, which we solve using an iterative approach consisting of prediction error based clustering and statistical significance testing, without any assumptions on the number of underlying models or their orders.

4.1 Literature review

In this section, we briefly explain the evolution of clusterwise regression and the various algorithms proposed to solve static and dynamic (PWARX) multiple model problems. Multiple model learning (MML) problems have been in focus for the past few decades. Early investigations of MML were carried out in the form of clusterwise regression [99], [95], [96], [100], [101], [102]. Spath [96] introduced clusterwise regression and proposed an algorithm to obtain input partitions and calculate the corresponding model parameters when the number of models is known. The time complexity of this algorithm was improved in a subsequent article [100]. DeSarbo and Cron [99] proposed a conditional mixture maximum likelihood methodology to solve clusterwise regression problems. Since clusterwise linear regression problems can be treated as combinatorial optimization problems, DeSarbo et al. [95] proposed a simulated annealing based approach to solve them. Hennig [101] developed three types of models for clusterwise linear regression, namely the fixed partition model and finite mixture models with fixed and random regressors. In a follow-up paper [102], the identifiability of the clusterwise parameters was studied. Wedel and Kistemaker [103] applied a generalized clusterwise linear regression method to solve a benefit segmentation problem, using a Monte-Carlo approach to test the significance of the obtained cluster parameters. Frigui and Krishnapuram [88] proposed a robust competitive agglomeration (RCA) based clustering approach to solve multiple model general linear regression (MMGLR) problems; they used the least squares prediction error as the measure for clustering instead of the standard c-means distance. Clusterwise linear regression for a continuous stochastic process was studied by Preda and Saporta [104].

Cherkassky and Ma [105] proposed an iterative framework to solve MML problems assuming that the majority of the data samples are generated by one dominant model. In the first step, a dominant model is estimated from the entire data and the corresponding data points are separated out. Another dominant model is then estimated from the residual data points, and its corresponding data points are separated. This procedure is repeated until a predefined stopping criterion is satisfied. The model parameters are estimated using support vector machine (SVM) based regression. Bezdek et al. [106] generalized fuzzy c-means clustering algorithms to linear models and proposed an iterative approach (fuzzy c-lines) to identify multiple linear models, along with a theoretical proof of convergence. Dufrenois and Hamad [107] proposed a new approach to simultaneously estimate multiple linear models based on support vector regression (SVR); in their formulation, fuzzy weights are assigned to all data points and the weight updates mimic the standard fuzzy c-means algorithm. Elfelly et al. [108] proposed a two-step procedure for the identification of multiple models: in the first step, the number of underlying models is predicted using neural networks with rival penalized competitive learning; in the second step, the model orders and their parameters are estimated using both K-means and fuzzy K-means clustering algorithms. Their approach was applied to two nonlinear data sets as validation exercises.

MML for dynamic systems is of interest in various real-life applications [109], [110]. Most studies in the literature consider piece-wise autoregressive exogenous (PWARX) models as benchmark problems for dynamic MML. Ferrari-Trecate et al. [97] proposed a clustering based algorithm to solve PWARX problems assuming the number of models and their orders are known. The algorithm initially identifies local data sets (LDs), and the parameter vectors of all LDs are then calculated using least squares. K-means clustering is used to group the parameter vectors into as many distinct models as assumed, and bijective maps are used to obtain the input partitions. Nakada et al. [98] proposed a Gaussian mixture model to recognize PWARX models, again assuming the number of models and their orders; the parameters of the identified models are calculated using least squares. Kuppuraj and Rengaswamy [78] proposed a prediction error (PE) based approach to obtain the input partitions and model parameters simultaneously. Support vector classifiers are used in the above studies [78], [98] to obtain the boundary hyperplanes between adjacent clusters. PWARX models can also be identified using lifting techniques [111]–[113]. Rodolfo et al. [114] proposed an approach to identify non-linear dynamics using multiple heterogeneous models (i.e., models of different orders).

The procedure of evaluating clustering results, such as the partitioning of the input variable space and the number of models obtained, is known as cluster validation. For example, while solving well-known classified data sets, if the algorithm can find the number of underlying clusters and classify the data points exactly as per the pre-specified boundaries, then the proposed algorithm is considered robust. Clustering results depend upon the input parameters, the random initialization of the clusters and the clustering approach. Cluster validation is classified into three categories: external, internal and relative validation [115]. External validation compares clustering results with known models, whereas internal validation tests the results against predefined indices. Relative validation compares the results with those of a different clustering scheme. Rendon et al. [116] reviewed different external and internal validity procedures. Wu et al. [117] used a cluster validity index as a fitness measure for offspring evaluation in a genetic algorithm framework to solve the feature selection problem.

Several studies examine clustering performance with statistical testing, but no attempt has been made to use statistical testing in combination with a clustering approach to find the true model orders. In this chapter, the f-test [118] is incorporated into a clustering approach to obtain the true model orders, in an iterative manner, by testing the significance of the parameters of each model. This chapter is organized as follows. While a brief literature review on MML was provided in section 1, in section 2 we propose an iterative clustering method with statistical testing that removes insignificant input variables to obtain the true model orders. In sections 3 and 4, we report the clustering results obtained using the proposed approach on different benchmark problems. In section 5, we demonstrate the efficacy of the proposed approach on two relevant engineering problems. We conclude this chapter with a balanced discussion of the merits and demerits in the last section.

4.2 PE based fuzzy clustering with statistical significance testing

The objective of this work is to estimate the input partitions, true model orders and model parameters for data generated using multiple linear models of different orders. In the first phase, the original data is classified into clusters using the FMC approach [78]. The key difference between standard clustering approaches and the FMC approach is that, while standard clustering approaches use a Euclidean distance metric, the FMC approach uses the prediction error as the distance metric. This makes it possible to adapt clustering approaches to the multiple model learning problem. Viewed another way, while standard clustering approaches work in the data space, the FMC approach works in an abstract parameter space, where the number of clusters equals the number of models that describe the system of interest.

In the second phase of the algorithm proposed in this chapter, a statistical significance test is carried out on the parameters of each model. Insignificant variables are removed, thus reducing the order of that model. This two-step procedure is iterated upon until the clustering procedure with the modified model orders contains only significant variables. Using this approach, it becomes possible to identify multiple models with different orders efficiently through the integration of clustering and statistical testing. This makes it possible to solve MML problems where the number of partitions, the input space partitions, and the models in each of these partitions are not known. We describe the proposed algorithm in detail below.

Phase 1: FMC Algorithm - Clustering for the identification of input partitions and draft models

Summary: FMC estimates different models of the form $y = Cx$ in the input space, where $C$ contains the model parameters in an input partition. In the initialization phase of the algorithm, the dataset is randomly segregated into a pre-specified number (N) of clusters. The initial model parameters are calculated using OLS regression on the data points corresponding to each model. In the clustering phase, the prediction error and the membership value of each data sample with respect to each of the clusters are calculated, and the model parameters of each cluster are updated using an update algorithm. This procedure is repeated until any one of the specified convergence criteria is satisfied. In the model rationalization phase, the similarity between each pair of estimated models is calculated using a cosine-similarity measure. If the angle between any two models is less than a specified threshold (5°), the data points used for the estimation of the two models are combined and new model parameters are estimated. Models with few data points are discarded and their points reassigned to the clusters that suit them best.

1. Initialize N random models with different parameter values.

2. Compute the initial prediction errors using Equation (2.10).

3. Compute the membership values using Equation (2.11).

4. Update the cluster centers using any one of the following update algorithms:

Algorithm I: $C_i^{r+1} = \left( \sum_{j=1}^{M} \mu_{ij}^{q} y_j x_j^T \right) \left( \sum_{j=1}^{M} \mu_{ij}^{q} x_j x_j^T \right)^{-1}$  (3.1)

Algorithm II: $C_i^{r+1} = C_i^r + \lambda^r g^r$  (3.2)

5. Compute the new prediction errors.

6. Calculate the root mean square error using Equation (2.15).

7. Terminate based on a criterion (RMSE less than a predefined limit, or the number of iterations exceeding the limit) and go to the next step; else go to step 3.

8. Merge like models based on the cosine angle metric defined in Equation (2.16) and obtain new model parameters using OLS. Finally, report the final models and input partitions. (A sketch of one clustering iteration is given below.)
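The following MATLAB sketch illustrates one clustering pass (steps 2 to 5) under stated assumptions: the membership update of Equation (2.11) is taken to be the standard fuzzy c-means form with the squared prediction error in place of the Euclidean distance, and Algorithm I (Equation 3.1) is used for the center update; all variable names are illustrative.

% Hedged sketch: one FMC iteration with the Algorithm I update.
% X - M x d regressor matrix (row j is x_j'), y - M x 1 output vector
% C - N x d matrix whose row i holds the parameters C_i, q - fuzzifier
[M, d] = size(X);  N = size(C, 1);
E = (repmat(y', N, 1) - C * X').^2 + eps;     % N x M squared prediction errors
U = zeros(N, M);
for j = 1:M                                   % assumed fuzzy c-means memberships
    U(:, j) = 1 ./ sum((E(:, j) ./ E(:, j)').^(1/(q-1)), 2);
end
for i = 1:N                                   % Algorithm I update, Eq. (3.1)
    w = U(i, :)'.^q;                          % weights mu_ij^q
    C(i, :) = (((X .* w)' * X) \ ((X .* w)' * y))';  % weighted least squares
end
rmse = sqrt(mean(min(E, [], 1)));             % RMSE w.r.t. the best-fit model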

Phase 2: Statistical significance testing of variables (F-test):

The significance of a single variable or of a set of variables can be tested by computing the increase in the regression sum of squares that results from the addition of that set of variables to the existing variables. The test works on the hypothesis that, if the computed $f_0$ value is less than the distribution value $f_{\alpha, r, n-p}$, then that variable set is insignificant. To test this hypothesis, we need to calculate the increment in the regression sum of squares due to the addition of the set of variables to the model, and the residual mean square error of the model.

The increment in the sum of squares added by variable $j$ to the model $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_j X_j + \dots + \beta_k X_k$ is

$SS_R\left(\beta_j \mid \beta_0, \beta_1, \dots, \beta_{j-1}, \beta_{j+1}, \dots, \beta_k\right) = SS_R\left(\beta_0, \dots, \beta_j, \dots, \beta_k\right) - SS_R\left(\beta_0, \dots, \beta_{j-1}, \beta_{j+1}, \dots, \beta_k\right)$  (3.3)

The residual mean square error of the model is

$MS_E = \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{n - p}$  (3.4)

The regression sum of squares with variable $j$ is

$SS_R\left(\beta_0, \beta_1, \dots, \beta_j, \dots, \beta_k\right) = \hat{\beta}^T X^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$  (3.5)

The regression sum of squares without variable $j$ is

$SS_R\left(\beta_0, \dots, \beta_{j-1}, \beta_{j+1}, \dots, \beta_k\right) = \sum_{r=0,\, r \neq j}^{k} \hat{\beta}_r S_{x_r y}$  (3.6)

where $S_{x_r y} = \sum_{i=1}^{n} x_{i,r} y_i - \frac{\left(\sum_{i=1}^{n} x_{i,r}\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$ and $\hat{\beta}$ can be estimated by applying ordinary least squares (OLS) to the model $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_{j-1} X_{j-1} + \beta_{j+1} X_{j+1} + \dots + \beta_k X_k$.

The $f_0$ value can now be estimated as

$f_0 = \frac{SS_R\left(\beta_j \mid \beta_0, \beta_1, \dots, \beta_{j-1}, \beta_{j+1}, \dots, \beta_k\right)}{MS_E}$  (3.7)

If the computed $f_0$ value is more than $f_{\alpha, r, n-p}$, then variable set $j$ is considered significant [118].
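The same test can be carried out by fitting the full and reduced models and comparing their residual sums of squares, since the increment in the regression sum of squares equals the decrease in the residual sum of squares. A MATLAB sketch for a single candidate variable (r = 1) follows; it uses finv from the Statistics and Machine Learning Toolbox, and the function name and interface are illustrative.

% Hedged sketch: partial F-test for variable j via the extra sum of squares.
% Xf - n x p full regressor matrix (intercept column included), y - n x 1
% j  - column index of the variable under test, alpha - significance level
function significant = partialFTest(Xf, y, j, alpha)
    [n, p] = size(Xf);
    bFull = Xf \ y;                           % OLS fit of the full model
    sseFull = sum((y - Xf * bFull).^2);
    Xr = Xf(:, [1:j-1, j+1:p]);               % reduced model without variable j
    bRed = Xr \ y;
    sseRed = sum((y - Xr * bRed).^2);
    ssrInc = sseRed - sseFull;                % SS_R(beta_j | others), Eq. (3.3)
    msE = sseFull / (n - p);                  % residual mean square, Eq. (3.4)
    f0 = ssrInc / msE;                        % Eq. (3.7) with r = 1
    significant = f0 > finv(1 - alpha, 1, n - p);
end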

Iteration:

Phase 1 is re-run with the initial models being those identified after the phase 2 statistical testing. After phase 1, the statistical significance of the new model parameters is tested using the phase 2 computations. If no further revisions are made to the model orders, the MML algorithm stops, and the partitions, the corresponding models and their parameters are the output of the algorithm. If there are further revisions, the iteration continues until model revisions are no longer necessary. A flow chart of the proposed approach, where these two phases are iterated upon, is provided in Figure 4.2.

Further technical details of the phase 1 update algorithms:

Algorithm I is a traditional optimization update of the cluster centers based on the first-order necessary conditions of optimality. The objective of the optimization problem is the fuzzy weighted sum of squared prediction errors, and the first-order necessary condition is that its first derivatives at the optimum are zero:

$\frac{\partial f}{\partial C_i} = \frac{\partial}{\partial C_i}\left( \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} \mu_{ij}^{q} \left\| y_j - C_i x_j \right\|^2 \right) = 0$  (3.8)

After differentiating with respect to $C_i$ and equating the derivative to zero, the optimal solution obtained is:

$C_i = \left( \sum_{j=1}^{M} \mu_{ij}^{q} y_j x_j^T \right) \left( \sum_{j=1}^{M} \mu_{ij}^{q} x_j x_j^T \right)^{-1}$  (3.9)

Algorithm II is a line-search based optimization update that minimizes the sum of squared prediction errors assuming linear models. Line search is a gradient-based optimization in which, at each step, the direction and the magnitude of the next step are evaluated until an optimal solution is reached.

[Flow chart summary: define the algorithmic parameters (number of assumed models, maximum iterations, fuzzifier) and distribute the data randomly into groups of equal size; estimate the model parameters using OLS regression; compute the prediction errors and membership values; update the model parameters using either FMC algorithm and iterate until the RMSE tolerance or the iteration limit is reached; merge clusters of the same order whose mutual angle is less than 5°; discard clusters holding less than 5% of the total data and reassign their points to suitable clusters; conduct an F-test on the model parameters with α = 0.01; if any model has insignificant variables, reduce its order and repeat the procedure; otherwise report the converged model parameters and their respective data members.]

Figure 4.2 Flow chart of PE based clustering using variable significance testing for MML

The gradient at the rth iteration is:

$g^r = \frac{\partial f}{\partial C_i} = -\sum_{j=1}^{M} \mu_{ij}^{q} \left( y_j - C_i^r x_j \right) x_j^T$  (3.10)

The magnitude of the step, i.e. the step length, at the rth iteration can be obtained using:

$\lambda^r = \frac{\sum_{j=1}^{M} \sum_{i=1}^{N} \mu_{ij}^{q} \left( g^r x_j \right)^T \left( y_j - C_i^r x_j \right)}{\sum_{j=1}^{M} \sum_{i=1}^{N} \mu_{ij}^{q} \left( g^r x_j \right)^T \left( g^r x_j \right)}$  (3.11)

The parameters are then updated using $C_i^{r+1} = C_i^r + \lambda^r g^r$.

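A sketch of the Algorithm II update, applied cluster by cluster and following Equations (3.10) and (3.11), is given below; the variable names are illustrative.

% Hedged sketch: Algorithm II line-search update for each cluster.
% X - M x d regressor matrix (row j is x_j'), y - M x 1 output vector
% C - N x d matrix of model parameters, U - N x M memberships, q - fuzzifier
for i = 1:N
    res = y' - C(i, :) * X';                  % residuals y_j - C_i x_j (1 x M)
    w = U(i, :).^q;                           % weights mu_ij^q
    g = -(w .* res) * X;                      % gradient g^r, Eq. (3.10)
    gx = X * g';                              % scalars g^r x_j (M x 1)
    lambda = sum(w' .* gx .* res') / sum(w' .* gx.^2);  % step length, Eq. (3.11)
    C(i, :) = C(i, :) + lambda * g;           % parameter update, Eq. (3.2)
end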
4.3 Efficacy of proposed approach to estimate static multiple linear regression

(SMLR) models

In this section, we demonstrate the performance of the proposed approach on four example problems. In all four examples, models of different orders are considered. The key contribution of the proposed approach is the prediction of different-order models without any assumption about the underlying models except linearity. The following examples shed light on the efficacy of the proposed approach in estimating static multiple linear regression models. The first three examples are multi-input single-output (MISO) systems and the fourth is a multi-input multi-output (MIMO) system. The first example shows the need for statistical testing by comparing the clustering results obtained with and without statistical significance testing; it also tests the ability of the proposed approach to remove insignificant variables, thus reducing the models from a predefined (large) order to the true model order. The second example examines the performance of the current approach as the size of the data sets changes while the input partitions remain the same; the data sets are generated uniformly within the partitions. The third example tests the current approach on data sets with altered data sampling. The fourth example tests the ability of the proposed approach to identify multiple models in the case of MIMO systems. All case studies are solved using the two FMC algorithms with a fuzzifier value of 2. Models of similar order are merged at the end of the FMC algorithm if the angle between the models is less than the threshold (5°).
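The merging test can be computed directly from the converged coefficient vectors; a short sketch, assuming the cosine-similarity measure of Equation (2.16) is the angle between the parameter vectors c1 and c2 (illustrative names):

% Hedged sketch: cosine-angle merging criterion for two converged models.
angleDeg = acosd(dot(c1, c2) / (norm(c1) * norm(c2)));
if angleDeg < 5     % threshold of 5 degrees
    % pool the data points of the two clusters and refit a single model by OLS
end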

4.3.1 SMLR example 1

This problem is synthesized to show the efficacy of the proposed approach in reducing the dimensionality of the models from a high value to the true model orders. In this example, we consider a data set of 1000 samples of 10 variables each. The data consists of three underlying models of true orders 5, 5 and 3 (6, 6 and 4 including intercepts). The model information is provided in Table 4.1. None of the models depends on 'Variable 2', which is added to the data to check the ability of the proposed approach to remove insignificant variables. In this example, most of the data points are generated by a single model, following the assumption of Cherkassky and Ma [105]. Noise generated with a uniform distribution in the range of [-0.1, 0.1] is added to the output variable Y.

Six models of order 10 (11 variables including the intercept) are initialized to solve the current problem with both FMC algorithms. The converged model orders, parameters and the number of data points in each model, with and without statistical testing, are shown in Table 4.1. All significant variables in the true models are boldfaced in Table 4.1; these can be filtered out of the converged models if statistical analysis is used. Without statistical analysis, though the input partitions are accurately obtained using the FMC I algorithm, the converged models contain insignificant variables. Using the FMC II algorithm, the data samples belonging to one true model (C2) converge to two models, M2 and M3. Due to the presence of insignificant variables in models M2 and M3, the similarity measure (the angle between the models) remains higher than the tolerance (5°). If the coefficients corresponding to the insignificant variables became zero, the similarity measure would decrease, resulting in the merging of the models. It can be observed from Table 4.1 that the proposed approach completely removes the insignificant variable (Variable 2) and obtains the true model orders of 5, 5 and 3 from the initialized models of order 10.

4.3.2 SMLR example 2

This example consists of four models of different orders. Since this problem is generated to test the current approach on different data sizes, three cases are tested, with 200, 500 and 1000 total data points respectively. An equal number of data points is generated in each partition (using the model for the chosen partition). Noise generated with a uniform distribution in the range of [-0.1, 0.1] is added to the output variable Y. Six models of order 5 (6 with intercept) are initialized, and both FMC algorithms are tested to identify the input partitions, true model orders and their parameters. The model information and clustering results are reported in Table 4.2. With the increase in data size, the estimated parameters of the models in each partition can be seen to converge towards the true model parameters, which ensures the consistency of the proposed algorithm.

4.3.3 SMLR example 3

This example shows the efficacy of the proposed approach under different data sampling. Though the data has five input variables, for better visualization of the partitions, the data is plotted using only two variables. The example includes three case studies. In the first case study, the data is uniformly distributed throughout the variable space and an equal number of data points is considered for each partition. In the other two case studies, two partitions contain an equal number of data points and the remaining two contain unequal numbers, with one holding almost half of the total data points. In the second case study, the data is randomly generated throughout the partitions, while in the third, the data is concentrated towards the boundaries of the partitions. Euclidean distance-based clustering algorithms will fail in these types of situations. The data partitions for all three case studies are shown in Figure 4.3.

Table 4.1 Original and converged model details of SMLR example 1

Original models:
C1: y = [1.8 0.75 0.4 0.7 0.7 0.6] [X3 X4 X5 X9 X10 1]'  (600 points)
C2: y = [-0.3 2.5 1.9 3 0.9 -2.5] [X5 X6 X7 X8 X10 1]'  (300 points)
C3: y = [-1.2 0.2 2.5 0.1] [X1 X9 X10 1]'  (100 points)

Without statistical analysis:
FMC I:
M1 = [-0.0002 0.0003 1.8003 0.7496 0.4007 -0.0003 0.0005 -0.0002 0.7003 0.6999 0.5997]  (598)
M2 = [0.000 0.000 0.0005 0.0001 -0.3001 2.5001 1.8996 3.0002 -0.0005 0.8999 -2.4991]  (301)
M3 = [-1.1986 0.0011 -0.0006 -0.0003 0.0014 -0.000 -0.0006 0.000 0.1996 2.4741 0.3250]  (101)
FMC II:
M1 = [-0.0007 0.0004 1.8005 0.7499 0.4005 -0.0005 0.0009 0.0001 0.7005 0.6977 0.5887]  (603)
M2 = [-0.0601 0.0587 -0.0293 0.0087 -0.2365 2.4463 1.8407 3.0162 -0.0432 0.5382 -0.3176]  (94)
M3 = [-0.0004 0.000 0.0010 -0.0007 -0.2999 2.5005 1.8996 3.0001 -0.0008 0.9019 -2.5104]  (198)
M4 = [-1.1946 0.0040 -0.0034 -0.0026 0.0023 0.0023 0.0007 -0.0006 0.1999 2.3173 1.7436]  (105)

With statistical analysis:
FMC I:
M1 = [1.8003 0.7496 0.4007 0.7002 0.6999 0.5999]  (598)
M2 = [-0.3001 2.5001 1.8997 3.0002 0.8996 -2.4974]  (301)
M3 = [-1.1985 0.1993 2.4748 0.3208]  (101)
FMC II:
M1 = [1.8005 0.7495 0.4006 0.7003 0.7000 0.6000]  (599)
M2 = [-0.3001 2.5001 1.8997 3.0002 0.8996 -2.4974]  (301)
M3 = [-1.1949 0.1980 2.3295 1.6316]  (100)

(Numbers in parentheses are the number of data points in each model.)

Table 4.2 Original and converged model details of SMLR example 2

Original models:
C1: y = [1.8 -2.5 0.2 -0.7] [X1 X4 X5 1]'
C2: y = [-2.2 0.9 -0.9 0.6 -1.5] [X1 X2 X3 X5 1]'
C3: y = [0.6 -1.3 0.8 0.3 0.5] [X2 X3 X4 X5 1]'
C4: y = [0.6 1.5 0.3 -0.5] [X1 X2 X5 1]'

Converged models (identical with FMC I and II):
Case 1 (200 points):
M1 = [1.8012 -2.4993 0.2165 -0.6429]  (49)
M2 = [-2.2003 0.9002 -0.8978 0.6184 -1.4806]  (48)
M3 = [0.6033 -1.2992 0.7973 0.2768 0.5192]  (52)
M4 = [0.5991 1.4988 0.3102 -0.5410]  (51)
Case 2 (500 points):
M1 = [1.8007 -2.5010 0.2097 -0.6666]  (124)
M2 = [-2.1984 0.8987 -0.8991 0.6121 -1.4828]  (125)
M3 = [0.5996 -1.2995 0.8012 0.2935 0.5125]  (126)
M4 = [0.5975 1.4997 0.3011 -0.5065]  (125)
Case 3 (1000 points):
M1 = [1.7996 -2.5003 0.2074 -0.6728]  (249)
M2 = [-2.1994 0.9004 -0.9001 0.5981 -1.4976]  (253)
M3 = [0.6007 -1.3004 0.7994 0.3070 0.4910]  (250)
M4 = [0.5990 1.5000 0.2971 -0.4912]  (248)

(Numbers in parentheses are the number of data points in each model.)

Table 4.3 Model information of SMLR example 3

C1: y = [1.5 0.7 0.3 -1.5 0.3 -0.9] [X1 X2 X3 X4 X5 1]'
C2: y = [-0.8 0.2 -0.5 2.5 1] [X2 X3 X4 X5 1]'
C3: y = [0.7 1.7 1.9 0.8] [X1 X2 X5 1]'
C4: y = [-3 2.3 -1.5] [X1 X4 1]'

Table 4.4 Converged model details of SMLR example 3 without statistical analysis

Case 1:
FMC I:
M1 = [1.4995 0.6993 0.3000 -1.4980 0.2984 -0.8955]  (248)
M2 = [-0.0012 -0.8005 0.1995 -0.5037 2.4969 1.0468]  (252)
M3 = [0.7007 1.6969 0.0002 0.0015 1.8996 0.7994]  (250)
M4 = [-2.9994 0.0002 -0.0003 2.2978 -0.0061 -1.4424]  (250)
FMC II:
M1 = [1.4993 0.7094 0.3000 -1.4894 0.3418 -1.0579]  (262)
M2 = [-0.0119 -0.8267 0.1741 -0.4679 2.2743 3.0365]  (234)
M3 = [0.6968 1.7010 -0.0024 0.0195 1.8764 0.7242]  (260)
M4 = [-2.9996 0.0006 -0.0002 2.2981 -0.0057 -1.4477]  (244)

Case 2:
FMC I:
M1 = [1.4980 0.6996 0.2995 -1.5045 0.2987 -0.8681]  (210)
M2 = [-0.0001 -0.8020 0.2009 -0.5018 2.5000 1.0127]  (487)
M3 = [0.7007 1.7019 -0.0007 -0.0018 1.8957 0.8181]  (153)
M4 = [-2.9985 -0.0014 -0.0020 2.2898 -0.0029 -1.3822]  (150)
FMC II:
M1 = [1.4964 0.7017 0.3006 -1.5055 0.2828 -0.8476]  (213)
M2 = [-0.0016 -0.8032 0.2011 -0.4981 2.5012 1.0074]  (500)
M3 = [0.7026 1.7030 0.0009 -0.0015 1.8905 0.7996]  (149)
M4 = [-3.0010 -0.0017 -0.0259 2.0221 -0.0889 1.7319]  (138)

Case 3:
FMC I:
M1 = [1.4980 0.6996 0.2991 -1.4992 0.2977 -0.8834]  (211)
M2 = [-0.0002 -0.8018 0.2010 -0.5000 2.4989 1.0126]  (490)
M3 = [0.7007 1.7014 -0.0001 -0.0059 1.8968 0.8514]  (149)
M4 = [-2.9984 -0.0003 -0.0011 2.3039 0.0035 -1.5630]  (150)
FMC II:
M1 = [1.4279 0.6473 0.2462 -1.5946 0.1478 0.5445]  (103)
M2 = [-0.0017 -0.8044 0.2066 -0.5048 2.4858 1.1096]  (518)
M3 = [-3.0459 -0.0662 -0.0429 1.7290 -0.2304 5.8630]  (147)
M4 = [1.4918 0.6652 0.2796 -1.5296 0.3263 -0.3758]  (81)
M5 = [0.7093 1.7055 0.0056 0.0147 1.8751 0.6297]  (151)

(Numbers in parentheses are the number of data points in each model.)

[Figure: three scatter plots of the input data in the X4–X5 plane for Case 1, Case 2 and Case 3.]

Figure 4.3 Partition of input data for SMLR example 3 (* - C1, o - C2, △ - C3, + - C4)

Table 4.5 Converged model details of SMLR example 3 using the proposed approach (with statistical analysis)

Case study FMC algorithm Model No of points


M1 = [1.4995 0.6994 0.3001 -1.4981 0.2983 -0.8969] 247
M2 =[-0.8004 0.1994 -0.5037 2.4966 1.0433] 252
1 I, II
M3 = [0.7007 1.6969 1.8996 0.8121] 250
M4 = [-2.9990 2.2981 -1.4929] 251
M1 = [1.4980 0.6996 0.299 -1.5045 0.2984 -0.8674] 209
M2 = [-0.8020 0.2009 -0.5017 2.5001 1.0114] 489
2 I, II
M3 = [0.7008 1.7019 1.8957 0.7991] 152
M4 = [-2.9985 2.2909 -1.4316] 150
M1 = [1.4980 0.6996 0.2991 -1.4992 0.2977 -0.8834] 211
M2 = [-0.8018 0.2010 -0.5000 2.4989 1.0115] 490
3 I, II
M3 = [0.7006 1.7014 1.8964 0.8026] 149
M4 = [-2.9985 2.3047 -1.5505] 150

In case 1, each partition has 250 data points, whereas in cases 2 and 3, the four partitions have 210, 490, 150 and 150 data points respectively. The model information is provided in Table 4.3. Noise generated with a uniform distribution in the range of [-0.1, 0.1] is added to the output variable Y. All three case studies are first solved using both FMC algorithms without statistical analysis, and the clustering results obtained are reported in Table 4.4. The problem is then solved using the proposed approach by initializing six models of order 5 (6 variables including the intercept), and the corresponding results are reported in Table 4.5. The benefits of the proposed approach are evident from the clustering results. Without statistical analysis, the FMC II algorithm is unable to identify the exact number of underlying models in case 3 and the input partitions in all three cases. Although the input partitions obtained using FMC I are satisfactory, the converged models contain insignificant variables. The proposed approach adequately predicts the input partitions and the respective model parameters using both FMC algorithms, though there is minor misclassification of data points.

4.3.4 SMLR example 4

The next step is to test the performance of the proposed approach on MIMO systems. The chosen example contains multiple outputs in different partitions with models of varying orders, as can be seen in Table 4.6. The differences in model orders for each output within a partition make this a challenging problem. Data points are generated from 3 different models involving 5 input variables, with each model contributing 100 data points. Noise generated using a uniform distribution in the range of [-0.1, 0.1] is added to the output variable Y. This problem is solved using the proposed approach by initializing six models of order 5 (6 including the intercept). Information about the original and converged models is provided in Table 4.6. It is interesting to note that, in the case of MIMO systems, the multiple outputs help improve the clustering efficiency, since each model has a prediction error for each data point that is consolidated from the multiple output predictions. The proposed approach estimates the exact input partitions and model parameters; the minor variations in the model parameters are due to the noise added to the data.

Table 4.6 Original and converged models of SMLR example 4 using both FMC algorithms

Model 1 (100 points):
Original:  y1 = [1.5 -1 0.75 -0.5] [X1 X4 X5 1]';  y2 = [1 0.65 -0.1 1 0.5] [X1 X2 X4 X5 1]'
Converged: y1 = [1.4980 -1.0010 0.7579 -0.4708];  y2 = [0.9980 0.6505 -0.1011 1.0079 0.5294]

Model 2 (100 points):
Original:  y1 = [1.5 -0.6 0.9 1.3] [X2 X3 X5 1]';  y2 = [1 0.75 -0.5 0.75 1] [X1 X2 X3 X5 1]'
Converged: y1 = [1.5018 -0.6017 0.8990 1.2972];  y2 = [1.0008 0.7518 -0.5018 0.7492 0.9975]

Model 3 (100 points):
Original:  y1 = [0.2 2 0.7 0.3 -1.25] [X1 X3 X4 X5 1]';  y2 = [-1.5 2 0.75 1 1.5] [X2 X3 X4 X5 1]'
Converged: y1 = [0.1999 2.0001 0.6979 0.2979 -1.2299];  y2 = [-1.5000 2.0001 0.7479 0.9979 1.5202]

4.4 Efficacy of proposed approach to identify PWARX models

PWARX models are like SMLR models except that the problem is realized in a dynamic setting: the output at a given time depends on the outputs and exogenous inputs at previous times. Thus, the regressors here are the time-lagged outputs and inputs, and the order depends on the number of lagged outputs (ny) and inputs (nu) affecting the current output. In this section, we demonstrate the performance of the proposed approach in identifying piece-wise autoregressive exogenous (PWARX) models. The section is divided into two subsections. In the first subsection, we test the proposed approach on a PWARX problem from the literature [78]. In the second, we test the approach on an example that shows its effectiveness in reducing high assumed model orders to the true orders in the dynamic case. Both examples are SISO systems and contain models with different (ny, nu) orders. These case studies are solved with both FMC algorithms using a fuzzifier (q) value of 2. Models of similar order are merged at the end of each convergent iteration if the angle between the models is less than 5°.

4.4.1 PWARX example 1

This example is adapted from the literature [78] and solved using two initial guesses (two cases). In the first case, the initial guess for both ny and nu is two. In the second case, a higher model order is assumed (both ny and nu as 4). It is shown that both guesses converge to the true model orders. Input data is generated with a uniform distribution in the range of [-10, 10] and is partitioned into four different models. Noise generated with a uniform distribution in [-0.1, 0.1] is added to the output variable. Since the input data is randomized, we could not regenerate the exact number of data points in each model as reported in the literature [78]. The numbers of data points generated in case 1 by models C1, C2, C3, and C4 are 150, 85, 154, and 111 respectively; in case 2, they are 150, 86, 153, and 111 respectively.

As mentioned earlier, in case 1, six models of order five (ny, nu assumed to be 2, plus an intercept) are initialized to solve the problem, while in case 2, six models of order nine (ny, nu assumed to be 4, plus an intercept) are initialized. The details of the true and converged models are given in Table 4.7. Except for a few misclassifications, the proposed approach is able to identify the input partitions, the number of underlying models and their true orders. In the second case, despite the assumption of higher order models, the proposed approach identifies the insignificant variables and converges to the true models. It is evident from this example that, for dynamic case studies, the proposed approach can identify true model orders through initialization with reasonably high model orders.

4.4.2 PWARX example 2

This example consists of four models with different orders, with ny and nu varying from one to three. Input data is generated from a uniform distribution in the range of [-10, 10], and the noise added to the output variable is generated using a uniform distribution in the range of [-0.1, 0.1]. The numbers of data points generated from models C1, C2, C3, and C4 are 128, 108, 127 and 137. This problem is solved using both FMC algorithms by initializing six models of order 7 (ny and nu as 3, plus an intercept). The proposed approach accurately predicts the number of true models and their orders. The original models, the number of data points in each model, and the model parameters obtained using both FMC algorithms are shown in Table 4.7. It can be observed from the results that the proposed approach converges to the true model orders (3, 7, 5 and 4 respectively) from the assumed higher orders (all 7). A close look at the data suggests misclassification of a few data points, a result of the noise added to the output variables at each time instance.

4.4.3 PWARX example with non-linear dynamics

In this study, we consider the piece-wise non-linear dynamic model from Gegundez et al. [119]. The original models contain the input and output variables of the previous time step and their squared terms. The problem is initially solved using OLS, and the single linear model obtained has statistically poor performance (R2 = 0.3529) in identifying the non-linear behavior. Hence, we solve the problem using both FMC algorithms by initializing four linear models (involving the non-linear variables). The original model is simulated for 2000 time samples, of which the first 1600 (240 of M1, 695 of M2 and 665 of M3) are used to identify the multiple models and the remaining samples are used to validate the obtained models. The input (U) and noise (E) are generated using uniform distributions in the ranges of [-4, 4] and [-0.1, 0.1] respectively. For each sample in the test data, a suitable cluster is predicted using the k-nearest neighbors (KNN) approach with a k value of 5. Information about the original and converged models is given in Table 4.8.
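The test-phase cluster assignment mentioned above can be sketched in MATLAB using fitcknn from the Statistics and Machine Learning Toolbox; Xtrain, labels (the converged cluster index of each training sample) and Xtest are illustrative names, and C is taken to be the matrix of converged model parameters.

% Hedged sketch: KNN assignment of test samples to converged models.
knnModel = fitcknn(Xtrain, labels, 'NumNeighbors', 5);
testLabels = predict(knnModel, Xtest);         % model index for each test row
yPred = sum(Xtest .* C(testLabels, :), 2);     % evaluate the selected model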

Table 4.7 Original and converged models of PWARX examples 1 and 2 using both FMC algorithms

PWARX example 1, original model:
Yk = [-0.7 0.7] [Uk-1 1]' + Ek                       if Uk-1 ∈ [-10, -4]
Yk = [-0.4 -0.7 -0.3] [Yk-1 Uk-1 1]' + Ek            if Uk-1 ∈ [-4, 0]
Yk = [0.6 -0.2 -0.3 0.1] [Yk-1 Uk-1 Uk-2 1]' + Ek    if Uk-1 ∈ [0, 6]
Yk = [0.3 -0.7 0.5 0.5] [Yk-1 Yk-2 Uk-1 1]' + Ek     if Uk-1 ∈ [6, 10]

Converged models (case 1, initial orders ny = nu = 2):
M1 = [-0.6958 0.7319]  (148)
M2 = [-0.4015 -0.6986 -0.2963]  (87)
M3 = [0.5997 -0.2029 -0.3008 0.1077]  (151)
M4 = [0.3009 -0.7003 0.4934 0.5505]  (114)

Converged models (case 2, initial orders ny = nu = 4):
M1 = [-0.6958 0.7330]  (147)
M2 = [-0.4002 -0.7027 -0.3095]  (88)
M3 = [0.5994 -0.2023 -0.3008 0.1077]  (151)
M4 = [0.3009 -0.7003 0.4934 0.5505]  (114)

PWARX example 2, original model (φk = [Yk-1 Yk-2 Yk-3 Uk-1 Uk-2 Uk-3 1]'):
Yk = [0.8 0 0 1.5 0 0 -0.7] φk + Ek                  if Uk-1 ∈ [-10, -5]
Yk = [0.7 -0.2 0.4 -0.9 -0.5 0.8 -0.5] φk + Ek       if Uk-1 ∈ [-5, 0]
Yk = [-0.3 0.8 0 -0.4 0.3 0 0.5] φk + Ek             if Uk-1 ∈ [0, 5]
Yk = [0.6 0 0 -0.7 0.8 0 -0.9] φk + Ek               if Uk-1 ∈ [5, 10]

Converged models:
M1 = [0.7999 1.5075 -0.6428]  (128)
M2 = [0.7003 -0.2007 0.4006 -0.9050 -0.4986 0.800 -0.5137]  (110)
M3 = [-0.2998 0.8003 -0.4052 0.2997 0.5179]  (127)
M4 = [0.5995 -0.7023 0.7998 -0.8901]  (135)

(Numbers in parentheses are the number of data points in each converged model.)

The bold parameters (zeros) in the table are identified as insignificant by the proposed approach. The R2 values for the test data set using the two FMC algorithms are 0.949 and 0.924 respectively. The high-magnitude residual errors for a few data points in the test data set are due to the wrong choice of KNN model, which is caused by the misclassification of data points at the boundary. This can be avoided by using alternative validation methods such as the weight-based nearest neighbors approach, the condensed nearest neighbor approach, etc. It can be observed from Table 4.8 that both FMC algorithms are able to identify the non-linear dynamics of the variables adequately using multiple linear models.

4.5 Efficacy of the proposed approach on two real-life case studies

4.5.1 Identification of energy performance of residential buildings

In this case study, eight input parameters, relative compactness (x1), surface area (x2), wall area (x3), roof area (x4), overall height (x5), orientation (x6), glazing area (x7) and glazing area distribution (x8), are used to predict the energy performance of residential buildings characterized by two output variables: heating load (y1) and cooling load (y2). Tsanas and Xifara [120] studied the effect of these input parameters on the output variables using statistical machine learning tools. Glazing area, followed by relative compactness, was identified as the most significant input variable based on importance metrics estimated using random forest modeling. Linear models for the two outputs were built independently using iteratively reweighted least squares.

In our work, 768 samples collected from the UCI machine learning repository [120] are used for this case study. The data is randomly divided into training and testing sets: 614 samples are used to obtain the multiple models, whereas the remaining samples are used to test the performance of the developed models. Initially, the output variables are assumed to be linearly dependent on the input variables and the model parameters are identified using ordinary least squares (OLS). Both the RMSE and R-squared values of the training and testing sets suggest that the prediction accuracy can be further improved using either a suitable non-linear model or piecewise linear models.
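A sketch of this single linear baseline, with illustrative variable names (X is the 768 x 8 feature matrix and Y the two-column response), is given below.

% Hedged sketch: OLS baseline for the energy performance data set.
idx = randperm(768);
tr = idx(1:614);  te = idx(615:end);          % 614 training, 154 test samples
Xa = [X, ones(768, 1)];                       % append an intercept column
theta = Xa(tr, :) \ Y(tr, :);                 % OLS for both outputs at once
Yhat = Xa(te, :) * theta;                     % predictions on the test set
rmseTest = sqrt(mean((Y(te, :) - Yhat).^2));  % per-output test RMSE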

Piecewise linear models are obtained using the proposed algorithm, with the FMC II algorithm, a fuzzifier value of 2 and a maximum of 500 iterations. Models of similar order are merged at the end of the FMC algorithm if the maximum angle between the models is less than the threshold (5°). The model parameters obtained with and without statistical testing, along with the R-squared values, are tabulated in Table 4.9. The K-nearest neighbors approach, with a k value of 5, is used to identify a suitable model for a new sample. It can be observed from Table 4.9 that the piecewise linear models perform better than the single linear models. Though the models obtained without statistical testing perform slightly better than the models obtained with statistical testing, they contain insignificant variables. The bolded values in the case of the multiple models obtained using statistical testing are identified as insignificant; thus, the true orders are identified. It is interesting to note that both the glazing area (x7) and the relative compactness (x1) appear in the models obtained using statistical testing, validating the claim made by Tsanas and Xifara [120]. None of the models obtained using statistical testing contains variables 2, 3 and 4, indicating that these variables do not affect the output variables at any operating condition. However, these variables appear in the models obtained without statistical testing. This case study illustrates the effectiveness of the proposed approach in obtaining models with only significant variables, thus avoiding redundant variables.

4.5.2 Identification of non-isothermal CSTR model dynamics for control

In this case study, the proposed clustering approach is used to identify the dynamics of a non-isothermal continuous stirred tank reactor (CSTR) with an irreversible reaction using PWARX models. The effluent concentration (y1) and temperature (y2) are controlled using the coolant flow rate (u). The process model consists of two nonlinear ordinary differential equations [121]. The differential equations are simulated for 1000 time samples using MATLAB Simulink with the model parameters and initial conditions provided in Nikravesh et al. [121] for 1000 coolant flow rates (ui). The input values are generated at three different operating regions; the input values and the corresponding output values are shown in Figure 4.4. The data is randomly divided into training and testing (local) sets in a ratio of 80:20. The underlying linear models are identified assuming dynamic models of order 2 (ny1 = ny2 = nu = 2).
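The order-2 regressor vectors for this identification can be constructed as in the following sketch; u, y1 and y2 are illustrative names for the simulated coolant flow rate and the two output sequences.

% Hedged sketch: building the order-2 PWARX regressors (ny1 = ny2 = nu = 2).
% y1, y2, u - T x 1 vectors of concentration, temperature and coolant flow
T = numel(u);
k = (3:T)';                                   % first usable time index is 3
Phi = [y1(k-1) y1(k-2) y2(k-1) y2(k-2) u(k-1) u(k-2) ones(numel(k), 1)];
targets = [y1(k) y2(k)];                      % outputs to be predicted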

To compare the performance of the proposed clustering procedure, multiple models are also identified using two recently published clustering approaches proposed by Wang et al. [122], namely local gravitation clustering (LGC) and communication with local agents (CLA). Wang et al. concluded, by testing on different benchmark studies, that both the LGC and CLA algorithms perform better than some well-established clustering approaches such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Along with these clustering approaches, a neural network (NN) approach was also tested to identify the non-linear dynamics. The neural network is trained using the Resilient Backpropagation training algorithm for three configurations, with 1, 5 and 10 hidden layers.

In our approach, the FMC II algorithm is initialized with five models. The LGC and CLA algorithm codes are downloaded from the MATHWORKS file exchange, and cluster identification is performed on the concatenated space of inputs and outputs. Once the clusters are identified, the model parameters are calculated using the OLS approach. In the case of the NN approach, the network is trained using the 'trainrp' function available in the MATLAB neural network toolbox. It can be observed from the plots of yk vs yk-1 in Figure 4.5 that the simulated data inherently consists of three clusters.

Table 4.8 Information of the original and converged non-linear PWARX models for the training data set

Original model, with xk = [Yk-1^2 Uk-1^2 Yk-1 Uk-1 1]':
Yk = M1 = [-0.4 1 0 0 1.5] xk + Ek        if 4Yk-1 - Uk-1 + 10 ≤ 0
Yk = M2 = [-0.3 0.5 0 0.5 -1.7] xk + Ek   if 5Yk-1 - Uk-1 + 6 ≤ 0
Yk = M3 = [0.5 -1 0.3 -0.1 -0.5] xk + Ek  otherwise

Converged models (number of samples in parentheses; R2/RMSE on training data):
FMC I:  [-0.256 0.996 0.018 0 1.759] (245); [-0.311 0.50 0.001 0.50 -1.687] (696); [0.501 -0.999 0.299 -0.1 -0.497] (659);  R2/RMSE = 0.9996 / 0.0574
FMC II: [-0.825 1.0129 -0.053 0 0.702] (251); [-0.321 0.50 0.003 0.50 -1.673] (700); [0.503 -0.999 0.301 -0.1 -0.499] (649);  R2/RMSE = 0.9996 / 0.0615
Single linear model: [-0.391 0.125 0.040 0.284 -1.023] (1600);  R2/RMSE = 0.3529 / 2.3409

Table 4.9 Information of the converged models and the corresponding prediction accuracy metrics. The parameters θ are such that y = θx', with x = [x1, x2, ..., x8]; RMSE and R2 values are reported as [y1; y2] pairs.

Single linear model:
θ = [-20.595 -0.012 0.039 0 5.361 0.017 20.225 0.179;
     -19.437 -0.001 0.020 0 5.654 0.198 14.575 0.017]
RMSE: Train [2.970; 3.214], Test [2.907; 3.306]
R2:   Train [0.912; 0.885], Test [0.922; 0.884]

Multiple models (without statistical testing):
θ1 = [-10.855 -0.005 0.015 0 4.211 -0.031 17.241 0.233;
      -6.371 0.002 0.001 0 4.112 0.118 11.864 0.0176]
θ2 = [-35.683 0.013 -0.006 0 7.583 0.045 20.871 0.007;
      -37.056 0.028 -0.034 0 8.187 0.261 15.681 -0.0818]
RMSE: Train [1.758; 1.800], Test [1.754; 1.921]
R2:   Train [0.969; 0.963], Test [0.971; 0.960]

Multiple models (with statistical testing):
θ1 = [-12.929 0 0 0 4.874 0 17.751 0.187;
      -4.448 0 0 0 4.332 0 13.523 0]
θ2 = [-20.017 0 0 0 6.717 0 22.170 0;
      0 0 0 0 4.837 0.173 12.635 0]
RMSE: Train [1.892; 2.146], Test [1.812; 2.253]
R2:   Train [0.964; 0.949], Test [0.970; 0.946]

Though the identification procedure for the FMC II algorithm is initialized with five models, the proposed approach converges to the true number of underlying models, i.e. three, due to the iterative statistical testing procedure. In contrast, FMC without statistical testing converges to five models. While the LGC and CLA algorithms [122] were able to converge to three models, they were not successful in identifying the true model orders. It is interesting to note that all the models converged using statistical testing for output variable 2 are of the same order, representing a unique dependency over the operating range. It can be observed from the converged models that both output variables are independent of uk-1 throughout the operating range but dependent on uk-2, representing the delayed effect of the input on the output.
To validate the models obtained using different approaches a different set of 400 data

samples (global test) are generated using the same model equations but for a different input

samples set. The inputs and corresponding output values are shown in Figure 4.4. Input is

generated using uniform distribution in the range of [93 113]. K-nearest neighbors approach is

used to identify a suitable model for a new test sample with a K value of 5. Various metrics of

prediction efficiency for training, local test, and global test sets are provided in Table 4.10. It

can be observed from Table 4.10 (bolded values) that the proposed strategy (FMC II with

statistical testing) performs better than the other techniques. Further, the proposed technique

also provides very interesting insights into the process (delayed response, unique dependency,

identification of redundant variables in a multiple learning framework) that the other

techniques do not provide.

The superior performance of the proposed approach is due to the true-order models identified in the training phase. Our technique performs better than the NN in our simulation studies and, further, yields interpretable models. The residual errors of the global test set using the several approaches can be seen in Figure 4.6. It is interesting to note that the NN approach predicts higher values than the original in the case of y1 and lower values in the case of y2, whereas the CLA predictions follow the opposite trend.

Figure 4.4 Simulated data - (a) 1000 data samples (training and testing) (b) Global test set

Figure 4.5 Plots of yk vs yk-1 signifying the 3 inherent clusters in the simulated data for both outputs

Figure 4.6 Residual errors of the global test set using the different approaches

Table 4.10 Details of the prediction accuracy of the different model identification approaches (RMSE and R2 values reported as [y1; y2] pairs)

FMC II (without statistical testing)
RMSE: Train [8.8E-4; 0.087], Local test [9.2E-4; 0.119], Global test [0.005; 1.236]
R2:   Train [0.997; 0.999], Local test [0.997; 0.999], Global test [0.941; 0.940]

FMC II (with statistical testing)
RMSE: Train [0.001; 0.122], Local test [0.001; 0.138], Global test [0.0043; 0.943]
R2:   Train [0.996; 0.999], Local test [0.996; 0.999], Global test [0.953; 0.965]

LGC (Z. Wang et al.)
RMSE: Train [3.4E-4; 0.0652], Local test [4.1E-4; 0.071], Global test [0.0047; 0.9714]
R2:   Train [0.999; 0.999], Local test [0.999; 0.999], Global test [0.946; 0.963]

CLA (Z. Wang et al.)
RMSE: Train [3.3E-4; 0.0695], Local test [3.7E-4; 0.071], Global test [0.0045; 1.003]
R2:   Train [0.999; 0.999], Local test [0.999; 0.999], Global test [0.949; 0.961]

Neural network (Resilient Backpropagation), 1 hidden layer
RMSE: Train [0.002; 0.425], Local test [0.002; 0.416], Global test [0.005; 1.022]
R2:   Train [0.982; 0.991], Local test [0.981; 0.991], Global test [0.938; 0.959]

Neural network (Resilient Backpropagation), 5 hidden layers
RMSE: Train [0.004; 0.328], Local test [0.004; 0.264], Global test [0.010; 1.793]
R2:   Train [0.945; 0.994], Local test [0.946; 0.996], Global test [0.739; 0.874]

Neural network (Resilient Backpropagation), 10 hidden layers
RMSE: Train [0.002; 0.216], Local test [0.002; 0.143], Global test [0.011; 2.416]
R2:   Train [0.991; 0.997], Local test [0.991; 0.998], Global test [0.724; 0.772]

4.6 Conclusion

In this work, we have evaluated prediction error-based FMC approach with statistical

significance testing for both static and dynamic MML problems. We show that statistical

significance testing has an effect in predicting both input partitions and true model orders. In

the case of static MML studies, all models were initialized to a sufficiently high order and showed

convergence to true model orders. A similar observation is made for dynamic MML problems

as well. A key factor to be noted is that the number of models and initial guesses for orders

need to be greater than that of the underlying model. Though it is difficult to confirm this

assumption from data alone, an iterative increase of the model order and the number of models can

ensure convergence. The proposed approach can predict true model information for SMLR

problems of different data sizes and data sampling. It can remove duplicate variables and

identify true model orders. The proposed approach is also shown to be useful in identifying

PWARX models with non-linear dynamics using multiple linear models. The proposed

approach also provides very interesting insights into the process such as identification of true

model orders, delayed response, unique dependency, and redundant variables in a multiple

model learning framework.

In this chapter, we proposed a prediction error based clustering approach with statistical

analysis that can identify underlying models and corresponding significant features set in a

single framework. It is observed that the proposed approach improves the clustering efficiency

and also provides interesting insights into the process. In the next chapter, we use the proposed

approach to obtain a non-linear relationship between structural features derived from first

principles and solvation free energy of Quinone derivatives in a QSPR framework without

using any additional feature selection algorithm.

CHAPTER 5

Prediction of solvation free energy of Quinone derivatives using

machine learning approaches in a QSPR framework

A flow battery (FB) is an electrochemical device in which the electrical energy is derived

from the chemical energy stored in the electrolytes. This chemical energy is converted to

electrical energy during discharge. The electrolytes are circulated through the cell during both

charge and discharge. In general, FBs contain two electrolytes, one to store the active materials

for negative electrode reactions and the other to store active materials for positive electrode

reactions [123]. Electrolyte solutions contain both reduced and oxidized form of reactants in

the same phase, where the relative concentrations of oxidized and reduced forms vary over the

course of charge or discharge. Due to their scalable nature, easy decoupling, and refueling, flow batteries are more efficient than other storage devices [124]. The low energy density values

possessed by various redox chemistries are the major impediments for flow battery

commercialization. Identifying new electrolyte chemistries with reasonable energy densities

can make flow batteries economically viable. Energy densities of existing electrolytes can be

improved by increasing their solubility by selecting suitable solvents. Soloveichik [125]

reviewed various flow battery technologies in detail along with the technical and economic

challenges and possible remedies for the same.

5.1 Literature survey

Quinones have been gaining interest as electrolytes for flow batteries in the past few years due to their ability to transfer two electrons per molecule and their impressive solubility

characteristics, which results in relatively high energy densities compared to other flow battery
chemistries. Quinones also exhibit minimal membrane crossover due to their large molecule

size and can be produced on a large scale with less expenditure compared to vanadium.

Suleyman et al. [126] showed that solubility of Quinones and the reduction potential of

Quinone redox couples can be tuned by substituting various functional groups. They used

Perdew–Burke–Ernzerhof (PBE) based DFT calculations to obtain reduction potential and

solvation free energy values. Quinones can be used as electrolytes on both sides of a flow

battery, hence a rigorous exploration of Quinone derivatives space to identify more efficient

electrolytes is of interest. Quick and robust structure-property relationships are required for

further exploration of Quinones either by substituting with a new set of functional groups or

by substituting with two or more functional groups on a single molecule. These relationships

can be useful for the computationally tractable search of potential molecules in the derivatives

space avoiding computationally expensive DFT simulations.

Group contribution (GC) approaches are well established to estimate a wide variety of

physical and chemical properties ranging from melting point to toxicity of organic molecules

[127]–[131]. These approaches assume that the organic molecules are constructed using

fragments from a predefined set of fragments and the properties of these compounds are

linearly dependent on the occurrences of each fragment. Marrero and Gani[132] proposed an

efficient multilevel group contribution approach, in which the property of interest is initially

regressed with the occurrences of first-order groups. Then the residuals are regressed with the

occurrences of second order groups. Finally, the remaining residuals are related to the

occurrences of third order groups. First order groups are simple functional groups that can form

a molecule structure such that no atom will be counted twice. Second order groups are used to

distinguish between isomers effectively. Third order groups are usually fused and non-fused

rings. Using this multilevel group contribution approach, Marrero and Gani[133] estimated

octanol/water partition coefficient and aqueous solubility of a broad range of compounds

ranging from C3 to C70. Correa et al. [134] proposed Analytical Solutions of Groups (ASOG)

group contribution approach to predict water activities in aqueous electrolytes.

Quantitative structure-activity or structure-property relationships (QSAR/QSPRs) are the

mathematical representations of the functional behavior between the biological activity or

chemical response of a component and its quantifiable structural information. This structural

information is denoted in the form of structural descriptors/features such as atom counts,

surface area, refractivity, etc. QSPR/QSARs are widely used in the fields of molecule design,

predictive toxicity and drug design[74] for identifying various properties of organic molecules

such as flash point[135], vapor pressure, water-air partition coefficients[136], water-octanol

partition coefficients[137], solubility[138] and toxicity[139] etc. Any QSPR/QSAR study

involves three major steps, i.e. calculating structural features (descriptors) for the predefined

molecules set, identifying suitable descriptors and obtaining an efficient association between

structural features and the property of interest. Structural features can be obtained using first

principles, theoretical models and platforms like PaDEL-Descriptor, DRAGON, OpenBabel,

etc.[74], which are specifically designed for the calculation of structural features. Selecting

suitable descriptors and obtaining robust models involve a wide range of chemometrics such

as PCA, regression tools and neural networks, etc. Yousefinejad and Hemmateenejad [77]

consolidated various chemometric methods used in both feature selection and model

development phases of QSPR studies. Once a robust QSPR is identified, the structural features

for a specified objective can be obtained in an inverse QSPR framework [140].

Various kinds of feature selection algorithms have been proposed in the literature, i.e. classical methods

such as forward selection and backward selection [141], artificial intelligence based methods

such as genetic algorithm (GA) [117], particle swarm optimization (PSO) based approaches

[142] and dimensionality reduction based approaches such as principal component analysis etc.

Forward selection approach starts with zero descriptors and in each step, one new descriptor is

added based on predefined criteria until a stopping criterion is satisfied. Backward selection starts with the complete set of descriptors and, in each step, one descriptor is removed based on predefined criteria until a stopping criterion is satisfied. In stepwise

selection, a combination of both forward and backward selection at each step is shown to be

more robust. GA and PSO frameworks formulate feature selection as an optimization problem

with binary variables i.e. each variable corresponds to the decision of whether a feature should

be considered or not. Principal component based approaches obtain a few linear combinations of

original descriptors, which can explain maximum variability in the data.

In the era of machine learning, due to the availability of a wide range of modeling

techniques, selecting a suitable modeling method is also crucial to obtain a robust structure

property relationship. Each modeling technique has its own advantages and disadvantages.

Multivariate linear regression can be used if the dependency of the property of interest on

selected features is anticipated to be linear [143]. Principal component regression [144] can be explored when there are correlations among the inputs, while polynomial regression and artificial neural networks (ANN) [145] can be explored when a nonlinear relationship exists between the selected features and the property of interest. Though ANN models can fit very complex nonlinear behavior,

interpretability and overfitting are major issues. Piecewise linear models have been shown to

mimic nonlinear behavior using piecewise linear assumptions [146]. In our earlier work [147],

piecewise linear models were used to fit the non-linear behavior in order to obtain a robust

QSPR to predict drug solubility in binary systems. We proposed a prediction error based

clustering approach in our previous work [83], which can identify the significant features as

well as operating models in a single framework.

In this work, initially, a group contribution based approach is employed to correlate the

structure of Quinone derivatives to their solvation free energy values provided in the

literature[126]. Later, various QSPR based approaches are used to correlate structural features

of Quinone derivatives to their solvation free energy values. A brief overview of group

contribution approaches, QSPR approaches, and useful chemometric approaches is provided

in this section. In the following section, a problem specific group contribution approach is

described to obtain the solvation free energy. In section 3, three different QSPR approaches i.e.

linear, neural network and piecewise linear models are employed to obtain a robust QSPR.

Finally, this chapter concludes with comments and discussions on the efficiency of the

proposed approaches.

5.2 Group contribution approach

Group contribution (GC) approaches assume that the property of interest of a compound is

a function of a predefined set of structural fragments and it is computed by summing the

frequency of each group occurring in the molecule times its contribution [132]. The group

contribution framework to obtain the properties of interest of organic molecules is shown in

Figure 5.1. GC approach involves two steps, initially, the occurrences of each fragment are

counted and then a linear relationship is obtained between the property of interest and the

occurrences of each fragment to obtain the contribution of each fragment. The contributions

(𝐶) of all fragments are calculated using equation 1, where 𝑓(𝑋) is the property of interest of

molecule 𝑋, 𝑁𝑖 is the number of times fragment 𝑖 occurred in molecule 𝑋 and 𝐶𝑖 is the

contribution of fragment 𝑖. Now, for any new test molecule, the occurrences of each fragment

are evaluated and substituted in the linear relationship (in equation 1) to obtain the property of

interest of the test molecule.

f(X) = \sum_{i=1}^{n} N_i C_i \qquad (4.1)

In this case study, data set [126] for the solvation free energy estimation includes three

variants of Quinones i.e. benzoquinone, naphthoquinone, and anthraquinone substituted with

18 functional groups. To differentiate the three variants of Quinones and the 18 functional

groups, we considered 41 different types of fragments, which are specific for this case study.

38 out of these 41 fragments are first-order groups, whereas the remaining 3 groups belong to

the second-order, which are useful to differentiate between the Quinone types. The data set

contains 407 data samples. Initially, the data is randomly divided into ‘model’ and ‘global test’

data sets with 80% and 20% of data samples respectively. Model data set is used to obtain

contributions of each group (i.e. model parameters) in association with K-fold (K as 5)

validation approach. In each run, the model data set is again randomly divided into K-equal

partitions and each time data in K-1 folds are used to train the model and the remaining to test.

This procedure is repeated for 100 random runs and the model parameters (contributions of all

groups) are averaged and reported as the final set of parameters in Table 5.1 along with the

performance metrics.

Figure 5.1 Group contribution approach framework

Table 5.1 All 41 groups that are considered for the case study along with contributions

Group          Contribution   Group          Contribution   Group                          Contribution
aC-H           -0.8263        aC-COOH        -21.1883       C-CHO                          -6.7621
aC-N(CH3)2     -1.127         aC-PO3H2       -28.4856       C-COOCH3                       -6.3738
aC-NH2         -14.5622       aC-SO3H        -14.5847       C-CF3                           2.7074
aC-OCH3         0.8025        aC-NO2         -0.9337        C-CN                           -9.4303
aC-OH          -11.5993       C-H            -2.7929        C-COOH                         -22.5123
aC-SH           0.0749        C-N(CH3)2      -6.6628        C-PO3H2                        -32.2944
aC-CH3          0.6223        C-NH2          -13.747        C-SO3H                         -20.3163
aC-SiH3         3.7311        C-OCH3          3.3567        C-NO2                           0.0017
aC-F            3.7044        C-OH           -10.9846       both C=O on different rings    -22.2017
aC-Cl           2.7168        C-SH            0.5817        both C=O side by side
aC-C2H3        -0.5199        C-CH3          -4.2883          of the same ring             -27.1104
aC-CHO         -7.066         C-SiH3          1.5231        both C=O on opposite
aC-COOCH3      -5.1764        C-F             1.4416          sides of the same ring       -16.4325
aC-CF3          2.3128        C-Cl           -10.7412
aC-CN          -11.117        C-C2H3         -1.6885

It can be observed from the contribution values in Table 5.1 that substituting with PO3H2 can increase the solubility (low solvation free energy), followed by COOH, SO3H, and NH2, as suggested in the literature [126]. It is also interesting to note from the second-order functional group contributions in Table 5.1 that having two C=O groups side by side in a ring can increase the solubility more than having two C=O groups opposite to each other. This can be validated by comparing the solvation free energy values of the 1,2-BQ, 1,2-NQ and 1,2-AQ variants (i.e. substituted with the functional groups) with the 1,4-BQ, 1,4-NQ and 1,4-AQ variants respectively. The performance metrics of the GC approach for estimating solvation free energy can be obtained from Table 5.4. Though considering more fragments to differentiate isomers effectively can improve the performance of the GC approach, the size of the data is an impediment in this case study. The major setback of GC approaches is that the property of interest cannot be evaluated for a new molecule that contains fragments not included in the training set.

5.3 QSPR based approaches

Identifying QSPR consists of three phases, i.e., data generation, feature selection, and model

prediction. In the data generation phase, chemical structures are converted into an accessible

form such as .mol, .smi, etc. to calculate structural feature values. Structural features or

descriptors can be obtained from first principles models[148], experimental methods and

several platforms designed for estimating structural features such as MOPAC, OpenBabel, and

PaDEL-Descriptor [74], etc. Feature selection involves both domain and mathematical

knowledge to identify the significant and independent features set that affects the property of

interest. Feature selection algorithms such as forward selection, backward selection, stepwise

regression, and evolutionary optimization approaches are mathematical ways of exploring the

most suitable feature subset to reduce the model complexity, thus avoiding overfitting of

models[75]. The model prediction is the process of identifying a robust model between the

significant features set and the property of interest. QSPR framework to estimate any end

property of organic molecules is depicted in Figure 5.2.

[Figure 5.2 flowchart: Start -> Calculate the descriptors using available tools (MOPAC, CODESSA PRO, PaDEL-Descriptor) -> Select independent significant features (genetic algorithm, stepwise algorithm, PCA) -> Obtain the relationship using available tools (linear regression, non-linear regression, neural networks) -> Stop]

Figure 5.2 QSPR framework to estimate the property of interest of organic molecules

PaDEL-Descriptor [81] is an openly available software to compute various kinds of

structural features of molecules varying from topological parameters to chemical fingerprints.

In this case study, for QSPR estimation, the solvation free energy data of 407 Quinone variants

provided in the literature [126] is used. Structure files of all 407 Quinone variants are generated

in smiles (.smi) format and processed to obtain the structural features using PaDEL-Descriptor.

McGowan characteristic volume (McG_Vol), Molecular weight (MW), Van der Waals volume

(VABC), first ionization potentials (Si), sum of atomic polarizabilities (Apol), solvent

accessible surface area (TSA), topological polar surface area (TopoPSA), combined

polarizability (MLFER_S), excessive molar refraction (MLFER_E), Molar refractivity

(AMR), overall hydrogen bond basicity (MLFER_BH) and acidity (MLFER_A) values of the

molecule are found to affect solvation free energy values [72], [147], [149]; hence these are

considered as structural features for this case study. Since the structural features can be of

different magnitudes, to avoid the influence of any particular variable on the model parameters,

features are scaled individually by mean centric scaling using the mean and standard deviation

of a particular feature.
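For concreteness, this scaling amounts to the following lines (a sketch; the feature matrix here is a placeholder):

```python
import numpy as np

X = np.random.rand(407, 12)  # placeholder: 407 molecules x 12 structural features
# Mean centric scaling: each feature is centred by its mean and divided by its
# standard deviation, so that no single feature dominates the model parameters.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```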

5.3.1 Single linear model based QSPR

In this case study, initially, we identify the significant features using K-fold validation (K value

as 10) in conjunction with F-test. For feature selection, a linear relationship is assumed between

the 12 descriptors set and solvation free energy. The model parameters obtained in each fold

are averaged and each parameter is tested with F-test to check if it is significant or not. It is

identified that out of 12 variables, 7 variables i.e. McG_Vol, MLFER_A, MLFER_BH, MW,

MLFER_S, MLFER_E, and TopoPSA are significant. Later, a linear relationship is obtained

between the above identified significant features set and solvation free energy using ordinary

least squares associated with K-fold validation (K value as 5).

Initially, the data is randomized and divided into ‘model’ and ‘global test’ data sets with

80% and 20% of data samples respectively. Data in ‘model’ data set is randomized and equally

divided into K-partitions and each time data in K-1 partitions are used to train the model and

the trained model is tested on remaining data. Model coefficients obtained in all K-folds are

averaged for 100 random runs and reported as final model parameters. Performance metrics of

linear relationship obtained on the model data set, global test set and overall data set can be

obtained from Table 5.4. The poor performance (R2 value of 0.6395) of the single linear model suggests that the linear behavior assumption may not be valid; hence, a non-linear model can be anticipated to increase the prediction accuracy. In the following subsections, neural network and

piecewise linear based models are tested to obtain robust non-linear models.
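A minimal sketch of the significance screening used above is given below; the exact F-test statistic used in the thesis is not reproduced in this section, so a standard partial F-test (extra sum of squares) on each descriptor is assumed, with hypothetical array names.

```python
import numpy as np
from scipy import stats

def f_test_significance(X, y, alpha=0.05):
    """Flag each column of X as significant or not via a partial F-test."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss_full = np.sum((y - X @ beta) ** 2)
    significant = []
    for j in range(p):
        X_red = np.delete(X, j, axis=1)          # refit without descriptor j
        beta_red, *_ = np.linalg.lstsq(X_red, y, rcond=None)
        rss_red = np.sum((y - X_red @ beta_red) ** 2)
        F = (rss_red - rss_full) / (rss_full / (n - p))
        significant.append(F > stats.f.ppf(1 - alpha, 1, n - p))
    return np.array(significant)
```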

5.3.2 Neural network based QSPR

Neural networks are highly recommended to mimic non-linear behavior due to their ability

to capture complex functions. Jalali-Heravi et al.[150] concluded that the Levenberg-

Marquardt algorithm is more suitable for QSPR prediction compared to other training

approaches such as backpropagation and conjugate gradient algorithms. In this case study, a

Levenberg-Marquardt algorithm based neural network is trained to identify the relationship

between structural features and solvation free energy. In this case study, to identify significant

features, a backward stepwise approach[151] is used. In this approach, if n variables have to

be ranked, initially n networks with n-1 different variables have to be trained on training data

set. The nth missed out variable for which the network results in the largest error on test data

set is considered to be the most important. To identify the next most significant variable, the

current important variable is removed and the above procedure is repeated. This procedure is

continued until all n variables or the first m (<n) important variables get individual rankings.

In this case study, feature selection is carried on 12 input variables (structural features) with 1

hidden layer architecture. The ranking procedure described above is repeated for 100 random

runs and the consolidated rankings are reported in Table 5.2.
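The ranking loop can be sketched as follows; fit_predict is a hypothetical callable that trains a fresh network on the given columns and returns predictions on the test inputs (any regressor with the same interface would work).

```python
import numpy as np

def backward_stepwise_rank(X_tr, y_tr, X_te, y_te, fit_predict):
    """Rank features, most important first, by the leave-one-out scheme above."""
    remaining = list(range(X_tr.shape[1]))
    ranking = []
    while len(remaining) > 1:
        errors = {}
        for j in remaining:                      # train with variable j left out
            cols = [c for c in remaining if c != j]
            pred = fit_predict(X_tr[:, cols], y_tr, X_te[:, cols])
            errors[j] = np.sqrt(np.mean((y_te - pred) ** 2))
        worst = max(errors, key=errors.get)      # largest test error when missing
        ranking.append(worst)                    # => currently most important
        remaining.remove(worst)
    ranking.extend(remaining)
    return ranking
```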

The first seven important variables are used to obtain the final network to estimate solvation

free energy. Data is randomly divided into ‘model’ and ‘global test’ data sets with 80% and

20% of data samples respectively. Initially, to obtain optimal network architecture, networks

with hidden layer sizes ranging from 1 to 7 are tested for 10 random runs. In each run, a neural

network is trained for all 7 configurations on the model data set and tested on testing data set.

The adjusted root mean squared error values of all 7 networks are averaged for all 10 runs and

the network architecture with the least averaged error on the test data set is assumed to be the

optimal network. The mean adjusted RMSE values of networks with 1 to 7 hidden layers on

test data set are 30.2237, 30.8945, 29.0444, 30.1465, 33.3646, 36.7107 and 44.5868

respectively, hence the network with 3 hidden layers is considered to be optimal. Once the

optimal network architecture is obtained, then a neural network with a similar architecture is

trained on model data set for 100 random runs. The network with the least root mean squared

error on the test data set is considered to be the final network to predict solvation free energy.

Prediction accuracy of the final neural network model on model data set, test data set and on

overall data is given in Table 5.4. It can be observed from Table 5.4 that neural network based

QSPR performs better than the single linear model based QSPR due to the ability of NN to

mimic the non-linear behavior.

Table 5.2 Features ranking obtained using a stepwise approach for NN-QSPR

Variable Rank Variable Rank Variable Rank Variable Rank

MLFER_A 1 MLFER_E 4 AMR 7 Apol 10

TopoPSA 2 TSA 5 MW 8 Si 11

MLFER_BH 3 MLFER_S 6 McG_Vol 9 VABC 12

5.3.3 Multiple model based QSPR

Piecewise linear models have been shown to identify non-linear behavior[79], [80] and are also

easily interpretable. Identifying piecewise linear models and their operating regions is known

as multiple model learning. In this case study, we assume solvation free energy values are

linearly dependent on the structural features with different hyperplanes in different regions. A

fuzzy clustering approach based on prediction error is used to obtain the operating models [83].

The major advantage of this approach is that both feature selection and model identification

are included in a single framework. It is interesting to note that in different operating regions

different structural features can be significant. The data is randomly divided into ‘model’ and

‘global test’ data sets with 80% and 20% of data samples respectively. In this approach,

initially, the number of underlying models and their true orders (i.e. significant features in each

operating region) are identified using the clustering approach [83] on model data set. Later, the

clustering procedure is initiated with true models and their orders associated with K-fold (K

value as 4) validation on model data set to obtain the final model parameters.

Details of the PE based multiple model clustering approach using statistical analysis [83] are given below (a brief sketch of the core update steps follows the list):

1. Initially, randomly generate N vectors of predefined orders with different parameter

values. Each vector with a different set of parameter values denotes a different cluster.

2. Obtain the prediction error of sample j with respect to each randomly generated cluster

i using Equation (2.10)

3. Calculate fuzzy membership of sample j with respect to cluster i using Equation (2.11)

4. Update the model parameters (cluster centers) using the gradient descent algorithm

given in Equation (2.12)

5. Calculate the prediction errors for all samples with respect to the updated models

6. Compute root mean square error using Equation (2.15)

7. Terminate if any pre-specified criterion is satisfied (the number of iterations exceeds the limit or the RMSE is less than the predefined limit) and go to the next step; else go to step 3

8. Assign each data sample to respective clusters based on prediction errors

9. Calculate the cosine angle between each model to the others using Equation (2.16) and

merge like models

10. The models that have fewer data samples (<0.05M) are discarded and the data points

are reassigned to models that fit them best

11. Calculate the final model parameters using ordinary least squares (OLS).

Once the final set of models is obtained at the end of an iteration (steps 1 to 11), each model is tested with the F-test [118] to identify whether a particular variable is significant or not.

12. Each model is tested using F-test, and if any variable in a particular model is identified

as insignificant then that variable will be removed thus reducing the model order.

13. If any of the models contain insignificant variables then the whole clustering approach

is restarted with a new number of models and their orders i.e. go to step 1 with modified

‘N’ and their individual orders, else report the final model orders and parameters.
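Since Equations (2.10)-(2.12) and (2.15)-(2.16) are defined in an earlier chapter and are not reproduced here, the sketch below only conveys the shape of steps 2-4: it assumes an FCM-style membership computed from squared prediction errors and a plain gradient step on the membership-weighted least squares objective; all names are illustrative.

```python
import numpy as np

def pe_fuzzy_update(X, y, theta, m=2.0, lr=1e-3):
    """One pass of steps 2-4: errors, memberships, gradient update."""
    N = theta.shape[0]
    # Squared prediction error of every sample against every model (step 2).
    err = np.stack([(y - X @ theta[i]) ** 2 for i in range(N)])
    err = np.maximum(err, 1e-12)
    # FCM-style fuzzy membership: small error => large membership (step 3).
    inv = err ** (-1.0 / (m - 1.0))
    mu = inv / inv.sum(axis=0, keepdims=True)
    # Gradient descent on the membership-weighted least squares loss (step 4).
    for i in range(N):
        grad = -2.0 * X.T @ ((mu[i] ** m) * (y - X @ theta[i]))
        theta[i] -= lr * grad
    return theta, mu
```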

In this study, the multiple model learning approach contains two stages. In the first stage, to

obtain the true number of models and their individual orders, the clustering approach described

above is used on the model data set. In the first iteration, in step 1, five models are initiated

with 13 input variables (12 scaled structural features and intercept) with random model

parameters. From the second iteration onwards, the models are initiated with the final set of

models that are obtained in the previous iteration. Once the true models and their corresponding

model orders are obtained, in the second stage, to obtain robust model parameters we use a K-

fold (K value as 5) validation based approach. Model data set is equally divided into K equal

partitions. In each fold, data in K-1 partitions are used to build the model and remaining data

to test obtained models. In K-fold validation, switching 20% of data samples each time for a

new fold results in a substantial deviation in model parameters. Hence, to obtain robust multiple

models, an iterative weight based optimization approach is used [147], in which the models

information in the previous fold is included in form of weight-based objective for the next fold.

In this approach, the final set of models in step 11 are obtained using a weight based

optimization approach for the objective specified in Equation (4.2) with a λ value of 10 and the final model parameters (C_prev) obtained in the previous fold as an initial guess. In the case of the

first fold of the first phase, both in step 1 (initiation of the models) and in step 11 (for the

optimization problem) the model parameters obtained from the first stage (i.e. identification of

the true number of models and model orders) are used. This procedure is repeated until the

respective models in all the folds are relatively close, which can be measured using the

similarity metric (φ) proposed in our earlier work [147]. In the second stage, models are neither merged nor discarded (i.e. steps 9 and 10) and statistical testing (i.e. steps 12 and 13) is also avoided, since the clustering procedure in this stage is assumed to start with the true number of models and their orders.

Weight based objective:   \min_{C} \left[ \sum_{i=1}^{N_{tr}} \left( y_i - y_i^{pred} \right)^2 + \lambda \left\| C - C_{prev} \right\|^2 \right]    (4.2)

Similarity metric:   \phi = \max\left( \phi_{i,k} \right); \; i \in N; \; k \in K    (4.3)

where   \phi_{i,k} = \max\left( \min\left( \delta(i,j)_{k'} \right) \right); \; j \in N; \; k' \in K; \; k' \neq k    (4.4)

The pseudo code for the identification of multiple model parameters (second stage) is as follows:

Do while:
    For fold in 1 to K:
        Divide the whole data into the respective training and testing data sets
        Initialize the clusters with the final model parameters provided in the previous fold
        Follow the clustering procedure provided above from step 2 to step 8
        Obtain the final model parameters using the weight based optimization approach for the
        objective specified in Equation (4.2), with the final model parameters obtained in the
        previous fold as the initial guess
    End
    Obtain the similarity metric of the multiple models obtained in all the folds
    If the similarity metric is smaller than the tolerance, or larger than the similarity metric in the
    previous iteration, then terminate and report the model parameters averaged over all the
    K-folds as the final set of model parameters; else, continue
End
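A sketch of the weight based update of Equation (4.2) for a single model is shown below; scipy's general purpose minimizer stands in for whatever solver is actually used, and λ is set to 10 as in the text.

```python
import numpy as np
from scipy.optimize import minimize

def fit_with_weight_penalty(X, y, C_prev, lam=10.0):
    """Minimize Equation (4.2): squared prediction error plus a penalty that
    keeps the new parameters close to the previous fold's parameters C_prev."""
    def objective(C):
        resid = y - X @ C
        return resid @ resid + lam * np.sum((C - C_prev) ** 2)
    return minimize(objective, x0=C_prev, method="BFGS").x
```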

In our earlier work [147], a prediction error based K-nearest neighbors testing approach is

proposed in order to identify an appropriate model for a new test sample. In this case study, we

used a weighted prediction error based K-nearest neighbors method. The weights provided are

inversely proportional to the distance from the test sample to the neighbor i.e. a neighbor, which

is nearest to the test sample will have more impact than a neighbor which is far from the test

sample. The clustering procedure is initiated assuming five linear models of order 13 (12 scaled

structural features and intercept) and it converges to three models of different orders.

It is interesting to note that in different operating regions of feature space, different features

are found to be significant. The details of the converged models along with the number of data

points (of model data set) that belong to each model is given in Table 5.3. It is interesting to

note that VABC, MW, and AMR are found to be insignificant in the whole feature space, which

is also validated by the neural network. It can be observed from Table 5.3 that though

MLFER_A, MLFER_BH, MLFER_S, and MLFER_E are found to be more significant in the

features set, no feature is found to be significant in the complete feature space. It is interesting

to note from the coefficients reported in Table 5.3 that solubility is directly proportional to

hydrogen bond acidity and combined polarizability and inversely proportional to the excessive

molar refraction in the complete feature space but with different proportionality constants.

All the data samples are associated with the final set of models obtained in the above

iterative procedure based on the final prediction error. Association of these data samples is

further useful to select a suitable model for any new test sample (i.e. for global test set or for

any novel molecule). Prediction accuracy of the final set of multiple models on model data set,

test data set and on overall data is given in Table 5.4. It can be seen from the table that multiple

models perform better than any other approach. The adjusted RMSE and R2 values demonstrate

that the final set of multiple linear models can be used to predict the solvation free energy of

Quinone molecules. Due to the capability of neural networks in identifying the non-linear

dynamics, the neural network (NN) based QSPR approach is shown to be better than the OLS

approach; however, both approaches were not robust to estimate solvation free energy for a

wide range of Quinone derivatives. It is interesting to note that though the estimates of the GC

method on the model data set are slightly better than those of the NN based approach, NN estimates are better

on the global test set. This can be attributed to the overfitting of contributions in the group

contribution approach.

The solvation free energy values obtained using several approaches are plotted against their

original values in Figure 5.3. It can be verified from the figure that multiple linear models have

better estimates compared to other approaches throughout the range of solvation free energy

values, especially for high solubility molecules (samples having solvation free energy values

ranging between -150 to -310 kJ/mol).

Table 5.3 Details of the final set of multiple models converged

Model   Active features and their coefficients in the final averaged models      No. of samples
1       Apol, Si, McG_Vol, MLFER_BH, MLFER_S, MLFER_E                             127
        [-153.670  51.156  101.901  -30.166  -8.982  24.902]
2       MLFER_A, MLFER_BH, MLFER_S, TSA                                           22
        [-22.249  35.523  -56.263  57.925]
3       Si, MLFER_A, MLFER_E, TopoPSA                                             176
        [-9.876  -5.696  16.634  -56.429]

Table 5.4 Performance metrics of several approaches for solvation free energy estimation
Approach             Data set   RMSE      Adj. RMSE   R2 value   Adj. R2
Group Contribution   Model      21.5371   23.0393     0.7806     0.7488
                     G test     17.2844   24.4439     0.8728     0.7424
                     Over all   20.7505   21.8820     0.8019     0.7796
Single linear        Model      28.5975   28.9561     0.6132     0.6034
                     G test     25.4517   26.7921     0.7242     0.6940
                     Over all   27.9921   28.2714     0.6395     0.6322
Neural network       Model      22.3767   23.4077     0.7632     0.7408
                     G test     16.2358   20.0070     0.8878     0.8285
                     Over all   21.2825   22.0546     0.7916     0.7762
Multiple linear      Model      12.1045   12.4340     0.9307     0.9269
                     G test     15.4585   17.3628     0.8983     0.8712
                     Over all   12.8508   13.1279     0.9240     0.9207

Figure 5.3 Solvation free energy original vs predicted using several approaches

5.4 Conclusion

In this study, various machine learning based QSPR approaches along with the GC approach

have been tested to predict solvation free energy of Quinone derivatives for further exploration

of Quinones in an inverse optimization framework. In this study, we used adjusted root mean

squared error and adjusted R2 values for an unbiased comparison of various approaches since

these approaches are parameter sensitive i.e. a neural network with more hidden layers can

capture highly complex behavior. It is observed from the reported metrics that multiple models

based QSPR approach performs better than the other approaches. It is identified from the GC

approach that substituting hydrogen atoms with groups like PO3H2, COOH, etc. can increase

the solubility of Quinones. Group contribution approach estimates are restricted by the number

of groups considered for this case study. These estimates can be further improved by assuming

non-linear contributions of groups. It can be observed from the QSPR case studies, that

structural features like overall hydrogen bond basicity (MLFER_BH) and acidity (MLFER_A),

combined polarizability (MLFER_S) and excessive molar refraction (MLFER_E) of the solute

affect the aqueous solubility more compared to the other structural features. This work can be

further extended to obtain a robust model to estimate reduction potential of Quinone molecules

and by using these two models other Quinone variants can be explored in an inverse multi-

objective optimization framework to obtain potential molecules for flow battery applications.

In this chapter, we have tested the efficacy of the clustering approach proposed in Chapter

4 to obtain the underlying partitions and significant features in each partition in a single

framework to improve the performance of prediction error based clustering approaches. Taking

note of the superior performance of piecewise linear models compared to the single linear and

neural network based approaches for model identification, in the next chapter, we propose a

piecewise SVM classifier based on prediction error to obtain nonlinear boundaries for binary

classification problems.

CHAPTER 6

An adaptive prediction error based multiple model SVM

classifier for binary classification problems

In recent years, the usage of machine learning has been increasing exponentially in several fields.

Use of machine learning for binary classification problems is of particular interest in multiple

areas. For example, predicting whether a patient is suffering from a disease using image

recognition or whether a loan can be issued to an applicant based on his/her past credit history

are examples of binary classification problems. Some of the widely used classifiers for general

classification problems (including the binary variety) can be seen in Figure 6.1. Support vector

machines (SVMs) [152] are useful for modeling both linear and non-linear boundaries between

the classes using kernels but the performance of SVM depends on the kernels chosen [153] and

there are no heuristics for selecting a suitable kernel for a given problem. Neural networks

have the flexibility to mimic highly non-linear functional behavior but the performance is

highly dependent on the selected network architecture and algorithmic parameters [154].

Interpretability is another drawback for neural network based classifiers.

6.1 Literature survey

Decision trees [155] and Random forests [156] are logic based classifiers, where the data is

divided into classes based on certain conditions at each node. Though logic based classifiers

have a simple design and good interpretability, they are very sensitive to the data. Piecewise classifiers [157] assume that the data in different regions of the feature space behave differently

and hence a classifier is derived for each subregion. Most of the piecewise classifiers assume

the number of regions as known, which may not be true in most real-life case studies. Logistic

regression [158] is a statistical classifier, which provides the probability for a data sample to

belong to a certain class. Though it takes less training time and provides good interpretability,

it underperforms in case of complex behavior between features and output. Kotsiantis et al.

[159] provided a detailed review of various classification approaches.

Multiple classifier systems (ensemble of classifiers) have also been proposed for various

applications based on the motivation that a pool of classifiers together can give better

predictions than a standalone classifier. Any multiple classifier system such as bagging,

random subspace, and class switching, etc. initially generate modified training data sets based

on certain criteria. Different classifiers are then built based on these data sets and the results are combined into a final decision, usually based on majority voting [160]. Bagging [161]

randomly generates a predefined number of training sets and each data set is used to build a

classifier. Random subspace [162] trains a predefined number of classifiers with a different

subset of features each time. It is anticipated that the classification boundaries are combined in

some way to mimic the original classification boundary. Nanni and Lumini [160], [163]

examined various ensemble classifier approaches to predict credit score and biometric

verification and concluded that the random subspace approach is more efficient than the other approaches. Several attempts have been made to combine SVM with various techniques such as

decision trees [164], [165], Markov models [166] and particle swarm optimization [167].

Ghodselahi [168] proposed a hybrid classifier combining SVM classification approach and

fuzzy C-means clustering algorithm to predict credit score. Initially, the data is clustered into

a predefined number of clusters using FCM approach and an SVM classifier is trained on the

data in each cluster and these clusters are then combined based on a weighted fusion agents

approach. Weights for each agent (classifier) are computed based on the membership of the

point to the corresponding cluster. Rahman and Tasnim [169] provided a comprehensive

review of various commonly used ensemble classifiers along with some application driven

classifiers.

Sklansky and Michelotti [157] introduced the idea of the piecewise linear classifier using a

two-stage approach, in which they first identify clusters of close-opposed pairs of data samples

and then a decision surface is constructed using adjacency matrix and switching theory. In the

past few decades, piecewise linear (PWL) classifiers have been applied for a broad range of

applications such as intelligent cameras, autonomous mobile robots, portable devices,

automated visual surveillance systems, monitoring systems, and industrial vision systems

[170], etc. to approximate the non-linear decision boundary. The advantages of PWL classifiers

are: easy implementation, low memory requirements and real-time classification [171].

Currently existing PWL classifiers can be categorized into two types. The first category of

methods [157], [172]–[174] follow a two-stage procedure, in which, in the initial stage, they

obtain a classifier in each segment and then in the final stage, they combine the identified

hyperplanes to obtain the final decision boundary. Of these approaches, Bagirov et al. [174]

approach of applying a max-min separability algorithm only on indeterminate regions has

drawn considerable attention due to its simple design and reduced computational complexity.

Initially, a hyper-box is identified for each class and then in the regions where classes overlap

a max-min separability is used to separate these data into classes.

The second category of methods [175]–[177] solves a single optimization problem to obtain the whole piecewise linear boundary. These methods assume that the number of models is known a priori. The optimization problem formulated in such methods is very complex, thus resulting in locally optimal solutions, a major drawback of these methods. Astorino and Gaudioso

[175] proposed a polyhedral based binary classifier, in which data from a particular class is

enclosed using a predefined number of piecewise classifiers and data samples outside the

polyhedral are assumed to belong to the other classes. Bagirov [176] proposed a max-min

separability approach, which is theoretically proven to obtain the global optimal solution for

classification provided the classes are completely separable, i.e., there exists no overlap in the

feature space of any two classes. Huang et al. [177] theoretically proved that any PWL

boundary can be identified using PWL feature mapping. They proposed two PWL classifier

approaches by combining the idea of PWL mapping with the traditional SVM [152] and least

squares SVM [178] and compared the performance with the existing literature approaches on

various synthetic and real data sets. The performance of these algorithms depend on the non-

linear parameters that are selected. Prior knowledge will allow these parameters to be selected

appropriately, tuning the parameters randomly when no prior information is available is usually

difficult. Most of the existing PWL classifiers are designed for binary classification problems.

Classifiers designed for binary classification problems can be further extended to multi-class

by decomposing the multi-class problems into a series of binary classification problems [179].

Kostin [170] proposed a binary proportion based tree approach to extend binary piecewise

classifiers to solve multi-class problems.

In the case of function approximation, several algorithms have been proposed based on both

Euclidian and prediction error based approaches to identify non-linear behavior using

piecewise linear assumptions [79], [80]. In this chapter, we propose a new multiple model

piecewise SVM approach for binary classification problems and test this approach on various

synthetic and real data sets. The chapter is organized as follows: initially, a brief

introduction to classification and various techniques for solving classification problems with

linear, non-linear and piecewise linear boundaries are provided. In section 2, the mathematical

representation of traditional SVM for a binary classification problem is provided. In section 3,

the details of proposed piecewise SVM and a testing strategy are provided. In section 4, the

performance of the proposed algorithm on both synthetic and real data sets are provided.

Finally, this chapter is concluded with comments on the efficiency of the proposed approach.

[Figure 6.1 schematic: a taxonomy of widely used classifiers (SVM, neural networks, statistical classifiers, decision trees, and piecewise classifiers), each illustrated with a representative decision boundary]

Figure 6.1 Some of the widely used machine learning techniques for classification problems

6.2 Support vector machines

Support vector machine (SVM) identifies a hyperplane (linear boundary) such that the distance from the plane to the nearest data samples in both classes is maximum. This maximization problem is scaled and converted to a minimization problem as follows:

\min_{w,b} \; \frac{1}{2} \| w \|^2 \quad \text{s.t.} \quad y_i (w x_i + b) \geq 1, \; i = 1 \ldots N    (5.1)

where x_i is the input vector of sample i and y_i is the output of that sample, which is either 1 or -1, w is the coefficient vector representing the separating hyperplane, and b is the bias. This problem is reformulated using the Lagrange multipliers approach:

\min_{w,b} \; \frac{1}{2} \| w \|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w x_i + b) - 1 \right] \quad \text{s.t.} \quad \alpha_i \geq 0, \; i = 1 \ldots N    (5.2)

At the stationary point, equating the gradient of the primal problem with respect to both w and b to zero gives:

\frac{\partial f}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i    (5.3)

\frac{\partial f}{\partial b} = \sum_{i=1}^{N} \alpha_i y_i = 0    (5.4)

Substituting these values in Equation (5.2), the dual problem is solved as follows:

\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i' x_j \quad \text{s.t.} \quad \alpha_i \geq 0 \;\;\&\;\; \sum_{i=1}^{N} \alpha_i y_i = 0, \; i = 1 \ldots N    (5.5)

The data samples with non-zero α_i in the optimal solution are called support vectors, and the support vector plane w is estimated using the support vectors. In the case of linearly non-separable data, a soft margin SVM is identified such that the number of misclassifications is minimum.

The optimization problem for the soft margin SVM is formulated as follows:

\min_{w,b,\xi} \; \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w x_i + b) \geq 1 - \xi_i \;\;\&\;\; \xi_i \geq 0, \; i = 1 \ldots N    (5.6)

To relax the constraints in Equation (5.1), positive slack variables (ξ_i) are used and the minimization of the slack variables is introduced into the objective function using a weight C. If C is considered as infinite, then the problem tends to the traditional SVM solved earlier in Equation (5.1). As the C value tends to zero, the width of the soft margin increases and the misclassifications are neglected. This problem is reformulated using the Lagrange multipliers approach as follows:

\min_{w,b} \; \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i (w x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i \quad \text{s.t.} \quad \alpha_i, \mu_i, \xi_i \geq 0, \; i = 1 \ldots N    (5.7)

At the stationary point, equating the gradient of the primal with respect to w, b, and ξ to zero results in the following:

w = \sum_{i=1}^{N} \alpha_i y_i x_i; \quad \sum_{i=1}^{N} \alpha_i y_i = 0; \quad \alpha_i = C - \mu_i; \quad \text{s.t.} \quad \alpha_i, \mu_i, \xi_i \geq 0, \; i = 1 \ldots N    (5.8)

Substituting these values in Equation (5.7), the dual problem is solved as follows:

\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i' x_j \quad \text{s.t.} \quad C \geq \alpha_i \geq 0 \;\;\&\;\; \sum_{i=1}^{N} \alpha_i y_i = 0, \; i = 1 \ldots N    (5.9)

The data samples with non-zero α_i in the optimal solution are called support vectors, and the upper bounds C on the α_i are termed box constraints.
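In practice, a soft margin SVM of this form can be fit with an off-the-shelf solver. The snippet below is purely illustrative (the synthetic data and scikit-learn's SVC are assumptions, not part of this work); C is the box constraint of Equation (5.6).

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, nearly linearly separable data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # C: box constraint weight
w, b = clf.coef_[0], clf.intercept_[0]        # separating hyperplane
n_sv = len(clf.support_vectors_)              # samples with non-zero alpha
```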

6.3 Multiple model based SVM

In our earlier work [78], a prediction error based soft clustering approach is proposed to

identify the non-linear behavior using piecewise linear assumption for function approximation

problems. In this clustering algorithm [78], a least squares objective for update of multiple

models is formulated such that the weights for the data samples are proportional to prediction

errors with respect to each model. It is interesting to note that in the case of function

approximation, all the data samples in training set will be considered to obtain the models,

whereas in the case of classification the data samples at boundaries alone (implicitly) will be

considered to obtain the classification boundary. This fact constrains us from modeling the

multiple model classifier in a complete fuzzy framework. In this section, a new hybrid multiple

model based SVM approach is proposed by combining the ideas of soft and hard clustering

procedures. In this approach during update of clusters, it is assumed that a data sample belongs

to all the models that have classified the data sample correctly and will be used for further

update of those models i.e. membership for that data sample for all such models will be

considered as one and for remaining models as zero (fuzzy clustering idea). For the data

samples that are wrongly classified by all the models, these data samples are assigned to the

models that contain the nearest support vector to the data sample i.e. for all the wrongly

classified data samples, membership will be one only for one model and zero for the remaining

models (hard clustering). Once the cluster update is terminated, data samples are assigned to

final SVM models based on hard clustering procedure.

In this approach, the true number of models is not assumed beforehand but identified in an

iterative fashion using the accuracy of the classifier based on the test data set. To avoid local

optimal convergence with respect to the number of models, for each 𝑁 value we test the

accuracy for the next two values i.e. 𝑁 + 1 and 𝑁 + 2 before termination. The pseudo code of

the proposed approach is as follows:

Start: Define the algorithmic parameters, such as the maximum number of multiple models (N_max) and the maximum number of cluster updates (M_iter).

Initialize N with 1

While (N ≤ N_max):

    Divide the training data randomly into N equal partitions

    For iter in 1 to M_iter:

        Obtain SVM models on the data samples that are associated with each model (partition).

        Calculate the prediction error for each data sample with respect to all N SVM models:

        PE_{i,j} = Err_{i,j} \cdot \max(\varepsilon, D_{i,j}), \quad i = 1 \ldots n, \; j = 1 \ldots N    (5.10)

        Err_{i,j} = \begin{cases} 0 & \text{if } Y_{i,j}^{p} = Y_i \\ 1 & \text{if } Y_{i,j}^{p} \neq Y_i \end{cases}; \quad D_{i,j} = \min\left( \bar{D}_{i,j}^{sv} \right); \quad \bar{D}_{i,j}^{sv} = \left\| x_i - \bar{x}_{j,sv} \right\|^2    (5.11)

        where ε is a small value that restricts the error from being zero for wrongly classified data points, \bar{D}_{i,j}^{sv} is the distance vector from data sample i to all the support vectors of model j, and \bar{x}_{j,sv} is the input features corresponding to all the support vectors of model j.

        Calculate the membership for each data sample with respect to all N SVM models:

        \mu_{i,j} = \begin{cases} \begin{cases} 1 & j: PE_{i,j} = 0 \\ 0 & \text{else} \end{cases} & \text{if } \min(PE_i) = 0 \\ \begin{cases} 1 & j: PE_{i,j} = \min(PE_i) \\ 0 & \text{else} \end{cases} & \text{if } \min(PE_i) \neq 0 \end{cases}    (5.12)

        Assign the data samples to each model for which the membership value is 1.
    End

    To assign the data samples to the final SVM models, modify the membership values of all the correctly classified data points as follows:

    \mu_{i,j} = \begin{cases} 1 & j: D_{i,j} = \min(D_i) \\ 0 & \text{else} \end{cases} \quad \text{if } \min(PE_i) = 0, \; i = 1 \ldots n    (5.13)

    Reassign the data samples to each model for which the membership value is 1.

    Obtain the accuracy of the given models on the testing data set using any testing strategy.

    If the accuracy values for two consecutive numbers of models are less than the earlier ones, then break; else, increase N by 1.
End

Report the true number of models and their corresponding SVM models. Stop
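The hybrid soft/hard assignment of Equation (5.12) reduces to a few lines; a sketch, assuming PE is the n-by-N prediction error matrix of Equation (5.10):

```python
import numpy as np

def memberships(PE):
    """Hybrid membership of Equation (5.12) for an (n, N) error matrix PE."""
    n, N = PE.shape
    mu = np.zeros((n, N))
    for i in range(n):
        if PE[i].min() == 0:
            mu[i, PE[i] == 0] = 1.0        # soft: every model that classifies
        else:                              # sample i correctly keeps it
            mu[i, np.argmin(PE[i])] = 1.0  # hard: model with nearest support vector
    return mu
```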

In the proposed algorithm, the boundaries between the final hyperplanes (SVM models) are

not explicitly identified due to the complexity involved; instead, we use a non-parametric

testing approach to identify a suitable hyperplane. We propose a weighted 𝐾 nearest neighbours

(WKNN) approach to identify the model for classifying any new data sample. The weight for

a neighbour is inversely proportional to the distance from the data sample to the neighbour.

Initially, calculate distances from the test sample to all the samples in the training data and

obtain the 𝐾 nearest samples along with the distances. Next, calculate the prediction errors for

all 𝐾 nearest samples with respect to each model. Finally, obtain the overall error for each

model using the following expression and select the model with the least overall error for

classifying the test sample.

OE_j = \sum_{k=1}^{K} W_k \, PE_{k,j}, \quad j = 1 \ldots N_{true}; \quad \text{where} \quad W_k = \frac{1}{\left\| x_k - x_{test} \right\|^2}    (5.14)

6.4 Evaluation of proposed binary classifier

In this section, we evaluate the performance of the proposed approach on various synthetic

and real data sets. Algorithmic parameters for all the case studies are fixed as follows – the

maximum number of piecewise linear models (𝑁𝑚𝑎𝑥 ) is fixed to 10 (interestingly all the case

studies terminated earlier i.e. the true number of piecewise linear models are less than 10), the

maximum number of iterations for cluster update is set to 100 and 𝐾 value for the proposed

WKNN testing approach is fixed to 15.

6.4.1 Synthetic case studies

In this section, we report computational studies that test the efficacy of the proposed

approach using a two-layer testing approach on two synthetic case studies in which the data is

divided using a polynomial boundary. The objective for these tests is to verify if the proposed

piecewise SVM can identify an appropriate number of hyperplanes that can mimic the non-

linear boundary and corresponding model parameters without any tuning parameters and prior

assumptions about the system. Initially, data is randomly divided into ‘model’ and ‘global test’

data sets with 80% and 20% of data samples respectively. Model data set is used to obtain the

piecewise SVM models in association with K-fold (K as 4) validation approach. In each run,

the model data set is again randomly divided into K-equal partitions and each time data in K-1

folds are used to train the model and the remaining to test. The efficacy of obtained models is

tested on both K-fold test data set and global test set using the proposed testing approach and

accuracy values averaged over all the K-folds are reported.

6.4.1.1 Case study 1 (second-order polynomial)

This synthetic case study contains two input variables and the boundary between the two

classes is a second order polynomial. We obtain the non-linear boundary using piecewise SVM

without any kernel assumptions using the proposed algorithm. 1000 samples of the input features x_1 and x_2 are randomly generated using a uniform distribution in the range of [-5, 5]. If a particular data sample is in the positive half space of the polynomial x_1^2 - x_2 = 5, then it belongs to class 1, otherwise to class 2. Out of the 1000 random samples, 566 belong to class 1 and the remaining 434 to class 2.
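For reference, this data set can be regenerated as follows (the random seed is arbitrary):

```python
import numpy as np

# Case study 1: 1000 uniform samples in [-5, 5]^2, labelled by the sign of
# x1^2 - x2 - 5 (class 1 in the positive half space, class 2 otherwise).
rng = np.random.default_rng(42)
X = rng.uniform(-5, 5, size=(1000, 2))
y = np.where(X[:, 0] ** 2 - X[:, 1] > 5, 1, 2)
```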

The distribution of data and the averaged accuracy on K-fold train, test, and global tests are

provided in Figure 6.2. It can be observed from the bar chart that the averaged accuracies for

K-fold test set are 0.6288, 0.9175, 0.9125, 0.9388, 0.8875, and 0.8988 for the number of models being 1 to 6 respectively; hence it is concluded that the non-linear boundary obtained using four piecewise linear models is the optimal solution for the given data. It can be observed from the averaged accuracies of the K-fold test set and the global test set (0.9262 and 0.9162 for the number of models as 2 and 3 respectively) that the second order polynomial curvature can be better expressed using two models compared to three for the given data.

Figure 6.2 Data distribution and averaged accuracy versus number of models for case study 1

6.4.1.2 Case study 2 (third-order polynomial)

This synthetic case study contains three input variables, and the boundary between the two classes is a third-order polynomial. We obtain the non-linear boundary using the proposed piecewise SVM algorithm without any kernel assumptions. 2000 samples of the input features x1, x2, and x3 are randomly generated using a uniform distribution in the range [-5, 5]. If a particular data sample lies in the positive half space of the polynomial x1^3 + x2^2 - 2x3^2 + 2x1x2x3 + 4x1x3 - 3x1 + 2x2 = 0, then it belongs to class 1, otherwise to class 2. Out of the 2000 random samples, 1160 belong to class 1 and the remaining 840 to class 2. The distribution of the data and the averaged accuracy of the K-fold train, test, and global test sets are provided in Figure 6.3. It can be observed from the bar chart that the averaged accuracies for the K-fold test set are 0.6594, 0.9175, 0.9200, 0.9156, and 0.9200 for the number of models ranging from 1 to 5, respectively; hence, it is concluded that three piecewise linear models are sufficient to mimic the non-linear boundary between the two classes (five models achieve the same test accuracy at a higher computational cost).

6.4.2 Real-life case studies

The efficacy of the proposed approach is tested on several real-life data sets obtained from the UCI machine learning repository [180]. We compare the performance of the proposed approach with accuracies reported in the literature. Some of the data sets are already divided into train and test sets; for the remaining data sets, to be consistent with the literature methods, we randomly divide the data into train and test sets with the same number of data samples as in the literature. Accuracy values for the test data of the different data sets using the various approaches are reported in Table 6.1. For the sake of brevity, we only report the accuracy values with the true number of models identified using the proposed approach. The accuracy values reported for the existing approaches are taken from Huang et al. [177]. It can be observed from the table that the proposed approach compares favourably with the other approaches on most of the real-life data sets.

Table 6.1 Accuracy of various approaches on the test data of real-life data sets

                       No. of samples                       Huang et al. [177]     Proposed
Data set        n      Train    Test    kNN     Adaboost   pwl-svm   pwl-csvm   approach   N_true
Monk1           6      124      432     0.828   0.692      0.750     0.736      0.833      7
Monk2           6      169      432     0.815   0.604      0.765     0.769      0.771      3
Monk3           6      122      432     0.824   0.940      0.972     0.972      0.889      2
Haberman        3      153      153     0.673   0.765      0.765     0.758      0.791      4
Ionosphere      33     176      175     0.857   0.867      0.857     0.829      0.880      4

6.5 Conclusions

In this work, a prediction-error based piecewise SVM approach is proposed to identify the non-linear boundary between the two classes without any assumptions about the system. The proposed algorithm uses a combination of soft and hard membership in its update rule. The efficacy of the proposed approach is tested on various synthetic and real-life case studies. A two-layer testing approach along with K-fold validation allows the algorithm to identify robust models. The accuracy values on the real-life case studies show that the proposed algorithm compares favourably with the existing approaches. It is interesting to note that, for the second-order polynomial case study, increasing the number of piecewise models from two to three decreased the accuracy, but further increasing it to four resulted in a better solution. It should be noted that increasing the number of piecewise models increases the training and testing time; hence, there is a trade-off between increased accuracy and computational effort.

Figure 6.3 Data distribution and averaged accuracy versus number of models for case study 2

CHAPTER 7

Conclusions

In this chapter, we provide key observations summarizing the work done in this thesis.

7.1 Incorporation of process information in the PCA framework

In chapter 2, we proposed model identification schemes to incorporate available process

information in the PCA framework. The efficacy of the proposed approaches are tested on

several case studies and results suggest an improvement over conventional PCA in terms of

better identification of the true underlying models. Further, models obtained using proposed

approaches are found to perform better than traditional PCA models for fault identification.

7.2 Prediction of drug solubility in binary solvent systems

In chapter 3, a generalized Jouyban-Acree model [15] is used to predict the solubility of a solute (i.e., a drug molecule) in a binary solvent system when the pure solubility values in both solvents are known. The original model cannot be used to predict solubility for systems whose model parameters have not been estimated earlier. In this work, we generalize the model parameters as functions of the structural features of the compounds involved in the system. Once these generalized models are estimated, information about the structural features and pure solubility values can be used to predict the solubility for any new solute and mixed solvent system at a given temperature. In essence, generalizing the model parameters as functions of structural features provides the flexibility to extrapolate the functional behavior of drug solubility with respect to those structural features. The framework for the generalization of the Jouyban-Acree model [15] using

machine learning approaches is provided in Figure 7.1.

Figure 7.1 Generalization of the first-principles Jouyban-Acree model using machine learning approaches to predict drug solubility

A genetic algorithm is used to identify

significant features. It is assumed that the solubility values are piecewise linearly dependent on the structural features, and the model coefficients are identified using a modified PE-based clustering algorithm. A two-layer testing approach is used to test the robustness of the obtained models. A comparison of the MPD values obtained using the final set of multiple models for various binary systems with the existing approaches suggests that the final set of models can be used to predict the solubility of new drugs in a wide variety of binary solvent systems.
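For reference, the Jouyban-Acree model referred to above has the standard published form below (reproduced here, not a new result), where f1 and f2 are the solvent fractions, x_{1,T} and x_{2,T} are the solubilities in the pure solvents at temperature T, and the J_i are the interaction constants that chapter 3 generalizes as functions of structural features:

```latex
\ln x_{m,T} = f_1 \ln x_{1,T} + f_2 \ln x_{2,T}
            + \frac{f_1 f_2}{T} \sum_{i=0}^{2} J_i \,(f_1 - f_2)^i
```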

7.3 Prediction error based clustering approach with statistical analysis

In chapter 4, we propose a clustering approach to identify the input partitions and the significant features in each partition, along with the individual model parameters. The proposed clustering approach does not assume any information regarding the number of models or the model orders. The proposed approach is examined on various benchmark case studies to demonstrate that the true model orders can be identified in each of the partitions. It is also noted that the efficacy of the clustering approach increases due to the removal of insignificant variables in each phase. The proposed approach also provides very interesting insights about the process, such as delayed responses and redundant variables.
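A minimal sketch of the prediction-error based reassignment at the core of this approach is given below, assuming linear models fit by least squares; the model-merging step and the statistical tests of chapter 4 are omitted, and all names are illustrative.

```python
import numpy as np

def pe_cluster(X, y, n_models, n_iter=100, seed=0):
    """Alternate least-squares fits with hard, prediction-error based
    reassignment of samples to models (merging and statistical tests omitted)."""
    rng = np.random.default_rng(seed)
    Xa = np.column_stack([X, np.ones(len(y))])      # append an intercept column
    labels = rng.integers(n_models, size=len(y))    # random initial membership
    for _ in range(n_iter):
        thetas = []
        for j in range(n_models):
            idx = labels == j
            if idx.sum() >= Xa.shape[1]:            # enough samples to fit
                thetas.append(np.linalg.lstsq(Xa[idx], y[idx], rcond=None)[0])
            else:                                   # degenerate cluster: dummy model
                thetas.append(np.zeros(Xa.shape[1]))
        PE = np.column_stack([(y - Xa @ t) ** 2 for t in thetas])
        new = PE.argmin(axis=1)                     # reassign to least-error model
        if np.array_equal(new, labels):             # converged
            break
        labels = new
    return labels, thetas
```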

7.4 Prediction of solvation free energy of Quinone derivatives

Quinone derivatives are in demand as electrolytes for flow battery technology due to their ability to transfer two electrons, which results in high energy densities compared to the well-established vanadium redox flow battery technology. In order to explore Quinone derivatives with multiple functional groups in an inverse optimization framework, we need to establish robust and quick structure-property relationships to predict solubility, reduction potential, etc. In chapter 5, we demonstrated the efficacy of several structure-property based relationships, such as group contribution (GC) and QSPR approaches, for predicting the solvation free energy of Quinone derivatives. Though the group contribution approach proved to be adequate when compared to single linear and neural network based QSPR, the applicability of the GC approach is limited to compounds that consist only of groups from a predefined set. Multiple-model based QSPR proved to be the most efficient of all the approaches due to its ability to operate with different model structures (i.e., different significant features) in different partitions. The structure-property relationship frameworks used in this work are depicted in Figure 7.2. Though these frameworks are used to estimate the solvation free energy of Quinones in this work, they are general in nature and can be useful in other problems dealing with structure-property predictions.
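The GC estimate in Figure 7.2 is a simple weighted sum of group counts, f(X) = C0 + Σ Ni Ci; a minimal sketch follows, with purely hypothetical group coefficients (not fitted values from this work).

```python
# Group-contribution estimate f(X) = C0 + sum_i N_i * C_i.
# All coefficient values below are hypothetical placeholders, not fitted values.
C0 = -1.20
contrib = {"-F": -0.35, "-Cl": -0.10, "-C2H3": 0.20, "-CHO": -0.55}
counts = {"-F": 2, "-CHO": 1}          # group occurrences in a candidate molecule

f_X = C0 + sum(n * contrib[g] for g, n in counts.items())
print(f_X)                             # estimated property, e.g. solvation free energy
```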

7.5 Piecewise linear SVM

SVMs are among the most widely used binary classifiers due to their capability of identifying complex boundaries between the classes using kernels. In chapter 6, we proposed a piecewise SVM to identify non-linear boundaries in binary classification problems. The proposed approach is tested on both synthetic and real-life case studies to show the efficacy of piecewise SVM models in mimicking boundaries characterized by polynomial functions of second and third order as well as other realistic non-linear functions. The prediction accuracies suggest that the proposed PE-based piecewise SVM compares favourably with the existing literature approaches. Further, the proposed approach does not require any prior knowledge about the data and is hence useful for a wide range of applications.

Figure 7.2 Various structure-property relationship based frameworks to estimate the properties of chemical compounds: the group contribution (GC) approach, f(X) = C0 + Σ(i=1..n) Ni Ci, built on group counts (e.g., -F, -Cl, -C2H3, -CHO), and QSPR approaches (linear, nonlinear, and piecewise linear) built on descriptors such as MW, TSA, HOMO, IP, and atom and bond counts

7.6 Future scope

In this section, we highlight some of the future prospects based on the work done in this thesis.

 Initially, we discussed various approaches to incorporate process information in the case of linear model identification using PCA. Incorporating such information in a suitable framework for non-linear model identification is of future interest.

 In the case of the prediction of drug solubility, the Jouyban-Acree model [15] is designed assuming that the solubility of the solute in the pure solvents is known. This information will not be available for novel drugs, which are yet to be synthesized. So, obtaining a generalized QSPR to estimate pure solubility values will be useful for designing novel drugs in a computational framework.

 In the case of the prediction-error based clustering approach with statistical analysis, we initialize with a sufficiently high number of models and merge like models to identify the true number of models. An alternative to this approach is to identify the true number of models in an incremental fashion, based on the prediction errors of the obtained models on a test data set in each iteration.

 A robust QSPR has been identified to predict solvation free energy of Quinone

derivatives. Obtaining structure-property relationship to identify other properties of

interest such as reduction potential will be beneficial to explore Quinone derivatives

in an inverse multi-objective optimization framework.

 Incorporation of domain knowledge in classification problems is a promising future

area for research.

REFERENCES

[1] J. H. Lee, J. Shin, and M. J. Realff, “Machine learning: Overview of the recent
progresses and implications for the process systems engineering field,” Comput. Chem.
Eng., vol. 114, pp. 111–121, 2018.
[2] Y. Han, Q. Zeng, Z. Geng, and Q. Zhu, “Energy management and optimization modeling
based on a novel fuzzy extreme learning machine: Case study of complex petrochemical
industries,” Energy Convers. Manag., vol. 165, pp. 163–171, 2018.
[3] R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, and C. Kim, “Machine
learning in materials informatics: recent applications and prospects,” npj Comput.
Mater., vol. 3, no. 1, p. 54, 2017.
[4] J. Lee, H. Davari, J. Singh, and V. Pandhare, “Industrial Artificial Intelligence for
industry 4.0-based manufacturing systems,” Manuf. Lett., vol. 18, pp. 20–23, 2018.
[5] T. B. Trafalis and H. Ince, “Support vector machine for regression and applications to
financial forecasting,” in Proceedings of the IEEE-INNS-ENNS International Joint
Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and
Perspectives for the New Millennium, 2000, vol. 6, pp. 348–353 vol.6.
[6] A. Lavecchia, “Machine-learning approaches in drug discovery: methods and
applications,” Drug Discov. Today, vol. 20, no. 3, pp. 318–331, 2015.
[7] M. W. Libbrecht and W. S. Noble, “Machine learning applications in genetics and
genomics,” Nat. Rev. Genet., vol. 16, p. 321, May 2015.
[8] P. Czop, G. Kost, D. Sławik, and G. Wszołek, “Formulation and identification of first-
principle data-driven models,” J. Achiev. Mater. Manuf. Eng., vol. 44, no. 2, pp. 179–
186, 2011.
[9] M. von Stosch, R. Oliveira, J. Peres, and S. Feyo de Azevedo, “Hybrid semi-parametric
modeling in process systems engineering: Past, present and future,” Comput. Chem.
Eng., vol. 60, pp. 86–101, 2014.
[10] W. H. Joerding and J. L. Meador, “Encoding a priori information in feedforward
networks,” Neural Networks, vol. 4, no. 6, pp. 847–856, 1991.
[11] D. C. Psichogios and L. H. Ungar, “A hybrid neural network‐first principles approach
to process modeling,” AIChE J., vol. 38, no. 10, pp. 1499–1511, 1992.
[12] H.-T. Su, N. Bhat, P. A. Minderman, and T. J. McAvoy, “Integrating neural networks
with first principles models for dynamic modeling,” in Dynamics and Control of
Chemical Reactors, Distillation Columns and Batch Processes, Elsevier, 1993, pp. 327–
332.
[13] S. Milanic, S. Strmcnik, D. Sel, N. Hvala, and R. Karba, “Incorporating prior knowledge
into artificial neural networks—an industrial case study,” Neurocomputing, vol. 62, pp.
131–151, 2004.
[14] O. Kahrs and W. Marquardt, “The validity domain of hybrid models and its application
in process optimization,” Chem. Eng. Process. Process Intensif., vol. 46, no. 11, pp.
1054–1066, 2007.

[15] A. Jouyban-Gharamaleki and W. E. Acree Jr, “Comparison of models for describing
multiple peaks in solubility profiles,” Int. J. Pharm., vol. 167, no. 1, pp. 177–182, 1998.
[16] S. Chen, S. A. Billings, and P. M. Grant, “Non-linear system identification using neural
networks,” Int. J. Control, vol. 51, no. 6, pp. 1191–1214, 1990.
[17] K. S. Narendra and K. Parthasarathy, “Identification and control of dynamical systems
using neural networks,” IEEE Trans. Neural Networks, vol. 1, no. 1, pp. 4–27, 1990.
[18] W. Xiong, L. Chen, F. Liu, and B. Xu, “Multiple model identification for a high purity
distillation column process based on EM algorithm,” Math. Probl. Eng., vol. 2014, 2014.
[19] B. Zhang and Z. Mao, “Modeling and control of Wiener systems using multiple models
and neural networks: application to a simulated pH process,” Ind. Eng. Chem. Res., vol.
55, no. 38, pp. 10147–10159, 2016.
[20] S. W. Choi, C. Lee, J.-M. Lee, J. H. Park, and I.-B. Lee, “Fault detection and
identification of nonlinear processes based on kernel PCA,” Chemom. Intell. Lab. Syst.,
vol. 75, no. 1, pp. 55–67, 2005.
[21] U. Kruger, Y. Zhou, and G. W. Irwin, “Improved principal component monitoring of
large-scale processes,” J. Process Control, vol. 14, no. 8, pp. 879–888, 2004.
[22] J.-M. Lee, C. Yoo, S. W. Choi, P. A. Vanrolleghem, and I.-B. Lee, “Nonlinear process
monitoring using kernel principal component analysis,” Chem. Eng. Sci., vol. 59, no. 1,
pp. 223–234, 2004.
[23] M. R. Maurya, R. Rengaswamy, and V. Venkatasubramanian, “Fault diagnosis by
qualitative trend analysis of the principal components,” Chem. Eng. Res. Des., vol. 83,
no. 9, pp. 1122–1132, 2005.
[24] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” J. Comput.
Graph. Stat., vol. 15, no. 2, pp. 265–286, 2006.
[25] J. Shi and W. Song, “Sparse principal component analysis with measurement errors,” J.
Stat. Plan. Inference, vol. 175, pp. 87–99, 2016.
[26] D. Shen, H. Shen, and J. S. Marron, “Consistency of sparse PCA in high dimension, low
sample size contexts,” J. Multivar. Anal., vol. 115, pp. 317–333, 2013.
[27] I. Jolliffe, Principal component analysis. Wiley Online Library, 2005.
[28] G. Chen and S.-E. Qian, “Denoising of hyperspectral imagery using principal
component analysis and wavelet shrinkage,” IEEE Trans. Geosci. Remote Sens., vol. 49,
no. 3, pp. 973–980, 2011.
[29] L. Zhang, W. Dong, D. Zhang, and G. Shi, “Two-stage image denoising by principal
component analysis with local pixel grouping,” Pattern Recognit., vol. 43, no. 4, pp.
1531–1549, 2010.
[30] W. Ku, R. H. Storer, and C. Georgakis, “Disturbance detection and isolation by dynamic
principal component analysis,” Chemom. Intell. Lab. Syst., vol. 30, no. 1, pp. 179–196,
1995.
[31] S. Narasimhan and S. L. Shah, “Model identification and error covariance matrix
estimation from noisy data using PCA,” Control Eng. Pract., vol. 16, no. 1, pp. 146–
155, 2008.
[32] J. C. Liao, R. Boscolo, Y.-L. Yang, L. M. Tran, C. Sabatti, and V. P. Roychowdhury,
“Network component analysis: reconstruction of regulatory signals in biological

systems,” Proc. Natl. Acad. Sci., vol. 100, no. 26, pp. 15522–15527, 2003.
[33] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, “Singing-voice
separation from monaural recordings using robust principal component analysis,” in
2012 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2012, pp. 57–60.
[34] N. Locantore et al., “Robust principal component analysis for functional data,” Test, vol. 8, no. 1, pp. 1–73, 1999.
[35] F. De la Torre and M. J. Black, “Robust principal component analysis for computer
vision,” in Proceedings Eighth IEEE International Conference on Computer Vision.
ICCV 2001, 2001, vol. 1, pp. 362–369.
[36] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?,” J.
ACM, vol. 58, no. 3, p. 11, 2011.
[37] R. Jenatton, G. Obozinski, and F. Bach, “Structured sparse principal component
analysis,” in Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics, 2010, pp. 366–373.
[38] X. Qi, R. Luo, and H. Zhao, “Sparse principal component analysis by choice of norm,”
J. Multivar. Anal., vol. 114, pp. 127–160, 2013.
[39] C. R. Rao, “The use and interpretation of principal component analysis in applied
research,” Sankhyā Indian J. Stat. Ser. A, pp. 329–358, 1964.
[40] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin, “A modified principal component
technique based on the LASSO,” J. Comput. Graph. Stat., vol. 12, no. 3, pp. 531–547,
2003.
[41] D. M. Witten, R. Tibshirani, and T. Hastie, “A penalized matrix decomposition, with
applications to sparse principal components and canonical correlation analysis,”
Biostatistics, vol. 10, no. 3, pp. 515–534, 2009.
[42] R. W. Serth and W. A. Heenan, “Gross error detection and data reconciliation in steam‐
metering systems,” AIChE J., vol. 32, no. 5, pp. 733–742, 1986.
[43] S. Sun, D. Huang, and Y. Gong, “Gross error detection and data reconciliation using
historical data,” Procedia Eng., vol. 15, pp. 55–59, 2011.
[44] J. J. Downs and E. F. Vogel, “A plant-wide industrial process control problem,” Comput.
Chem. Eng., vol. 17, no. 3, pp. 245–255, 1993.
[45] Y. Kawabata, K. Wada, M. Nakatani, S. Yamada, and S. Onoue, “Formulation design
for poorly water-soluble drugs based on biopharmaceutics classification system: Basic
approaches and practical applications,” Int. J. Pharm., vol. 420, no. 1, pp. 1–10, 2011.
[46] A. K. Nayak and P. P. Panigrahi, “Solubility enhancement of etoricoxib by cosolvency
approach,” ISRN Phys. Chem., vol. 2012, no. Article ID 820653, p. 5 pages, 2012.
[47] Z. Li and P. I. Lee, “Investigation on drug solubility enhancement using deep eutectic
solvents and their derivatives,” Int. J. Pharm., vol. 505, no. 1, pp. 283–288, 2016.
[48] T. Loftsson, “Drug solubilization by complexation,” Int. J. Pharm., vol. 531, no. 1, pp.
276–280, 2017.
[49] D. P. Elder, R. Holm, and H. L. de Diego, “Use of pharmaceutical salts and cocrystals
to address the issue of poor solubility,” Int. J. Pharm., vol. 453, no. 1, pp. 88–100, 2013.

[50] K. T. Savjani, A. K. Gajjar, and J. K. Savjani, “Drug Solubility: Importance and
Enhancement Techniques,” ISRN Pharm., vol. 2012, p. 195727, Jul. 2012.
[51] V. R. Vemula, V. Lagishetty, and S. Lingala, “Solubility enhancement techniques,” Int.
J. Pharm. Sci. Rev. Res., vol. 5, no. 1, pp. 41–51, 2010.
[52] L. Di, P. V Fish, and T. Mano, “Bridging solubility between drug discovery and
development,” Drug Discov. Today, vol. 17, no. 9, pp. 486–495, 2012.
[53] H. D. Williams et al., “Strategies to Address Low Drug Solubility in Discovery and
Development,” Pharmacol. Rev., vol. 65, no. 1, pp. 315 LP – 499, Jan. 2013.
[54] W. L. Jorgensen and E. M. Duffy, “Prediction of drug solubility from Monte Carlo
simulations,” Bioorg. Med. Chem. Lett., vol. 10, no. 11, pp. 1155–1158, 2000.
[55] W. L. Jorgensen and E. M. Duffy, “Prediction of drug solubility from structure,” Adv.
Drug Deliv. Rev., vol. 54, no. 3, pp. 355–366, 2002.
[56] Y. Ran and S. H. Yalkowsky, “Prediction of Drug Solubility by the General Solubility
Equation (GSE),” J. Chem. Inf. Comput. Sci., vol. 41, no. 2, pp. 354–357, Mar. 2001.
[57] J. S. Delaney, “Predicting aqueous solubility from structure,” Drug Discov. Today, vol.
10, no. 4, pp. 289–295, 2005.
[58] A. Lusci, G. Pollastri, and P. Baldi, “Deep Architectures and Deep Learning in
Chemoinformatics: The Prediction of Aqueous Solubility for Drug-Like Molecules,” J.
Chem. Inf. Model., vol. 53, no. 7, pp. 1563–1575, Jul. 2013.
[59] A. Jouyban, “Review of the cosolvency models for predicting solubility of drugs in
water-cosolvent mixtures,” J. Pharm. Pharm. Sci., vol. 11, no. 1, pp. 32–58, 2008.
[60] A. Maitra and S. Bagchi, “Study of solute–solvent and solvent–solvent interactions in
pure and mixed binary solvents,” J. Mol. Liq., vol. 137, no. 1, pp. 131–137, 2008.
[61] S. H. Yalkowsky and T. J. Roseman, Techniques of solubilization of drugs. M. Dekker,
1981.
[62] W. E. Acree Jr, “Mathematical representation of thermodynamic properties: Part 2.
Derivation of the combined nearly ideal binary solvent (NIBS)/Redlich-Kister
mathematical representation from a two-body and three-body interactional mixing
model,” Thermochim. Acta, vol. 198, no. 1, pp. 71–79, 1992.
[63] A. Jouyban-Gharamaleki and J. Hanaee, “A novel method for improvement of
predictability of the CNIBS/R-K equation,” Int. J. Pharm., vol. 154, no. 2, pp. 245–247,
1997.
[64] C.-C. Chen and Y. Song, “Solubility Modeling with a Nonrandom Two-Liquid Segment
Activity Coefficient Model,” Ind. Eng. Chem. Res., vol. 43, no. 26, pp. 8354–8362, Dec.
2004.
[65] E. Mullins, Y. A. Liu, A. Ghaderi, and S. D. Fast, “Sigma Profile Database for Predicting
Solid Solubility in Pure and Mixed Solvent Mixtures for Organic Pharmacological
Compounds with COSMO-Based Thermodynamic Methods,” Ind. Eng. Chem. Res., vol.
47, no. 5, pp. 1707–1725, Mar. 2008.
[66] E. Sheikholeslamzadeh and S. Rohani, “Solubility Prediction of Pharmaceutical and
Chemical Compounds in Pure and Mixed Solvents Using Predictive Models,” Ind. Eng.
Chem. Res., vol. 51, no. 1, pp. 464–473, Jan. 2012.
[67] P. B. Kokitkar, E. Plocharczyk, and C.-C. Chen, “Modeling Drug Molecule Solubility

to Identify Optimal Solvent Systems for Crystallization,” Org. Process Res. Dev., vol.
12, no. 2, pp. 249–256, Mar. 2008.
[68] C.-C. Shu and S.-T. Lin, “Prediction of Drug Solubility in Mixed Solvent Systems Using
the COSMO-SAC Activity Coefficient Model,” Ind. Eng. Chem. Res., vol. 50, no. 1, pp.
142–147, Jan. 2011.
[69] M. Valavi, M. Svärd, and Å. C. Rasmuson, “Prediction of the Solubility of Medium-
Sized Pharmaceutical Compounds Using a Temperature-Dependent NRTL-SAC
Model,” Ind. Eng. Chem. Res., vol. 55, no. 42, pp. 11150–11159, Oct. 2016.
[70] A. Jouyban, N. Y. K. Chew, H.-K. Chan, M. Sabour, and W. E. Acree Jr, “A unified
cosolvency model for calculating solute solubility in mixed solvents,” Chem. Pharm.
Bull., vol. 53, no. 6, pp. 634–637, 2005.
[71] A. Jouyban, S. Soltanpour, S. Soltani, E. Tamizi, M. A. A. Fakhree, and W. E. Acree,
“Prediction of drug solubility in mixed solvents using computed Abraham parameters,”
J. Mol. Liq., vol. 146, no. 3, pp. 82–88, 2009.
[72] A. Jouyban and M. A. A. Fakhree, “Experimental and Computational Methods
Pertaining to Drug Solubility,” Rijeka: InTech, 2012, p. Ch. 9.
[73] A. R. Katritzky et al., “Quantitative Correlation of Physical and Chemical Properties
with Chemical Structure: Utility for Prediction,” Chem. Rev., vol. 110, no. 10, pp. 5714–
5789, Oct. 2010.
[74] K. Roy, S. Kar, and R. N. Das, A primer on QSAR/QSPR modeling: Fundamental
concepts. Springer, 2015.
[75] M. Goodarzi, B. Dejaegher, and Y. Vander Heyden, “Feature selection methods in
QSAR studies,” J. AOAC Int., vol. 95, no. 3, pp. 636–651, 2012.
[76] K. Roy, S. Kar, and R. N. Das, “QSAR/QSPR Modeling: Introduction BT - A Primer
on QSAR/QSPR Modeling: Fundamental Concepts,” K. Roy, S. Kar, and R. N. Das,
Eds. Cham: Springer International Publishing, 2015, pp. 1–36.
[77] S. Yousefinejad and B. Hemmateenejad, “Chemometrics tools in QSAR/QSPR studies:
A historical perspective,” Chemom. Intell. Lab. Syst., vol. 149, pp. 177–204, 2015.
[78] V. Kuppuraj and R. Rengaswamy, “Evaluation of prediction error based fuzzy model
clustering approaches for multiple model learning,” Int. J. Adv. Eng. Sci. Appl. Math.,
vol. 4, no. 1–2, pp. 10–21, 2012.
[79] A. A. Adeniran and S. El Ferik, “Modeling and Identification of Nonlinear Systems: A
Review of the Multimodel Approach;Part 1,” IEEE Trans. Syst. Man, Cybern. Syst., vol.
47, no. 7, pp. 1149–1159, 2017.
[80] S. El Ferik and A. A. Adeniran, “Modeling and Identification of Nonlinear Systems: A
Review of the Multimodel Approach;Part 2,” IEEE Trans. Syst. Man, Cybern. Syst., vol.
47, no. 7, pp. 1160–1168, 2017.
[81] C. W. Yap, “PaDEL‐descriptor: An open source software to calculate molecular
descriptors and fingerprints,” J. Comput. Chem., vol. 32, no. 7, pp. 1466–1474, 2011.
[82] ChemAxon Ltd., “Instant JChem/MarvinSketch,” 2012.
[83] S. Chinta, A. Sivaram, and R. Rengaswamy, “Prediction error-based clustering approach
for multiple-model learning using statistical testing,” Eng. Appl. Artif. Intell., vol. 77,
pp. 125–135, 2019.

[84] A. Jouyban et al., “Solubility Prediction of Drugs in Mixed Solvents Using Partial
Solubility Parameters,” J. Pharm. Sci., vol. 100, no. 10, pp. 4368–4382, Oct. 2011.
[85] A. Jouyban, M.-R. Majidi, H. Jalilzadeh, and K. Asadpour-Zeynali, “Modeling drug
solubility in water–cosolvent mixtures using an artificial neural network,” Farm., vol.
59, no. 6, pp. 505–512, 2004.
[86] A. Jouyban, M. A. A. Fakhree, T. Ghafourian, A. A. Saei, and W. E. Acree, “Deviations
of drug solubility in water-cosolvent mixtures from the Jouyban-Acree model–effect of
solute structure,” Die Pharm. Int. J. Pharm. Sci., vol. 63, no. 2, pp. 113–121, 2008.
[87] R. Murray-Smith and T. A. Johansen, Eds., Multiple Model Approaches to Modelling and Control. London: Taylor and Francis, 1997.
[88] H. Frigui and R. Krishnapuram, “A robust competitive clustering algorithm with
applications in computer vision,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no.
5, pp. 450–465, 1999.
[89] G. Danuser and M. Stricker, “Parametric model fitting: From inlier characterization to
outlier detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 263–280,
1998.
[90] V. Cherkassky and Y. Ma, “Multiple Model Estimation: A New Formulation for
Predictive Learning,” under review, IEEE Trans. Neural Netw., 2002.
[91] A. N. Venkat and R. D. Gudi, “Fuzzy segregation-based identification and control of
nonlinear dynamic systems,” Ind. Eng. Chem. Res., vol. 41, no. 3, pp. 538–552, 2002.
[92] M. A. Henson and D. E. Seborg, “Nonlinear control strategies for continuous
fermenters,” Chem. Eng. Sci., vol. 47, no. 4, pp. 821–835, 1992.
[93] R. Pickhardt, “Adaptive control of a solar power plant using a multi-model,” IEE Proc.
- Control Theory Appl., vol. 147, no. 5, pp. 493–500, 2000.
[94] J. G. Balchen, D. Ljungquist, and S. Strand, “State—space predictive control,” Chem.
Eng. Sci., vol. 47, no. 4, pp. 787–807, 1992.
[95] W. S. DeSarbo, R. L. Oliver, and A. Rangaswamy, “A simulated annealing methodology
for clusterwise linear regression,” Psychometrika, vol. 54, no. 4, pp. 707–736, 1989.
[96] H. Spath, “Algorithm 39 Clusterwise linear regression,” Computing, vol. 22, no. 4, pp.
367–373, 1979.
[97] G. Ferrari-Trecate, M. Muselli, D. Liberati, and M. Morari, “A clustering technique for
the identification of piecewise affine systems,” Automatica, vol. 39, no. 2, pp. 205–217,
2003.
[98] H. Nakada, K. Takaba, and T. Katayama, “Identification of piecewise affine systems
based on statistical clustering technique,” Automatica, vol. 41, no. 5, pp. 905–913, 2005.
[99] W. S. DeSarbo and W. L. Cron, “A maximum likelihood methodology for clusterwise
linear regression,” J. Classif., vol. 5, no. 2, pp. 249–282, 1988.
[100] H. Spath, “A fast algorithm for clusterwise linear regression,” Computing, vol. 29, no.
2, pp. 175–181, 1982.
[101] C. Hennig, “Models and methods for clusterwise linear regression,” Classif. Inf. Age,
pp. 179–187, 1999.
[102] C. Hennig, “Identifiablity of models for clusterwise linear regression,” J. Classif., vol.

17, no. 2, pp. 273–296, 2000.
[103] M. Wedel and C. Kistemaker, “Consumer benefit segmentation using clusterwise linear
regression,” Int. J. Res. Mark., vol. 6, no. 1, pp. 45–59, 1989.
[104] C. Preda and G. Saporta, “Clusterwise PLS regression on a stochastic process,” Comput.
Stat. Data Anal., vol. 49, no. 1, pp. 99–108, 2005.
[105] V. Cherkassky and Y. Ma, “Multiple model regression estimation,” IEEE Trans. neural
networks, vol. 16, no. 4, pp. 785–798, 2005.
[106] J. C. Bezdek, C. Coray, R. Gunderson, and J. Watson, “Detection and characterization
of cluster substructure i. linear structure: Fuzzy c-lines,” SIAM J. Appl. Math., vol. 40,
no. 2, pp. 339–357, 1981.
[107] F. Dufrenois and D. Hamad, “Fuzzy weighted support vector regression for multiple
linear model estimation: application to object tracking in image sequences,” in Neural
Networks, 2007. IJCNN 2007. International Joint Conference on, 2007, pp. 1289–1294.
[108] N. Elfelly, J.-Y. Dieulot, M. Benrejeb, and P. Borne, “A new approach for multimodel
identification of complex systems based on both neural and fuzzy clustering
algorithms,” Eng. Appl. Artif. Intell., vol. 23, no. 7, pp. 1064–1071, 2010.
[109] B. Pourbabaee, N. Meskin, and K. Khorasani, “Multiple-model based sensor fault
diagnosis using hybrid kalman filter approach for nonlinear gas turbine engines,” in
2013 American Control Conference, 2013, pp. 4717–4723.
[110] J. Ragot, “Diagnosis and control using multiple models. Application to a biological
reactor,” 2011 Int. Symp. Adv. Control Ind. Process., pp. 22–29, 2011.
[111] S. Dasgupta, B. D. O. Anderson, and R. J. Kaye, “Identification of physical parameters
in structured systems,” Automatica, vol. 24, no. 2, pp. 217–225, 1988.
[112] S. Paoletti, A. L. Juloski, G. Ferrari-Trecate, and R. Vidal, “Identification of hybrid
systems a tutorial,” Eur. J. Control, vol. 13, no. 2–3, pp. 242–260, 2007.
[113] R. Vidal and B. D. O. Anderson, “Recursive identification of switched ARX hybrid
models: Exponential convergence and persistence of excitation,” in Decision and
Control, 2004. CDC. 43rd IEEE Conference on, 2004, vol. 1, pp. 32–37.
[114] R. Orjuela, B. Marx, J. Ragot, and D. Maquin, “Nonlinear system identification using heterogeneous multiple models,” Int. J. Appl. Math. Comput. Sci., vol. 23, no. 1, p. 103, 2013.
[115] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Cluster validity methods: part I,” ACM
Sigmod Rec., vol. 31, no. 2, pp. 40–45, 2002.
[116] E. Rendon et al., “A comparison of internal and external cluster validation indexes,” in
Proceedings of the 2011 American Conference, San Francisco, CA, USA, 2011, vol. 29.
[117] Y.-L. Wu, C.-Y. Tang, M.-K. Hor, and P.-F. Wu, “Feature selection using genetic
algorithm and cluster validation,” Expert Syst. Appl., vol. 38, no. 3, pp. 2727–2732,
2011.
[118] D. C. Montgomery and G. C. Runger, Applied statistics and probability for engineers.
John Wiley & Sons, 2010.
[119] M. E. Gegundez, J. Aroba, and J. M. Bravo, “Identification of piecewise affine systems
by means of fuzzy clustering and competitive learning,” Eng. Appl. Artif. Intell., vol. 21,
no. 8, pp. 1321–1329, 2008.

[120] A. Tsanas and A. Xifara, “Accurate quantitative estimation of energy performance of
residential buildings using statistical machine learning tools,” Energy Build., vol. 49, pp.
560–567, 2012.
[121] M. Nikravesh, A. E. Farell, and T. G. Stanford, “Control of nonisothermal CSTR with
time varying parameters via dynamic neural network control (DNNC),” Chem. Eng. J.,
vol. 76, no. 1, pp. 1–16, 2000.
[122] Z. Wang et al., “Clustering by Local Gravitation,” IEEE Trans. Cybern., vol. 48, no. 5,
pp. 1383–1396, 2018.
[123] T. Nguyen and R. F. Savinell, “Flow batteries,” Electrochem. Soc. Interface, vol. 19, no.
3, pp. 54–56, 2010.
[124] P. Leung, X. Li, C. P. De León, L. Berlouis, C. T. J. Low, and F. C. Walsh, “Progress
in redox flow batteries, remaining challenges and their applications in energy storage,”
Rsc Adv., vol. 2, no. 27, pp. 10125–10156, 2012.
[125] G. L. Soloveichik, “Flow Batteries: Current Status and Trends,” Chem. Rev., vol. 115,
no. 20, pp. 11533–11558, Oct. 2015.
[126] S. Er, C. Suh, M. P. Marshak, and A. Aspuru-Guzik, “Computational design of
molecules for an all-quinone redox flow battery,” Chem. Sci., vol. 6, no. 2, pp. 885–893,
2015.
[127] L. Constantinou and R. Gani, “New group contribution method for estimating properties
of pure compounds,” AIChE J., vol. 40, no. 10, pp. 1697–1710, 1994.
[128] C. Gao, R. Govind, and H. H. Tabak, “Application of the group contribution method for
predicting the toxicity of organic chemicals,” Environ. Toxicol. Chem., vol. 11, no. 5,
pp. 631–636, 1992.
[129] K. M. Klincewicz and R. C. Reid, “Estimation of critical properties with group
contribution methods,” AIChE J., vol. 30, no. 1, pp. 137–142, 1984.
[130] K. G. Joback and R. C. Reid, “Estimation of pure-component properties from group-
contributions,” Chem. Eng. Commun., vol. 57, no. 1–6, pp. 233–243, 1987.
[131] E. Conte, A. Martinho, H. A. Matos, and R. Gani, “Combined Group-Contribution and
Atom Connectivity Index-Based Methods for Estimation of Surface Tension and
Viscosity,” Ind. Eng. Chem. Res., vol. 47, no. 20, pp. 7940–7954, Oct. 2008.
[132] J. Marrero and R. Gani, “Group-contribution based estimation of pure component
properties,” Fluid Phase Equilib., vol. 183–184, pp. 183–208, 2001.
[133] J. Marrero and R. Gani, “Group-contribution-based estimation of octanol/water partition
coefficient and aqueous solubility,” Ind. Eng. Chem. Res., vol. 41, no. 25, pp. 6623–
6633, 2002.
[134] A. Correa, J. F. Comesaña, J. M. Correa, and A. M. Sereno, “Measurement and
prediction of water activity in electrolyte solutions by a modified ASOG group
contribution method,” Fluid Phase Equilib., vol. 129, no. 1, pp. 267–283, 1997.
[135] S. J. Patel, D. Ng, and M. S. Mannan, “QSPR Flash Point Prediction of Solvents Using
Topological Indices for Application in Computer Aided Molecular Design,” Ind. Eng.
Chem. Res., vol. 48, no. 15, pp. 7378–7387, Aug. 2009.
[136] A. R. Katritzky, Y. Wang, S. Sild, T. Tamm, and M. Karelson, “QSPR Studies on Vapor
Pressure, Aqueous Solubility, and the Prediction of Water−Air Partition Coefficients,”

J. Chem. Inf. Comput. Sci., vol. 38, no. 4, pp. 720–725, Jul. 1998.
[137] M. Muehlbacher, A. El Kerdawy, C. Kramer, B. Hudson, and T. Clark, “Conformation-
Dependent QSPR Models: logPOW,” J. Chem. Inf. Model., vol. 51, no. 9, pp. 2408–
2416, Sep. 2011.
[138] P. R. Duchowicz and E. A. Castro, “QSPR studies on aqueous solubilities of drug-like
compounds,” Int. J. Mol. Sci., vol. 10, no. 6, pp. 2558–2577, Jun. 2009.
[139] F. Luan, T. Wang, L. Tang, S. Zhang, and M. Cordeiro, “Estimation of the Toxicity of
Different Substituted Aromatic Compounds to the Aquatic Ciliate Tetrahymena
pyriformis by QSAR Approach.,” Molecules, vol. 23, no. 5, 2018.
[140] T. Miyao, H. Kaneko, and K. Funatsu, “Inverse QSPR/QSAR Analysis for Chemical
Structure Generation (from y to x),” J. Chem. Inf. Model., vol. 56, no. 2, pp. 286–299,
Feb. 2016.
[141] L. Xu and W.-J. Zhang, “Comparison of different methods for variable selection,” Anal.
Chim. Acta, vol. 446, no. 1, pp. 475–481, 2001.
[142] D. K. Agrafiotis and W. Cedeño, “Feature Selection for Structure−Activity Correlation
Using Binary Particle Swarms,” J. Med. Chem., vol. 45, no. 5, pp. 1098–1107, Feb.
2002.
[143] S. Yousefinejad, F. Honarasa, and H. Montaseri, “Linear solvent structure-polymer
solubility and solvation energy relationships to study conductive polymer/carbon
nanotube composite solutions,” RSC Adv., vol. 5, no. 53, pp. 42266–42275, 2015.
[144] B. Hemmateenejad, “Optimal QSAR analysis of the carcinogenic activity of drugs by
correlation ranking and genetic algorithm-based PCR,” J. Chemom., vol. 18, no. 11, pp.
475–485, Nov. 2004.
[145] D. J. Livingstone, D. T. Manallack, and I. V Tetko, “Data modelling with neural
networks: advantages and limitations,” J. Comput. Aided. Mol. Des., vol. 11, no. 2, pp.
135–142, 1997.
[146] S. Wang and M. Tanaka, “Nonlinear system identification with piecewise-linear
functions,” IFAC Proc. Vol., vol. 32, no. 2, pp. 3796–3801, 1999.
[147] S. Chinta and R. Rengaswamy, “Machine Learning Derived Quantitative Structure
Property Relationship (QSPR) to Predict Drug Solubility in Binary Solvent Systems,”
Ind. Eng. Chem. Res., vol. 58, no. 8, pp. 3082–3092, Feb. 2019.
[148] C. Liang and D. A. Gallagher, “QSPR Prediction of Vapor Pressure from Solely
Theoretically-Derived Descriptors,” J. Chem. Inf. Comput. Sci., vol. 38, no. 2, pp. 321–
324, Mar. 1998.
[149] G. R. Famini, C. A. Penski, and L. Y. Wilson, “Using theoretical descriptors in
quantitative structure activity relationships: Some physicochemical properties,” J. Phys.
Org. Chem., vol. 5, no. 7, pp. 395–408, Jul. 1992.
[150] M. Jalali-Heravi, M. Asadollahi-Baboli, and P. Shahbazikhah, “QSAR study of
heparanase inhibitors activity using artificial neural networks and Levenberg–Marquardt
algorithm,” Eur. J. Med. Chem., vol. 43, no. 3, pp. 548–556, 2008.
[151] M. Gevrey, I. Dimopoulos, and S. Lek, “Review and comparison of methods to study
the contribution of variables in artificial neural network models,” Ecol. Modell., vol.
160, no. 3, pp. 249–264, 2003.

[152] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp.
273–297, 1995.
[153] S. Amari and S. Wu, “Improving support vector machine classifiers by modifying kernel
functions,” Neural Networks, vol. 12, no. 6, pp. 783–789, 1999.
[154] D. Hunter, H. Yu, I. I. I. M. S. Pukish, J. Kolbusz, and B. M. Wilamowski, “Selection
of Proper Neural Network Sizes and Architectures—A Comparative Study,” IEEE
Trans. Ind. Informatics, vol. 8, no. 2, pp. 228–240, 2012.
[155] S. R. Safavian and D. Landgrebe, “A survey of decision tree classifier methodology,”
IEEE Trans. Syst. Man. Cybern., vol. 21, no. 3, pp. 660–674, 1991.
[156] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[157] J. Sklansky and L. Michelotti, “Locally trained piecewise linear classifiers,” IEEE
Trans. Pattern Anal. Mach. Intell., no. 2, pp. 101–111, 1980.
[158] D. R. Cox, “The regression analysis of binary sequences,” J. R. Stat. Soc. Ser. B, vol.
20, no. 2, pp. 215–232, 1958.
[159] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, “Supervised machine learning: A review
of classification techniques,” Emerg. Artif. Intell. Appl. Comput. Eng., vol. 160, pp. 3–
24, 2007.
[160] L. Nanni and A. Lumini, “An experimental comparison of ensemble of classifiers for
biometric data,” Neurocomputing, vol. 69, no. 13, pp. 1670–1673, 2006.
[161] L. Breiman, “Bagging predictors,” Mach. Learn., vol. 24, no. 2, pp. 123–140, 1996.
[162] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 8, pp. 832–844, 1998.
[163] L. Nanni and A. Lumini, “An experimental comparison of ensemble of classifiers for
bankruptcy prediction and credit scoring,” Expert Syst. Appl., vol. 36, no. 2, Part 2, pp.
3028–3033, 2009.
[164] A. Bharadwaj and S. Minz, “Hybrid Approach for Classification using Support Vector
Machine and Decision Tree,” in Int Conf Advances in Electronics, Electrical and
Computer Science Engineering (EEC 2012), 2012, pp. 337–341.
[165] M. Arun Kumar and M. Gopal, “A hybrid SVM based decision tree,” Pattern Recognit.,
vol. 43, no. 12, pp. 3977–3987, 2010.
[166] S. Chakrabartty, G. Singh, and G. Cauwenberghs, “Hybrid support vector
machine/hidden markov model approach for continuous speech recognition,” in
Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems (Cat. No.
CH37144), 2000, vol. 2, pp. 828–831.
[167] N. Zaini, M. A. Malek, M. Yusoff, N. H. Mardi, and S. Norhisham, “Daily River Flow
Forecasting with Hybrid Support Vector Machine – Particle Swarm Optimization,” IOP
Conf. Ser. Earth Environ. Sci., vol. 140, p. 12035, 2018.
[168] A. Ghodselahi, “A hybrid support vector machine ensemble model for credit scoring,”
Int. J. Comput. Appl., vol. 17, no. 5, pp. 1–5, 2011.
[169] A. Rahman and S. Tasnim, “Ensemble classifiers and their applications: a review,” arXiv
Prepr. arXiv1404.4088, 2014.
[170] A. Kostin, “A simple and fast multi-class piecewise linear pattern classifier,” Pattern

Recognit., vol. 39, no. 11, pp. 1949–1962, 2006.
[171] D. Webb, Efficient piecewise linear classifiers and applications. University of Ballarat,
2011.
[172] G. T. Herman and K. T. D. Yeung, “On piecewise-linear classification,” IEEE Trans.
Pattern Anal. Mach. Intell., no. 7, pp. 782–786, 1992.
[173] H. Tenmoto, M. Kudo, and M. Shimbo, “Piecewise linear classifiers with an appropriate
number of hyperplanes,” Pattern Recognit., vol. 31, no. 11, pp. 1627–1634, 1998.
[174] A. M. Bagirov, J. Ugon, and D. Webb, “An efficient algorithm for the incremental
construction of a piecewise linear classifier,” Inf. Syst., vol. 36, no. 4, pp. 782–790, 2011.
[175] A. Astorino and M. Gaudioso, “Polyhedral separability through successive LP,” J.
Optim. Theory Appl., vol. 112, no. 2, pp. 265–293, 2002.
[176] A. M. Bagirov, “Max–min separability,” Optim. Methods Softw., vol. 20, no. 2–3, pp.
277–296, 2005.
[177] X. Huang, S. Mehrkanoon, and J. A. K. Suykens, “Support vector machines with
piecewise linear feature mapping,” Neurocomputing, vol. 117, pp. 118–127, 2013.
[178] J. A. K. Suykens and J. Vandewalle, “Least Squares Support Vector Machine
Classifiers,” Neural Process. Lett., vol. 9, no. 3, pp. 293–300, 1999.
[179] G. Ou and Y. L. Murphey, “Multi-class pattern classification using neural networks,”
Pattern Recognit., vol. 40, no. 1, pp. 4–18, 2007.
[180] A. Asuncion and D. Newman, “UCI machine learning repository.” 2007.

LIST OF PAPERS BASED ON THESIS

1. Sivadurgaprasad Chinta and Raghunathan Rengaswamy, Machine Learning Derived


Quantitative Structure Property Relationship (QSPR) to Predict Drug Solubility in
Binary Solvent Systems. Ind. Eng. Chem. Res. 2019, 58 (8), 3082–3092.

2. Sivadurgaprasad Chinta, Abhishek Sivaram and Raghunathan Rengaswamy, Prediction


Error-Based Clustering Approach for Multiple-Model Learning Using Statistical
Testing. Eng. Appl. Artif. Intell. 2019, 77, 125–135.

3. Deepak Maurya, Sivadurgaprasad Chinta, Abhishek Sivaram and Raghunathan


Rengaswamy, Incorporating prior knowledge about structural constraints in model
identification. Ind. Eng. Chem. Res. Under review.

4. Sivadurgaprasad Chinta and Raghunathan Rengaswamy, Machine learning based


QSPR approaches to predict solvation free energy of Quinone molecules for flow
battery applications. Manuscript under preparation.

5. Sivadurgaprasad Chinta and Raghunathan Rengaswamy, An adaptive prediction error


based multiple model SVM classifier. Manuscript under preparation.

