
INTEGRATION OF MACHINE LEARNING AND DOMAIN

KNOWLEDGE FOR ENGINEERING APPLICATIONS

A THESIS

Submitted by

CHINTA SIVADURGAPRASAD

for the award of the degree

of

DOCTOR OF PHILOSOPHY

DEPARTMENT OF CHEMICAL ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY MADRAS

CHENNAI-600036, INDIA

MAY 2019
THESIS CERTIFICATE

This is to certify that the thesis titled INTEGRATION OF MACHINE LEARNING AND

DOMAIN KNOWLEDGE FOR ENGINEERING APPLICATIONS submitted by Chinta

Sivadurgaprasad, to the Indian Institute of Technology Madras, for the award of the degree

of Doctor of Philosophy, is a bona fide record of the research work done by him under my

supervision. The contents of this thesis, in full or in parts, have not been submitted to any other

Institute or University for the award of any degree or diploma.

Prof. Raghunathan Rengaswamy Prof. Sridharakumar Narasimhan


Research Guide Research Guide
Professor Professor
Dept. of Chemical Engineering Dept. of Chemical Engineering
IIT-Madras, 600 036 IIT-Madras, 600 036

Place: Chennai

Date:
This thesis is dedicated to

My parents (Chinta Vijayalakshmi and Chinta Thavitinaidu), my family members, my Ph.D.

advisor Prof. Raghunathan Rengaswamy, almighty Lord Shiva, my best friend Krishnaveni

HM and the people who triggered and encouraged my interest in numbers and math

throughout my life.

ACKNOWLEDGEMENTS

I am extremely thankful to have worked under the guidance of Prof. Raghunathan Rengaswamy. I wish to know the secrets behind his time management and his blissful smile even in hard situations. Every time I knocked on his cabin door with a problem, be it technical or personal, I came out with a solution without fail. If God gives me a chance, I would like to do another Ph.D. under his guidance. The crisp and critical suggestions that he provided were very useful for the completion of this work. I am indebted to him for the moral support and technical guidance he provided throughout my tenure. I also express my sincere thanks to my co-guide Prof. Sridharakumar Narasimhan, and my doctoral committee members, Prof. M. V. Sangaranarayanan, Prof. Preeti Aghalayam, Prof. S. Pushpavanam, Prof. R. Nagarajan and Prof. A. Kannan, for their valuable suggestions.

I am blessed to have Dr Hemanth, Dr Danny, Dr Srinivasan, Dr Reshmi and Mr Maikandan as my seniors, whose critical comments have sharpened my problem-solving skills in research. Especially, the moral and technical support from Dr Hemanth is unforgettable. I am grateful to Mr Suseendiran and Dr Deepa for the long technical and personal discussions we had together, which helped me shape my thesis and my attitude towards life. I also extend my gratitude to Mr Abhishek, Mr Deepak, Mr Arun, Mr Faheem, Mr Sathish, Mr Venkataraman, Dr Amit and other members of the SENAI research group for the fun we had in group meetings. I am very thankful to my friends Mr Yerrayya, Mr Sridhar, Mr Vinayakram, Mr Santhan, Mr Ravi, Mr Prasanth, Mr Eswar, Mr Siva, Mr Raju, Mr Moulish, Mrs Neha Aravind, Miss Madhu, Miss Priyanka, Miss Snehal, Dr Surya and Dr Sathyam Naidu, whose presence made campus life cherished and memorable. I am indebted to my best friend Krishnaveni HM for her constant motivation and support during this journey.

I also want to thank my M.Tech supervisor Dr Prakash Kotecha for the enthusiasm he created in me towards research and optimization. I also want to thank Dr Lakshmi, Mr Ganesh, Mr Sam Mathew and the training team of Gyan Data Pvt Ltd for providing a great corporate experience during my internship. I thank Mrs. Shashikala, Mrs. Saraswathi and the other office staff of the Department of Chemical Engineering, IIT Madras for their help in all the office-related work. I also thank the management of the Robert Bosch Centre for Data Science and Artificial Intelligence for providing computational facilities, and its members for sharing their experiences of machine learning applications in their respective fields.

Last but not least, I want to thank my family members for their moral support and their belief that I can do something great in life. I also want to thank my cousins Mr Ajay, Mr Hari and Mr Narendra for being there with me through thick and thin.

ABSTRACT

Model identification is crucial in chemical process industries for various applications such as process monitoring, control, etc. Over the past few decades, machine learning algorithms have been of interest for modeling due to their ability to identify complex behavior and their computational tractability. Most of these algorithms are purely data-driven, thus raising questions about their physical interpretability. Though knowledge-based or first principles models provide good interpretability of the process, formulating and solving such models is time-consuming. In this thesis, we propose different ways of integrating machine learning techniques and domain knowledge to harness the advantages of both modeling approaches, such as ease of modeling and good physical interpretability. One such framework incorporates sparsity information about the underlying functional relationships in a principal component analysis framework for linear model identification. The use of existing first principles models with machine learning approaches for structure-property predictions is also explored. The proposed frameworks demonstrate that the performance of existing approaches can be improved and that the applicability of existing models can be extended to a broad range of systems. Two new machine learning algorithms are also proposed in this thesis: one for regression and one for classification, both using a prediction error based fuzzy clustering approach to identify non-linear functional behavior or the boundary between classes.

KEYWORDS: Machine learning, domain knowledge, hybrid modeling, CSPCA, multiple

model learning, piecewise SVM, drug solubility

TABLE OF CONTENTS

ACKNOWLEDGEMENTS ........................................................................................................................ ii
ABSTRACT ................................................................................................................................................ iv
LIST OF TABLES .................................................................................................................................... viii
LIST OF FIGURES .................................................................................................................................... ix

1. Introduction ...................................................................................................................... 1

1.1 Motivation .............................................................................................................. 2


1.2 Thesis contents ....................................................................................................... 4
1.3 Organization of the dissertation ............................................................................. 5

2. Integration of process information in the PCA framework ......................................... 6

2.1 Literature survey .................................................................................................... 6


2.2 Mathematical foundations ...................................................................................... 8
2.3 PCA for Model Identification ................................................................................ 9
2.4 Model Identification with partially known constraint matrix (cPCA) ................. 11
2.4.1 Model Identification when a subset of linear relations is known ...................... 16
2.5 Model Identification with known model structure (sPCA).................................. 18
2.5.1 Case study 1 ......................................................................................................... 26
2.5.2 Case study 2 ......................................................................................................... 27
2.5.3 Case study 3 ......................................................................................................... 29
2.6 Constraint Structural PCA ................................................................................... 31
2.6.1 ECC Case study ................................................................................................... 33
2.7 Conclusion ........................................................................................................... 36

3. Generalization of first principles derived model using machine learning approaches


to predict drug solubility in binary systems ................................................................ 38

3.1 Literature survey .................................................................................................. 38


3.2 Data preparation and processing .......................................................................... 42
3.3 Feature selection .................................................................................................. 45
3.4 Single model approximations .............................................................................. 48
3.5 Multiple model approximation ............................................................................ 50
3.6 Conclusion ........................................................................................................... 64

4. Prediction error based fuzzy clustering approach using statistical analysis for
piecewise linear model identification............................................................................ 66

4.1 Literature review: ................................................................................................. 68


4.2 PE based fuzzy clustering with statistical significance testing ............................ 71
4.3 Efficacy of proposed approach to estimate static multiple linear regression
(SMLR) models ............................................................................................................... 76
4.3.1 SMLR example 1 ................................................................................................. 77
4.3.2 SMLR example 2 ................................................................................................. 78
4.3.3 SMLR example 3 ................................................................................................. 78
4.3.4 SMLR example 4 ................................................................................................. 83
4.4 Efficacy of proposed approach to identify PWARX models ............................... 84
4.4.1 PWARX example 1.............................................................................................. 85
4.4.2 PWARX example 2.............................................................................................. 85
4.4.3 PWARX example with non-linear dynamics ....................................................... 86
4.5 Efficacy of the proposed approach on two real-life case studies ......................... 88
4.5.1 Identification of energy performance of residential buildings ............................. 88
4.5.2 Identification of non-isothermal CSTR model dynamics for control .................. 89
4.6 Conclusion ........................................................................................................... 96

5. Prediction of solvation free energy of Quinone derivatives using machine learning


approaches in a QSPR framework ............................................................................... 97

5.1 Literature survey .................................................................................................. 97


5.2 Group contribution approach ............................................................................. 101
5.3 QSPR based approaches .................................................................................... 104
5.3.1 Single linear model based QSPR ....................................................................... 106
5.3.2 Neural network based QSPR ............................................................................. 107
5.3.3 Multiple model based QSPR.............................................................................. 108
5.4 Conclusion ......................................................................................................... 115

6. An adaptive prediction error based multiple model SVM classifier for binary
classification problems ................................................................................................. 116

6.1 Literature survey ................................................................................................ 116


6.2 Support vector machines .................................................................................... 121
6.3 Multiple model based SVM ............................................................................... 122

6.4 Evaluation of proposed binary classifier............................................................ 125
6.4.1 Synthetic case studies ........................................................................................ 125
6.4.1.1 Case study 1 (second-order polynomial) ................................................... 126
6.4.1.2 Case study 2 (third-order polynomial) ....................................................... 128
6.4.2 Real-life case studies.......................................................................................... 128
6.5 Conclusions ........................................................................................................ 129

7. Conclusions ................................................................................................................... 131

7.1 Incorporation of process information in the PCA framework ........................... 131


7.2 Prediction of drug solubility in binary solvent systems ..................................... 131
7.3 Prediction error based clustering approach with statistical analysis .................. 133
7.4 Prediction of solvation free energy of Quinone derivatives .............................. 133
7.5 Piecewise linear SVM: ....................................................................................... 134
7.6 Future scope ....................................................................................................... 136
REFERENCES ........................................................................................................................................ 137

LIST OF TABLES

Table 2.1 Understanding steps 1-4 of the CSPCA algorithm for case study 2.5.3 .................. 32

Table 3.1 Details of feature selection process using GA ......................................................... 47

Table 3.2 Various efficacy metrics of obtained models using both single model approaches 59

Table 3.3 Various efficacy metrics of multiple models obtained using the modified PE
approach ................................................................................................................................... 59

Table 3.4 MPD metrics of various water + cosolvent systems using several approaches ....... 62

Table 4.1 Original and converged model details of SMLR example 1 ................................... 79

Table 4.2 Original and converged model details of SMLR example 2 ................................... 80

Table 4.3 Model information of SMLR example 3 ................................................................. 80

Table 4.4 Converged model details of SMLR example 3 without statistical analysis ............ 81

Table 4.5 Converged model details of SMLR example 3 ........................................................ 82

Table 4.6 Original and converged models of SMLR example 4 using both FMC algorithms 84

Table 4.7 Original and converged models of PWARX example 1 and 2 using both FMC
algorithms ................................................................................................................................ 87

Table 4.8 Information of original and converged non-linear PWARX models for training data
set ............................................................................................................................................. 91

Table 4.9 Information of converged models and corresponding metrics for prediction accuracy
.................................................................................................................................................. 91

Table 4.10 Details of prediction accuracy of different model identification approaches ........ 95

Table 5.1 All 41 groups that are considered for the case study along with contributions ..... 103

Table 5.2 Features ranking obtained using a stepwise approach for NN-QSPR ................... 108

Table 5.3 Details of the final set of multiple models converged .......................................... 113

Table 5.4 Performance metrics of several approaches for solvation free energy estimation 114

Table 6.1 Accuracy of various approaches on test data set of real-life data sets ................... 129

LIST OF FIGURES

Figure 1.1 Various ways of integrating machine learning and domain knowledge................... 3

Figure 2.1 Flow mixing case study .......................................................................................... 11

Figure 2.2 Euclidean norm of residuals using both approaches .............................................. 19

Figure 2.3 Comparison of model estimates by sPCA and PCA at different SNRs ................. 27

Figure 2.4 Flow network of steam metering process for methanol synthesis plant ................... 28

Figure 2.5 Comparison of model estimates of steam metering process at different SNRs ....... 28

Figure 2.6 Comparison of model estimates at different SNRs for case study 3 ...................... 30

Figure 2.7 Comparison of PCA variants performance at different SNRs for case study 3 ..... 33

Figure 2.8 Flow network of simplified ECC benchmark case study ....................................... 34

Figure 2.9 Comparison of PCA variants performance at different SNRs for ECC case study 35

Figure 2.10 Comparison of PCA variants performance for fault detection ............................. 36

Figure 3.1 Generation-wise best and average fitness values for all the folds .......................... 47

Figure 3.2 Parity plots for general and log solubility predictions using both single model
approaches................................................................................................................................ 49

Figure 3.3 Multiple linear models underlying in different input partitions of data ................. 50

Figure 3.4 Prediction error based Knn strategy to identify a suitable model for a test molecule
.................................................................................................................................................. 55

Figure 3.5 Modified PE based clustering algorithm for drug solubility predictions ............... 60

Figure 3.6 MPD values of all 63 binary systems obtained using multiple models approach .. 61

Figure 3.7 Solubility profiles of two distinct binary systems at various temperatures ............ 63

Figure 4.1 Multiple Model Learning Problem Classification .................................................. 67

Figure 4.2 Flow chart of PE based clustering using variable significance testing for MML .. 75

Figure 4.3 Partition of input data for SMLR example 3 (* - C1, o - C2, - C3, + - C4) ......... 82

Figure 4.4 Simulated data - (a) 1000 data samples (training and testing) (b) Global test set .. 93

Figure 4.5 Plots of $y_k$ vs $y_{k-1}$ signifying 3 inherent clusters in simulated data for both outputs ......................................................................................................................

Figure 4.6 Residual error of the global test set using different approaches .............................. 94

Figure 5.1 Group contribution approach framework ............................................................. 102

Figure 5.2 QSPR framework to estimate the property of interest of organic molecules ....... 105

Figure 5.3 Solvation free energy original vs predicted using several approaches ................. 114

Figure 6.1 Some of the widely used machine learning techniques for classification problems
................................................................................................................................................ 120

Figure 6.2 Data distribution and averaged accuracy versus number of models for case study 1
................................................................................................................................................ 127

Figure 6.3 Data distribution and averaged accuracy versus number of models for case study 2
................................................................................................................................................ 130

Figure 7.1 Generalization of first principles model using machine learning approaches to
predict drug solubility ............................................................................................................ 132

Figure 7.2 Various structure-property relationship based frameworks to estimate the properties
of chemical compounds ......................................................................................................... 135

CHAPTER 1

Introduction

In this era of big data, it is anticipated that the performance of process industries can be

considerably improved using advanced machine learning (ML) algorithms. Machine learning

algorithms are also gaining interest in a wide variety of fields such as data compression,

scheduling of operations [1], energy management [2], material informatics [3], manufacturing

industry [4], financial forecasting [5], drug discovery [6], genetics and genomics [7] etc. The

growth of machine learning usage in various fields can be attributed to the amount of data that

is available in the respective fields and the ability of machine learning techniques to identify

complex behavior within a reasonable time. Machine learning techniques can be broadly

categorized into solving two types of problems, i.e. regression and classification. Machine

learning techniques for regression identify mathematical relationships between input features

and outputs, which are, in general, continuous variables such as the temperature of a system or the price of a commodity. Multivariate linear regression, polynomial regression, piecewise

linear regression, and neural networks are some of the machine learning techniques for

regression. Machine learning techniques for classification identify mathematical relationships

between input features and outputs, which are, in general, categorical or ordinal variables such as the gender of a person, a color, or the type of fault identified in a system. Logistic regression,

support vector machines, neural networks, and random forests are some of the machine learning

techniques for classification.

1.1 Motivation

While the use of ML is increasing, it is argued in scientific communities that ML approaches such as neural networks might not provide an interpretable physical representation of the process. It is also highlighted that data-driven models are valid only in the range of the input feature space of the collected data; extrapolation using these models may result in inaccurate predictions [8], [9]. At the same time, though first principles (FP) models have good physical interpretation, developing such models is challenging and time-consuming, and the required process knowledge may not be available a priori for most complex processes [8], [9]. Thus,

the integration of both first principles and data-based modeling can yield benefits such as ease

of modeling and improved predictions [8], [9]. Frameworks that combine both first principles

and data-driven models are referred to as hybrid modeling in process modeling communities.

One of the earliest frameworks to incorporate first principles knowledge in the form of

constraints into machine learning algorithms was proposed by Joerding and Meador [10].

Psichogios and Ungar [11] proposed a hybrid framework to obtain model parameters of a first

principles derived model using neural networks for dynamic modeling of a fed-batch reactor.

Su et al. [12] proposed a modeling framework to integrate first principles and machine learning

approaches in which the residual errors of the first principles model are predicted using neural

networks. They [12] highlighted that such integration will be beneficial provided the performance of the first principles model alone is not adequate and the residuals contain some information about the process that can be captured using machine learning techniques. Milanic et al. [13]

proposed an approach to incorporate domain knowledge in artificial neural networks for

optimizing the quality of a hydrolysis batch process, where first principles models are used to

obtain enormous augmented data and the model structure whereas neural networks are used to

obtain the model parameters. Kahrs and Marquardt [14] proposed two complementary criteria

to validate the applicability domain of hybrid models for process optimization. Any hybrid

model that satisfies these criteria is assumed to be robust over the whole operating regime. von Stosch et al. [9] reviewed the most commonly used hybrid modeling frameworks and their applications in process industries for several objectives such as monitoring, optimization, and control.

Figure 1.1 Various ways of integrating machine learning and domain knowledge

In this thesis, we categorize integration of machine learning techniques and first principles

models as depicted in Figure 1.1. The integration can be broadly classified into three major sub-categories, i.e., a major contribution from ML and a minor contribution from FP, a major contribution from FP and a minor contribution from ML, and equal contributions from both.

The first category exists in two forms, i.e., obtaining ML models from FP-derived inputs [13] and obtaining ML models with domain constraints from FP [10]. The second category also exists in two forms, i.e., obtaining model parameters using ML in FP-derived models [11] and modeling the residuals of FP models using ML [12]. The final category of approaches formulates hybrid models specific to domain applications, where the integration is a lot more

coupled and application specific. In this thesis, we use some of these integration concepts for

solving real-life engineering problems by combining domain knowledge and machine learning

techniques. It should be noted that care must be taken while building such integrated models so that the disadvantages of the individual techniques are minimized in the hybrid model. We use principal component analysis (PCA) and multiple model learning

(MML) as ML tools.
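As a concrete illustration of the second category, the Python sketch below (assuming scikit-learn is available; fp_model and all other names are hypothetical) trains a neural network on the residuals of a first principles model, in the spirit of the residual-modeling scheme of [12]:

    from sklearn.neural_network import MLPRegressor

    def hybrid_predict(fp_model, X_train, y_train, X_new):
        """Serial FP + ML hybrid (a sketch, not the scheme of [12] verbatim):
        a neural network is trained on the residuals of a first principles (FP)
        model and used as a data-driven corrector. `fp_model` is any callable
        implementing the FP prediction (a hypothetical stand-in)."""
        residuals = y_train - fp_model(X_train)            # part of y the FP model misses
        corrector = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
        corrector.fit(X_train, residuals)                  # NN learns the residual structure
        return fp_model(X_new) + corrector.predict(X_new)  # FP prediction + NN correction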

1.2 Thesis contents

• Incorporation of process information, such as a subset of the true constraints or sparsity information of the constraint matrix, in the PCA framework for model identification.

• Identification of model parameters in a first principles derived model to estimate drug solubility in binary solvent systems using machine learning approaches.

• A modified clustering approach that can perform feature selection and model identification together in a piecewise linear modeling framework.

• Machine learning models using first principles derived descriptors, i.e., inputs, to predict the solvation free energy of Quinone derivatives for flow battery applications.

• A piecewise SVM approach for binary classification.

1.3 Organization of the dissertation

In this dissertation, initially, a brief introduction to various frameworks to integrate domain

knowledge with machine learning techniques is provided in chapter 1. In chapter 2, the

structural information i.e. active variables in each linear relation are incorporated in the PCA

framework to get better model estimates. In chapter 3, the efficacy of multiple model learning

approach is examined in a quantitative structure property relationship (QSPR) framework to

predict drug solubility in binary solvent systems. In this work, a modified version of Jouyban-

Acree model [15] is generalized to predict the solubility of a drug in a given binary solvent

system at a given temperature, in which model parameters are estimated as functions of

structural descriptors/features. In this case study, input features and the model form are

obtained from domain knowledge, and a genetic algorithm is used to identify the significant input features. It is observed in the feature selection phase that some of the input features

are insignificant in particular regions of feature space. In chapter 4, to obtain significant input

features and corresponding model parameters in each partition together, without any

assumptions, a statistical testing based prediction error (PE) clustering approach is proposed.

In chapter 5, the proposed approach is used to predict the solvation free energy of Quinone

molecules as a function of first principles derived input features. In chapter 6, inspired by the

performance of PE based approaches for regression problems, the idea of prediction error and

fuzzy membership is further extended to identify the non-linear boundary between the classes

using piecewise linear SVM models. Finally, in chapter 7, conclusions and possible future

directions are provided. A detailed problem specific literature review is provided in chapters 2

to 6 followed by the respective problem statements.

CHAPTER 2

Integration of process information in the PCA framework

Model identification is crucial in process industries for various applications such as process

automation, controller implementation, etc. Neural networks [16], [17], multiple models [18],

[19], principal component analysis [20] are some of the widely used techniques for model

identification in process industries. In most chemical processes, linear models suffice due to

the linearity of the process around steady-state operating conditions and ease of

implementation. Principal component analysis (PCA) is a popular machine learning approach

widely used for dimensionality reduction and data reconciliation in scientific communities.

PCA is also used in chemical industries for process monitoring [21], [22], and fault detection and

diagnosis [20], [23]. In chemical industries, it is possible to obtain partial information about

the process states. Information about a subset of model equations or sparsity of the model

structure can be obtained in the form of process flow-sheets and heuristics. In order to get better

estimates of the process model, it is desired to incorporate this useful knowledge in the model

identification exercise.

2.1 Literature survey

Common model identification techniques lack the freedom to incorporate partial process

knowledge. Principal component analysis (PCA), one of the most widely used methods for

linear steady-state model identification, in its vanilla form does not provide the freedom to

incorporate information about model sparsity. Sparse PCA [24], though it provides a sparse representation of the data, does not inherently incorporate such prior information. It is primarily used

to find sparse representations of high dimensional datasets [25], [26]. Similarly, conventional methods do not offer a formulation that incorporates knowledge in the form of a subset of the model equations governing the system dynamics.

PCA projects a dataset onto a lower dimensional subspace, preserving the maximum variations in the dataset [27] and excluding the minimal variations, which are characterized as noise. The directions of maximum variability, called principal components (PCs), are used to obtain "useful" variations in the data, making PCA a popular denoising technique [28], [29]. The

directions of minimum variability can be used as directions orthogonal to the dataset, and thus

can be used to obtain a set of model equations for a linear process generating the dataset [27],

[30], [31]. Another approach working along similar lines is network component analysis

(NCA) [32]. NCA tries to utilize information pertaining to the network structure for model identification. Similar approaches that utilize prior knowledge about the system can be seen in various domains of engineering. A few of the closely related approaches are robust PCA and its variants [33]–[36], and sparse PCA and its variants [37], [38]. Most of these approaches have to sacrifice the simplicity of the PCA formulation to incorporate the essential system information.

In this chapter, we propose algorithms for estimating the entire model, given partial process knowledge in two particular forms: (i) cases where a subset of the model equations is known, and (ii) cases where the sparse elements of the model equations are known. As an exemplar, we use

the novel PCA formulation with minimal changes to incorporate the partial information

available for the system. For this purpose, PCA is coupled with variable sub-selection

procedures and is reported to give better estimates of the process model. The rest of the chapter

is organized as follows. Sections 2.2 and 2.3 cover the mathematical foundations and

formulation of PCA required to follow the proposed approaches. Section 2.4 discusses the case of utilizing the information about a known subset of model equations. The proposed algorithm is termed constrained PCA (cPCA). We discuss the algorithm to incorporate the sparsity

information in section 2.5, which is termed structural PCA (sPCA). In section 2.6, we

combine the proposed approaches for better estimates in a case of similar structural information

as the previous sections. Finally, we conclude the chapter by highlighting major insights from

the performance of proposed algorithms.

2.2 Mathematical foundations

We start the discussion on the model identification problem for noise-free data. PCA is one

of the most widely accepted approaches for this purpose [27], [39] but our intention lies in

presenting a novel perspective of PCA which is understated in the literature. It will also help

to develop motivation for the proposed method in the next section. Let 𝑥(𝑡) be a 𝑛 × 1 vector

consisting measurements of 𝑛 variables at time instant 𝑡. It is assumed that these 𝑛 variables

are related by 𝑚 linear equations at all time instants. This may be formally stated as

$A_0\, x(t) = 0_{m \times 1} \quad \forall t$    (1.1)

where $A_0 \in \mathbb{R}^{m \times n}$ is a time-invariant constraint matrix. In this chapter, A (the constraint matrix) is interchangeably referred to as the model. At each time instant, the measurement y(t) of all n variables is assumed to be corrupted by noise.

y t   x t   e t  t (1.2)

The following assumptions are made on the random errors:

1. $e(t) \sim \mathcal{N}(0, \sigma^2 I)$
2. $E\left[e(j)\, e^T(k)\right] = \sigma^2 \delta_{j,k}\, I_{n \times n}$
3. $E\left[x(j)\, e^T(k)\right] = 0 \quad \forall j, k$

Where E . is the usual expectation operator and e  t  is a vector of white-noise errors, with all

elements having identical variance  2 as stated above. We introduce the collection of 𝑁 such

noisy measurements as follows

$Y = \begin{bmatrix} y(0) & y(1) & \cdots & y(N-1) \end{bmatrix}^T$    (1.3)

$X = \begin{bmatrix} x(0) & x(1) & \cdots & x(N-1) \end{bmatrix}^T$    (1.4)

Given 𝑁 noisy measurements of 𝑛 variables, the objective of the PCA algorithm is to estimate

the constraint model 𝐴0 in Equation(1.1). In the next section, we formally describe theoretically

relevant aspects of PCA and subsequently pursue our problem of interest.

2.3 PCA for Model Identification

PCA or total least squares method can be formulated as an optimization problem described

below to obtain model parameters.

$\min_{A,\, x(i)} \; \sum_{i=1}^{N} \left(y(i) - x(i)\right)^T \left(y(i) - x(i)\right)$    (1.5)

subject to: $\; A A^T = I_{m \times m}; \quad A\, x(i) = 0_{m \times 1}, \; i = 1, \ldots, N$    (1.6)

where A is referred to as the model. It is well known that the PCA algorithm utilizes eigenvalue analysis, or equivalently the singular value decomposition (SVD), to solve the above optimization problem [27], [39]. So, we briefly discuss the use of the eigenvalue decomposition for deriving the model parameters.

1 nn
The sample covariance matrix of 𝑌 is defined as Sy  YY T Sy  (1.7)
N

The eigenvalue decomposition of the sample covariance matrix 𝑆𝑦 is stated as follows:


nn n n
S yU  U  U ,  (1.8)

Where  is a diagonal matrix containing the eigenvalues and 𝑈 consists of the eigenvectors

corresponding to those eigenvalues. If the noise-free measurements (𝑋 in Equation(1.4)) are

accessible, the constraint model can be derived from the eigenvectors corresponding to zero

eigenvalues. This can be intuitively seen by eigenvalue analysis for the covariance matrix of

9
noise-free measurements[39].

$S_x U^* = U^* \Lambda^*, \quad S_x = \frac{1}{N} X X^T$    (1.9)

$S_x U_0^* = U_0^*\, 0_{m \times m} = 0_{n \times m}; \quad U_0^* \in \mathbb{R}^{n \times m}$    (1.10)

$A_0 = \left(U_0^*\right)^T$    (1.11)

where the columns of $U_0^*$ contain the eigenvectors corresponding to zero eigenvalues. For the noisy measurements in Equation (1.7), the eigenvectors corresponding to "small" eigenvalues are chosen. For the homoscedastic case, it can be proved that a few of the "small" eigenvalues are asymptotically equal to each other and provide an estimate of the noise variance in each of the n variables. It should be noted that PCA provides a set of orthogonal eigenvectors which form a basis for the constraint matrix.
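As a minimal NumPy sketch of this recipe (synthetic data; all variable names are our own), one can generate data satisfying a random constraint matrix, corrupt it with noise, and recover an estimate of A_0 from the eigenvectors of S_y associated with the m smallest eigenvalues, per Equations (1.7), (1.8) and (1.11):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, N = 5, 3, 1000
    A0 = rng.standard_normal((m, n))             # true constraint matrix (m relations)
    basis = np.linalg.svd(A0)[2][m:].T           # n x (n-m) orthonormal basis of null(A0)
    X = basis @ rng.standard_normal((n - m, N))  # noise-free data: A0 @ X = 0 exactly
    Y = X + 0.1 * rng.standard_normal((n, N))    # noisy measurements, Equation (1.2)

    Sy = (Y @ Y.T) / N                           # sample covariance, Equation (1.7)
    eigvals, U = np.linalg.eigh(Sy)              # eigenvalues returned in ascending order
    A_hat = U[:, :m].T                           # m smallest-eigenvalue eigenvectors
    print(np.linalg.norm(A_hat @ X))             # near zero: estimated relations annihilate X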

It can be easily proved that PCA provides the total least squares (TLS) solution [27], but it does not grant the freedom to include any available knowledge of the process in its formulation. PCA derives the optimal decomposition based on statistical assumptions without incorporating any process information. Ignoring the underlying network structure leads to the minimum cost function value of PCA in Equation (1.5) but may drive us away from the true process. On the other hand, reformulating the optimization problem with the inclusion of a priori knowledge as constraints will lead us to a solution closer to the true process. A similar approach is adopted in sparse PCA [24], dictionary learning [37], and regularization approaches [40], [41] to derive estimates of improved quality.

In this section, we briefly discussed PCA and acquired the required background to

understand the proposed algorithms in later sections. In the next section, we discuss the

approach to utilize the information about a set of linear relations to derive the full constraint

matrix/model.

2.4 Model Identification with partially known constraint matrix (cPCA)

In this section, we assume the availability of a few linear relationships among 𝑛 variables.

Basically, it is presumed that few rows of the constraint matrix, 𝐴0 in Equation (1.1) are

available. It should be noted that all the linear relationships are not assumed to be known but

instead, only a few of them are available. We propose an algorithm termed as constrained

principal component analysis (cPCA) to utilize the partially known information of constraint

matrix. A simple case-study is considered to illustrate the key idea and assumptions. The

optimization problem for the partially known constraint matrix can be formally stated below.

$\min_{A,\, x(i)} \; \sum_{i=1}^{N} \left(y(i) - x(i)\right)^T \left(y(i) - x(i)\right)$    (1.12)

subject to: $\; A A^T = I_{l \times l}; \quad A_f\, x(i) = 0_{m \times 1}, \; i = 1, \ldots, N$    (1.13)

where $A_f = \begin{bmatrix} A_{kn} \\ A \end{bmatrix}; \quad A_f \in \mathbb{R}^{m \times n}, \; A_{kn} \in \mathbb{R}^{(m-l) \times n}, \; A \in \mathbb{R}^{l \times n}$    (1.14)

It is assumed that the (m − l) linear equations are known to the user and the remaining l are to be estimated. $A_f$ and $A_{kn}$ represent the full and known constraint matrices, respectively. It should be noted that the first constraint in (1.13) is imposed only on the unknown segment of the full constraint matrix, to obtain a unique subspace up to a rotation. Consider the simple flow mixing network example shown in Figure 2.1.

[Figure: a three-node flow network in series. Stream x1 enters node 1; x2 connects nodes 1 and 2; x3 connects nodes 2 and 3; x4 leaves node 3; and x5 runs from node 1 to node 3.]

Figure 2.1 Flow mixing case study

This network could be easily seen in various engineering disciplines like electrical circuits or

water distribution in pipelines. The flow balance at each node, at any time instant 𝑡 can be

stated as:

$x_1(t) - x_2(t) - x_5(t) = 0$  (Node 1)
$x_2(t) - x_3(t) = 0$  (Node 2)    (1.15)
$x_3(t) - x_4(t) + x_5(t) = 0$  (Node 3)

The model equations of this flow network corresponding to noise-free measurements at the three nodes can be stated as $A_0 X(t) = 0$, where,

1 1 0 0 1 
A0  0 1 1 0 0  (1.16)
0 0 1 1 1

X  t    x1  t  x2  t  x3  t  x4  t  x5  t   (1.17)

As stated earlier, the noise-free measurements X(t) are not accessible. Instead, we are supplied the noisy measurements of X(t), denoted by Y(t) in Equation (1.2). It is assumed that a collection of N such noisy measurements is available, as stated in Equation (1.4). The noise used to corrupt the true measurements is given by $e(t) \sim \mathcal{N}(0, \sigma^2 I_{5 \times 5})$, where $\sigma = 0.0909$ and $I_{5 \times 5}$ is the $5 \times 5$ identity matrix. For this case study, we assume a priori knowledge of the linear relation generated by the flow balance at node 1. Therefore,

$A_{kn} = \begin{bmatrix} 1 & -1 & 0 & 0 & -1 \end{bmatrix}$    (1.18)

One of the naive approaches would be applying PCA without utilizing the knowledge about

the known linear relation. Eigenvalue decomposition of the sample covariance matrix defined in

Equation (1.7) is adopted to obtain the constraint matrix estimate by PCA, denoted by 𝐴̂𝑝𝑐𝑎 .

The eigenvectors corresponding to the three smallest eigenvalues provide $\hat{A}_{pca}$.

 0.23 0.49 0.02 0.70 0.46 
ˆA   0.12 0.49 0.79 0.20 0.30  (1.19)
pca  
 0.74 0.39 0.05 0.32 0.44

It may be argued intuitively that applying PCA directly in the above case by ignoring the

available information will drive the user away from the true system configuration. This will be

later used for comparison to the proposed method. We proceed to discuss the proposed

algorithm, termed constrained principal component analysis (cPCA). The objective of this

algorithm is to utilize the available information and estimate only the unknown part of

the constraint matrix, as formulated in Equations (1.12) and (1.13). For any general known part of the constraint matrix $A_{kn} \in \mathbb{R}^{(m-l) \times n}$,

$A_{kn}\, y(t) = A_{kn}\, x(t) + A_{kn}\, e(t) = A_{kn}\, e(t) \quad \forall t$    (1.20)

For a collection of N measurements defined in Equation (1.4), the above may be restated as,

$A_{kn} Y = A_{kn} X + A_{kn} E = A_{kn} E$    (1.21)

To estimate a basis for the rest of the linear relations, we attempt to work with the data projected onto the null space of $A_{kn}$. This can be mathematically stated as,

$A_{kn}^{\perp} X_p = X; \quad A_{kn}^{\perp} \in \mathbb{R}^{n \times (n-m+l)}, \; X_p \in \mathbb{R}^{(n-m+l) \times N}$    (1.22)

Where Akn denotes the null space of 𝐴𝑘𝑛 . As the noise-free measurements are not available,

Equation (1.22) is restated as,

Akn X p  Y  E (1.23)


It should be noted that estimating $X_p$ given $A_{kn}^{\perp}$ and Y leads to an overdetermined set of equations, as there are n equations for each set of the (n − m + l) variables in the columns of $X_p$. This leads to a total of N × n equations in N(n − m + l) variables. An estimate of the data projected onto the null space of $A_{kn}$ can thus be obtained in a least squares sense.

$\hat{Y}_p = \left(A_{kn}^{\perp}\right)^{\dagger} Y = \left(\left(A_{kn}^{\perp}\right)^T A_{kn}^{\perp}\right)^{-1} \left(A_{kn}^{\perp}\right)^T Y$    (1.24)

where $\hat{Y}_p$ denotes an estimate of $X_p$ and $\left(A_{kn}^{\perp}\right)^{\dagger}$ denotes the pseudo-inverse of $A_{kn}^{\perp}$. The unknown part of the constraint matrix estimate, denoted by A in the full constraint matrix $A_f$ presented in Equation (1.14), can be estimated by applying PCA on the projected data $\hat{Y}_p$ shown in Equation (1.24). The sample covariance matrix of the projected data can be defined similarly to Equation (1.7),

$S_{yp} = \frac{1}{N} \hat{Y}_p \hat{Y}_p^T; \quad S_{yp} \in \mathbb{R}^{(n-m+l) \times (n-m+l)}$    (1.25)

The eigenvalue decomposition of $S_{yp}$, as defined earlier, can be written as,

$S_{yp} U_p = U_p \Lambda_p$    (1.26)

The eigenvectors corresponding to the l smallest eigenvalues in $\Lambda_p$, denoted $\hat{A}_p$, provide a basis for the constraint matrix of the data in the projected space. It should be noted that the original data in the n-dimensional space were projected into the lower (n − m + l)-dimensional space to estimate the l linear relations.

$\hat{A}_p = \left(U_p\left[:,\; (n-m+1):(n-m+l)\right]\right)^T; \quad \hat{A}_p \in \mathbb{R}^{l \times (n-m+l)}$    (1.27)

Using the above with Equations (1.22) and (1.24), the following can be stated:

$\hat{A}_p X_p = 0_{l \times N}$    (1.28)

$\hat{A}_p \left(A_{kn}^{\perp}\right)^{\dagger} X = 0_{l \times N} \;\Rightarrow\; \hat{A} X = 0_{l \times N}$    (1.29)

So, the constraint for the original n-dimensional space can be obtained from the reduced-dimensional space by using

$\hat{A} = \hat{A}_p \left(A_{kn}^{\perp}\right)^{\dagger} = \hat{A}_p \left(\left(A_{kn}^{\perp}\right)^T A_{kn}^{\perp}\right)^{-1} \left(A_{kn}^{\perp}\right)^T; \quad \hat{A} \in \mathbb{R}^{l \times n}$    (1.30)

The full constraint matrix can be obtained as stated in Equation (1.14). Revisiting the flow mixing case study of 5 variables, the full constraint matrix obtained is stated below. Please note that $A_{kn}$ is specified in Equation (1.18).

 1 1 0 0 1 
 Akn  
Aˆ pca      0.53 0.34 0.14 0.74 0.19  (1.31)
ˆ
 A   0.07 0.42 0.77 0.30 0.36
 

To investigate the goodness of the estimates from both methods, we utilize the subspace-dependence based metric stated in Narasimhan and Shah [31], briefly mentioned here. The subspace-dependence metric can be viewed as the distance between the row spaces of the true ($A_0$) and estimated ($\hat{A}$) constraint matrices. The minimum distance of each row of $A_0$ from the row space of $\hat{A}$ in a least squares sense is given by

 
m 1
   A0i  A0i Aˆ T AA
ˆ ˆT Aˆ (1.32)
i 1

The true constraint matrix specified in Equation (1.16) is used to evaluate the accuracy of the estimates obtained by PCA and cPCA, specified in Equations (1.19) and (1.31). The subspace metric defined in Equation (1.32) is used to compare the estimates.
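A direct NumPy transcription of Equation (1.32) could look as follows (a sketch; the function name is our own):

    import numpy as np

    def subspace_metric(A0, A_hat):
        """Subspace-dependence metric of Equation (1.32): summed least squares
        distance of each row of the true A0 from the row space of A_hat."""
        P = A_hat.T @ np.linalg.inv(A_hat @ A_hat.T) @ A_hat   # projector onto row space of A_hat
        return sum(np.linalg.norm(a - a @ P) for a in A0)      # residual of each true row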

PCA  0.0295, cPCA  0.0197 (1.33)

It may be easily inferred from the subspace dependence metric values that the proposed

algorithm cPCA outperforms PCA. This simple case-study with synthetic data was presented

for the ease of understanding the notations and demonstrating the key idea of cPCA. In this

section, the discussion started from the problem of estimating the constraint matrix when a

subset of linear relations is already specified. Ideally, one could easily formulate this problem as a squared error cost function with appropriate constraints, as specified in Equations (1.12) and (1.13). But, unfortunately, the inclusion of the a priori available linear relations deviates the cPCA optimization problem specified in Equations (1.12) and (1.13) from the standard PCA optimization problem stated in Equations (1.5) and (1.6).

The novel contribution of this work is to wisely utilize the available information about a subset of linear relations and transform the original problem stated in Equations (1.12) and (1.13) into a PCA-friendly framework. This rewarding step provides us the freedom to include the prior available information and also the ease of implementation through the analytical solution by PCA. Basically, this is performed in two steps. The first is projecting the data onto the null space of the known linear relations, and the second is applying PCA in the reduced space. Finally, the obtained solution is transformed back from the reduced to the original space. The pseudo code of the proposed cPCA algorithm is given below. We show the efficacy of the proposed algorithm over PCA on another multivariable case-study in the next subsection.


1. Obtain the null space $A_{kn}^{\perp}$ for the given set of (m − l) linear relations among n variables.
2. Obtain the projection of the data, $\hat{Y}_p$, onto the null space $A_{kn}^{\perp}$ using Equation (1.24).
3. Apply PCA on the lower-dimensional projected data $\hat{Y}_p$ to obtain $\hat{A}_p$.
4. Transform the estimated $\hat{A}_p$ back to the original space using Equation (1.30).
5. Construct the full constraint matrix using Equation (1.14).
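These five steps translate almost line for line into NumPy/SciPy. The sketch below is one possible implementation (function and variable names are our own; the data matrix is assumed to be arranged as n x N, variables in rows):

    import numpy as np
    from scipy.linalg import null_space

    def cpca(Y, A_kn, l):
        """Sketch of the cPCA steps above. Y is the n x N noisy data matrix,
        A_kn holds the (m - l) known linear relations as rows, and the l
        unknown relations are estimated. Returns the full m x n constraint
        matrix of Equation (1.14)."""
        Nkn = null_space(A_kn)                  # step 1: n x (n-m+l) null-space basis
        Yp = np.linalg.pinv(Nkn) @ Y            # step 2: projected data, Equation (1.24)
        Syp = (Yp @ Yp.T) / Y.shape[1]          # sample covariance in the projected space
        _, Up = np.linalg.eigh(Syp)             # step 3: PCA in the reduced space
        Ap = Up[:, :l].T                        # l smallest-eigenvalue eigenvectors
        A_unk = Ap @ np.linalg.pinv(Nkn)        # step 4: back to original space, Eq. (1.30)
        return np.vstack([A_kn, A_unk])         # step 5: stack as in Equation (1.14)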

2.4.1 Model Identification when a subset of linear relations is known

In this case study, we intend to study the goodness of the estimates obtained by the constrained

PCA algorithm. For this purpose, we consider a system with 5 linear relations among 10

variables. The constraint matrix $A_0$ of dimension $5 \times 10$, corresponding to 5 linear relations, is chosen randomly. It follows:

$A_0\, x(t) = 0_{5 \times 1}; \quad A_0 \in \mathbb{R}^{5 \times 10}, \; x(t) \in \mathbb{R}^{10 \times 1}$    (1.34)

16
where x(t) is a column vector containing the noise-free measurements of the 10 variables at time instant t. A thousand such noise-free measurements of x(t) are generated using the null space of $A_0$. It can be inferred from Equation (1.34) that X lies in the null space of $A_0$. Hence, the noise-free data are generated as linear combinations of a null space basis, with the coefficients chosen randomly. We use Equation (1.2) for generating the noisy data Y. The noise is characterized by $e(t) \sim \mathcal{N}(0, \sigma^2 I_{10 \times 10})$ with standard deviation $\sigma = 0.0113$. The proposed cPCA algorithm can be applied when a subset of the linear constraints is known a priori. In this case study, we stepwise increase the number of linear constraints known a priori and observe the effect on the quality of the results. For the purpose of comparison, the constraint matrix is estimated via the traditional approach (PCA) and the proposed algorithm (cPCA). For a given $\hat{A}$, the reconciled estimate of the measurements, $\hat{Y}$, can be computed.
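One standard way to compute this reconciled estimate, requiring the reconciled values to satisfy $\hat{A}\hat{Y} = 0$, is the least squares projection of each measurement onto the null space of $\hat{A}$ (a sketch of the usual orthogonal projection):

$\hat{Y} = \left(I_n - \hat{A}^T \left(\hat{A}\hat{A}^T\right)^{-1} \hat{A}\right) Y$

which reduces to $\hat{Y} = \left(I_n - \hat{A}^T \hat{A}\right) Y$ when the rows of $\hat{A}$ are orthonormal, as is the case for the PCA estimate.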

The accuracy of the estimated model is characterized by the 2-norm of the error. The error is usually calculated with respect to the noisy measurements, since only noisy measurements are available in practice, but in this case we also calculate the error with respect to the true measurements for the purpose of comparison.

$Err_{meas} = \left\lVert Y - \hat{Y} \right\rVert_2, \qquad Err_{true} = \left\lVert X - \hat{Y} \right\rVert_2$    (1.35)

where $\hat{Y}$ is estimated by the PCA and cPCA algorithms. $Err_{meas}$ can also be visualized as the

cost function in the PCA algorithm as stated in (1.5). The results by both the algorithms are

presented in Figure 2.2.

It can be inferred from Figure 2.2:

1. The PCA algorithm gives a lower cost function value compared to cPCA when the error is calculated with respect to the measurements. This is not surprising, as the cPCA algorithm has the same objective function with additional constraints. It should be noticed that as the number of linear relations known a priori is increased, the difference in the cost function between cPCA and PCA increases as additional constraints are included.

2. It can be inferred from the plot of $Err_{true}$ that the constraint matrix estimate by the cPCA algorithm is much closer to the true process compared to PCA. It should be noted that as more linear relations are supplied, the estimate by cPCA is driven towards the true values.

In this case study, we demonstrated the efficacy of the estimated constraint matrix by the

proposed cPCA algorithm for a complex network. In the next section, we consider a tougher

problem of estimating the model when the structure of the constraint matrix is known instead

of a subset of linear relations as seen in this section.

2.5 Model Identification with known model structure (sPCA)

In this section, we address a more challenging and practical problem of incorporating the

knowledge about the structure of the entire constraint matrix. This essentially means we assume

to have a priori knowledge of the set of variables that combine to satisfy each linear relationship. For example, the structure of the constraint matrix for the flow mixing case study presented in Figure 2.1 would be

* * 0 0 *
structure  A0   * * * 0 0 (1.36)
0 0 * * * 

The above structure provides us the essential information about the set of variables combining

linearly at each node of the flow network. This information about which variables are related

by a linear relation may be easily available in flow distribution networks [31]. Utilizing this

valuable information in the formulation of the optimization problem (one optimization problem

for each constraint) for estimation of the constraint matrix will lead us to a better solution as

discussed earlier.

Figure 2.2 Euclidean norm of residuals using both approaches

In this section, we present a novel approach to estimate the constraint matrix of a given

structure without getting mired in explicit sparsity constraints. The key difference between the methodology of the proposed algorithm and the existing frameworks is that each row of the constraint matrix, i.e., each linear relation, is estimated separately rather than the whole

constraint matrix. The linear relations estimated sequentially are stacked together at the end to

construct the entire constraint matrix. This idea of estimating linear relations separately equips

us with considerable freedom to incorporate the structural constraints without diving into

sparsity constraints. Of course, it brings in some new challenges which are addressed in a

detailed manner. In order to demonstrate a wide range of challenges and the proposed remedies,

simple constraint matrices are considered. The first step of the proposed algorithm is rearranging the rows of the constraint matrix structure in ascending order of the cardinality of non-zero elements in each row. We chose an example where this step can be skipped to illustrate the key idea, but it is explained later in the section.

Linear relations corresponding to each row of the constraint matrix structure are separately estimated via sub-selection of variables. For example, in the flow mixing case study, the constraint matrix given in Equation (1.16) can be estimated by applying PCA to the subset of variables participating at each node separately. For instance, at node 1 in Figure 2.1, variables $y_1$, $y_2$ and $y_5$ will be considered.

Ysub1  t    y1  t  y2  t  y5  t   (1.37)

Applying PCA on a collection of N measurements of $Y_{sub1}(t)$ will deliver a row vector $A_{sub1}$ of dimension $1 \times 3$ such that $A_{sub1} X_{sub1}(t) = 0$, where $X_{sub1}(t)$ contains the noise-free measurements of the sub-selected set of variables corresponding to $Y_{sub1}(t)$ in Equation

(1.37). It should be noted that the estimated constraint row vector will only contain the non-zero entries corresponding to the sub-selected variables. Basically, we mean that the structure will be,

$\hat{A}_{sub1} = \begin{bmatrix} \hat{a}_{11} & \hat{a}_{21} & \hat{a}_{51} \end{bmatrix}$    (1.38)

Where 𝑎̂𝑖1 correspond to the coefficient of 𝑖 𝑡ℎ variable. The desired structure for the first row

of the constraint matrix could be constructed by appending zeros at the desired locations as

shown below

$\hat{A}_1 = \begin{bmatrix} \hat{a}_{11} & \hat{a}_{21} & 0 & 0 & \hat{a}_{51} \end{bmatrix}$    (1.39)

This procedure can be similarly applied at nodes 2 and 3 in Figure 2.1 to estimate the row constraint vectors $\hat{A}_2$ and $\hat{A}_3$, respectively. The entire constraint matrix can be constructed by stacking the estimated linear relations.
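For the flow mixing structure in Equation (1.36), this node-wise estimation can be sketched in a few lines of NumPy (names are our own; the data matrix Y is assumed to be n x N, variables in rows):

    import numpy as np

    def spca_row(Y, support, n):
        """Sketch of one sPCA step: estimate a single linear relation whose
        non-zero entries are restricted to `support` (e.g. [0, 1, 4] for node 1
        of Figure 2.1). PCA is applied to the sub-selected variables only, and
        the coefficients are embedded back at the desired locations, as in
        Equation (1.39)."""
        Ysub = Y[support, :]                     # measurements of participating variables
        Ssub = (Ysub @ Ysub.T) / Y.shape[1]      # covariance of the sub-selected set
        _, U = np.linalg.eigh(Ssub)
        row = np.zeros(n)
        row[support] = U[:, 0]                   # eigenvector of the smallest eigenvalue
        return row

    # Stacking the rows estimated node by node gives the full constraint matrix:
    # A_hat = np.vstack([spca_row(Y, s, 5) for s in ([0, 1, 4], [1, 2], [2, 3, 4])])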

The true constraint matrix specified in Equation (1.16) and the subspace dependence metric of Equation (1.32) are used to evaluate the efficacy of the constraint matrix estimated by the proposed algorithm, which we term structural principal component analysis (sPCA). The proposed algorithm is tested over 1000 Monte Carlo (MC) runs with SNR = 10, and the averaged subspace dependence metric is reported in Equation (1.40). It can be easily inferred from Equation (1.40) that the sPCA estimate is much closer to the true constraint matrix compared to PCA.

$\eta_{PCA} = 0.1293, \quad \eta_{sPCA} = 0.1188$    (1.40)

It is interesting to note from Figure 2.1 that nodes 1, 2 and 3 can be considered as a single node to derive a linear relation among the variables $x_1$ and $x_4$. So applying traditional PCA may reveal the linear relation among the variables $x_1$ and $x_4$. Unfortunately, this phenomenon creates a challenging issue, which can be dealt with by an appropriate modification of the sPCA approach discussed previously. To illustrate this phenomenon, let us consider another simple example of a desired constraint matrix structure, stated below:

* * * * 0 *
* * * * 0 0 
structure  A0    (1.41)
* 0 * 0 0 0
 
* * * * 0 0

We intend to estimate each linear relation separately, starting from the first row of $structure(A_0)$ specified in Equation (1.41). The sub-selected variables would be

Ysub1  t    y1  t  y2  t  y3  t  y4  t  y6  t   (1.42)

Applying PCA on N measurements of $Y_{sub1}(t)$ may not deliver the desired structure specified in the first row of $structure(A_0)$ in Equation (1.41). This may occur because the complementary sets of zero locations in rows 2, 3 and 4 of $structure(A_0)$ in Equation (1.41) are subsets of the complementary set of zero locations in row 1. It basically means that the idea of applying PCA on sub-selected variables does not guarantee a non-zero coefficient for the selected variables. Sub-selection only guarantees a zero coefficient for the discarded variables. Ignoring this fact could lead us to estimate a linear relation corresponding to the structure specified in row 2 of Equation (1.41) when we intend to estimate the relation corresponding to the structure of row 1. If we ignore the above scenario and proceed to estimate the 2nd row of the constraint matrix with the desired structure by sub-selection of variables, we may end up estimating the same previously estimated linear relation. This may also lead us to miss the first constraint, as the variable $x_6(t)$ will not be sub-selected in any of the consecutive iterations.

We propose a novel approach to deal with such a scenario. The primary concern is the ambiguity in whether the estimated relationship has the intended structure; this ambiguity arises mainly because constraints with more zero entries are estimated afterward. Such a situation can be avoided by re-configuring the structure of the given constraint matrix: since we want constraints with fewer zeros to be estimated later, the corresponding rows are pushed down. That is, the constraint matrix is re-structured in ascending order of the number of non-zero locations in each row. The objective of this step is to avoid re-estimating individual constraints that have already been estimated. The constraint matrix in Equation (1.41) can be re-structured as

* 0 * 0 0 0
* * * * 0 0 
structure  A0    (1.43)
* * * * 0 0
 
* * * * 0 *

This step ensures that constraints with a lower cardinality of non-zero elements are obtained before constraints with a higher cardinality, but it still does not resolve the ambiguity of obtaining the same linear relation (constraints with a similar structure) when constraints involving more variables are to be estimated. We propose a two-step remedy, illustrated as follows:

1. Detection: Such cases can be identified by a rank check on the linear relation obtained at each step. Let the constraint matrix estimated up to the $i^{th}$ row be $\hat{A}_i$ and the linear relation obtained for the $(i+1)^{th}$ row be $\hat{a}_{i+1}$. If the constraint obtained at the $(i+1)^{th}$ step is just a linear combination of previously estimated constraints, then the rank of $\begin{bmatrix} \hat{A}_i \\ \hat{a}_{i+1} \end{bmatrix}$ will be the same as the rank of $\hat{A}_i$. This idea is used to detect a previously estimated constraint.

2. Identification: It should be noted that the cause for detecting a previously estimated constraint is the existence of multiple constraints over the same variable subset. In order to filter the right constraint from this set, the rank check is utilized again. Let the full row rank constraint matrix estimated up to the $i^{th}$ row be $\hat{A}_i$. For the $(i+1)^{th}$ row, we propose to consider all the eigenvectors instead of only the eigenvector corresponding to the minimum eigenvalue, because the set of all eigenvectors contains all the constraints identified up to the $(i+1)^{th}$ iteration. For example, in the 2nd iteration for the structure provided in Equation (1.43), the subset of variables would be

$$Y_{sub2}(t) = \begin{bmatrix} y_1(t) & y_2(t) & y_3(t) & y_4(t) \end{bmatrix} \quad (1.44)$$

Applying PCA on $N$ measurements of $Y_{sub2}(t)$ should ideally reveal 3 linear relations, but it is known from the given structure that only 2 linear constraints exist for this particular row structure. Those 2 linear relations can be filtered from the 3 constraints using a rank check. The above procedure is formally stated below.

We define the matrix $\hat{B}_{i+1}$, which contains the eigenvectors along its rows in the $(i+1)^{th}$ iteration. It should be noted that these eigenvectors are arranged along the rows such that the eigenvalues increase with increasing row number. Let the dimension of $\hat{B}_{i+1}$ be $n_{i+1} \times n_{i+1}$ and its $j^{th}$ row be denoted by $\hat{b}_{i+1,j}$. First, we make the hypothesis that the $j^{th}$ row of $\hat{B}_{i+1}$, namely $\hat{b}_{i+1,j}$, contains an independent constraint. We define

$$\hat{A}_{i,j} = \begin{bmatrix} \hat{A}_i \\ \hat{b}_{i+1,j} \end{bmatrix} \quad (1.45)$$

To test the hypothesis, we compare the ranks of $\hat{A}_{i,j}$ and $\hat{A}_i$. If the ranks of both matrices are equal, then $\hat{b}_{i+1,j}$ is rejected; otherwise, $\hat{A}_i$ is updated using Equation (1.46) because it contains a new relation:

$$\hat{A}_i = \begin{bmatrix} \hat{A}_i \\ \hat{b}_{i+1,j} \end{bmatrix} \quad (1.46)$$

The number of constraints to be chosen in this $(i+1)^{th}$ iteration is known from the given structure; let it be $m_{i+1}$. This process of detecting and filtering the right constraints is carried out until $m_{i+1}$ constraints are identified, as sketched below.
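The following is a minimal MATLAB sketch of this detection-and-identification loop (illustrative only, not the thesis code; A_hat, B and m_next are hypothetical names for the accepted constraint matrix, the zero-padded eigenvector matrix of the current iteration, and the expected number of constraints for this row structure):

```matlab
% Minimal sketch: rank-based detection (reject dependent candidates) and
% identification (accept new relations) in iteration i+1. Rows of B are the
% eigenvectors, ordered by increasing eigenvalue and zero-padded to full width.
% With noisy data, rank() would in practice need an explicit tolerance.
accepted = 0;
for j = 1:size(B, 1)
    candidate = [A_hat; B(j, :)];       % hypothesis: row j is an independent constraint
    if rank(candidate) > rank(A_hat)    % rank increases: new relation, Equation (1.46)
        A_hat    = candidate;
        accepted = accepted + 1;
    end                                 % ranks equal: previously estimated, reject
    if accepted == m_next               % all expected constraints found
        break;
    end
end
```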

The estimated constraint matrix can easily be reconfigured according to the originally specified structure once all the constraints are estimated for the re-structured $A_0$. In this section, we discussed the main theme of sub-selecting variables in the proposed algorithm with the help of the flow-mixing case study, which demonstrated the efficacy of the results obtained via the proposed algorithm. The various challenges and their remedies were then illustrated with the help of another constraint matrix. The pseudo code of the proposed sPCA algorithm is provided below. Three diverse case studies are presented in the next sub-section to show the utility and performance of the proposed algorithm.

1. Given the structure of the constraint matrix 𝐴𝑠𝑡𝑟𝑢𝑐𝑡 of dimension (𝑚 × 𝑙) configure it such

that 𝑓(𝑖 + 1) ≥ 𝑓(𝑖); ∀𝑖 ∈ {1, … , 𝑚 − 1} where 𝑓(𝑖) is the number of non-zero elements

in row 𝑖 of 𝐴𝑠𝑡𝑟𝑢𝑐𝑡 . Let the re-configured matrix be 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 . Let 𝐺(𝑗) be the count of the

number of rows in 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 having a similar structure with 𝑗 𝑡ℎ row of 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 . Initialize

𝐴̂𝑒𝑠𝑡,𝑖 = [ ] for iteration 𝑖 = 1.

2. For iteration 𝑖 ≥ 2, perform the structure similarity test of 𝑖 𝑡ℎ and 𝑗 𝑡ℎ rows of 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 ,

where 𝑗 ∈ {1, … , (𝑖 − 1)}. If there is any match, discard the 𝑖 𝑡ℎ row of 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 and revisit

step 2 with 𝑖 = 𝑖 + 1, else proceed to next step.

3. For iteration 𝑖, apply PCA on the sub-selected set of variables from 𝑌 corresponding to the

structure of 𝑖 𝑡ℎ row of 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 . Let the number of sub-selected variables and

measurements matrix be 𝑛𝑠𝑢𝑏,𝑖 and 𝑌𝑠𝑢𝑏,𝑖 respectively. Collect all eigenvectors of the sample

covariance matrix of 𝑌𝑠𝑢𝑏,𝑖 to obtain 𝐴̂𝑠𝑢𝑏,𝑖 .

4. Include zeros in 𝐴̂𝑠𝑢𝑏,𝑖 corresponding to the structure of 𝑖 𝑡ℎ row of 𝐴𝑟𝑒−𝑠𝑡𝑟𝑢𝑐𝑡 to obtain 𝐴̂𝑖 .

Note that the dimension of 𝐴̂𝑖 is 𝑛𝑠𝑢𝑏,𝑖 × 𝑛.

5. Filter the correct linear relations by performing a rank test on constraints identified in

iteration 𝑖. For 𝑘 = {1, … , 𝑛𝑠𝑢𝑏,𝑖 }.

$$\hat{A}_{est,i} = \begin{cases} \hat{A}_{est,i}, & \mathrm{rank}\left(\hat{A}_{est,i}\right) = \mathrm{rank}\left(\hat{A}_{est,i,k}\right) \\[4pt] \begin{bmatrix} \hat{A}_{est,i} \\ \hat{A}_i(k,:) \end{bmatrix}, & \mathrm{rank}\left(\hat{A}_{est,i}\right) \neq \mathrm{rank}\left(\hat{A}_{est,i,k}\right) \text{ and } nrow\left(\hat{A}_{est,i}\right) < nrow\left(\hat{A}_{est,i-1}\right) + G(i) \end{cases} \quad (1.47)$$

where $\hat{A}_{est,i,k} = \begin{bmatrix} \hat{A}_{est,i} \\ \hat{A}_i(k,:) \end{bmatrix}$, $\hat{A}_i(k,:)$ denotes the $k^{th}$ row of $\hat{A}_i$, $nrow(\hat{A}_{est,i})$ denotes the number of rows in $\hat{A}_{est,i}$, and $G(i)$ is defined in step 1. Terminate this step once $nrow(\hat{A}_{est,i}) = G(i) + nrow(\hat{A}_{est,i-1})$ to improve computational efficiency.

6. Repeat the entire procedure from step 2 until all $m$ constraints have been estimated, i.e., until $nrow(\hat{A}_{est}) = m$.

7. Map the estimated constraint matrix to the original form supplied by the user in step 1.

2.5.1 Case study 1

This is a synthesised case study to show the efficacy of the proposed approach when the structure of the constraints is known. The original constraints and the corresponding structural information are given below. The constraint matrix involves six variables, two of which do not participate in (i.e., are absent from) the constraints considered in this case study.

$$A_0 = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 2 & 3 & 0 & 0 & 0 \\ 3 & 1 & 1 & 2 & 0 & 0 \end{bmatrix} \qquad structure(A_0) = \begin{bmatrix} * & * & 0 & 0 & 0 & 0 \\ * & * & * & 0 & 0 & 0 \\ * & * & * & * & 0 & 0 \end{bmatrix} \quad (1.48)$$

To compare the proposed sPCA approach with traditional PCA, 500 MC simulations were run for SNR values of 10, 20, 50, 100, 200, 500, 1000 and 5000. For each MC simulation at each SNR value, data is generated for 1000 random samples. The subspace dependence metric is evaluated for each estimated constraint matrix and averaged at each SNR value. These metric values are reported in Figure 2.3; it can be observed from the figure that including the available process information improves the estimates.

Figure 2.3 Comparison of model estimates by sPCA and PCA at different SNRs

2.5.2 Case study 2

The system considered in this case study is the steam metering process, which has been considered by many researchers for testing data reconciliation and gross error detection approaches [31], [42], [43]. The network contains 28 flow variables and 11 flow constraints. The data is generated by varying 17 flows (F4, F6, F10, F11, F13, F14, F16 - F22, F24, F26 - F28) independently using a first order ARX model for 1000 time samples; the flow rates of the remaining flows are obtained by using the flow constraints at each time sample. The flowsheet of the steam metering process can be seen in Figure 2.4.

Assuming the structure of the plant is known, the flow constraint matrix is estimated using both PCA and sPCA for 1000 runs at each SNR value. The mean closeness measure of the constructed constraint matrices to the original matrix for different SNR values is provided in Figure 2.5. It is interesting to note that, except at SNR 10, sPCA delivers better estimates than PCA in all 1000 runs.

Figure 2.4 Flow network of the steam metering process for a methanol synthesis plant

Figure 2.5 Comparison of model estimates of the steam metering process at different SNRs

2.5.3 Case study 3

We intend to show the superiority of the model estimates obtained by the sPCA algorithm in this simulation study. We consider the system with the constraint model mentioned in Equation (1.41). The model is assumed to be $A_0 X = 0$, where

$$A_0 = \begin{bmatrix} 3 & 1 & -1 & 2 & 0 & -6 \\ 2 & 1 & -2 & 1 & 0 & 0 \\ 1 & 1 & -1 & 0 & 0 & 0 \\ 1 & -3 & 1 & 1 & 0 & 0 \end{bmatrix} \qquad structure(A_0) = \begin{bmatrix} * & * & * & * & 0 & * \\ * & * & * & * & 0 & 0 \\ * & * & * & 0 & 0 & 0 \\ * & * & * & * & 0 & 0 \end{bmatrix} \quad (1.49)$$

It should be noted that the structure of $A_0$ in Equation (1.49) matches the structure specified in Equation (1.41). Data is generated with the same procedure followed in section 2.4.1. We perform MC simulations of 100 runs at various signal-to-noise ratios (SNRs) to demonstrate the goodness of the estimates obtained by the proposed algorithm, sPCA. For the purpose of comparison, the model was also estimated with the PCA algorithm, and the subspace dependence metric defined in Equation (1.32) is used to evaluate the quality of the obtained estimates. The averaged subspace dependence metric for both algorithms at each SNR value can be seen in Figure 2.6.

From the plot, it can easily be noticed that sPCA outperforms PCA at SNRs above 50. In order to improve the performance at the other noise levels, we propose a combination of cPCA and sPCA in the next section. The key idea is to account for constraints with similar structure during model estimation. For example, rows 2 and 4 in Equation (1.49) have the same structure. This information is used to modify the proposed algorithm slightly.

Figure 2.6 Comparison of model estimates at different SNRs for case study 3
2.6 Constraint Structural PCA

Structural PCA performed better than PCA when the structural information of the network is known. cPCA also showed better performance than PCA when one or more true equations are known (or obtained). In this section, we propose a combination of the cPCA and sPCA algorithms, termed CSPCA, which shows an improvement over sPCA. We discussed the approach of estimating each linear relation corresponding to a structure separately in section 2.5; all those linear relations were estimated independently, in a sequential manner. The key idea in this section is to utilize the information derived up to the $(i-1)^{th}$ row of the model when estimating the $i^{th}$ row.

This combined algorithm can be utilized in the presence of repeated constraints (i.e., two or more equations involving the same set of variables) or sub-structured constraints (i.e., the variable set involved in one equation is a subset of the variable set involved in another equation) in the available structural information. It is interesting to note that, in the absence of repeated or sub-structured constraints in the provided structural information, this algorithm reduces to sPCA. The pseudo code of the algorithm is as follows:

1. Arrange the constraints in ascending order of the number of variables involved in the individual equations.

2. For all constraints 1 to 𝑚, identify the variables set 𝜑𝑖 that are active in each constraint i.e.

𝜑𝑖 = {𝑗 | 𝐴(𝑖, 𝑗) ≠ 0}.

3. Now for each constraint 𝑖, identify the constraints (𝑗 from 1 to 𝑖 − 1) such that 𝜑𝑗 is a subset

of 𝜑𝑖 and store the sub-structured constraints indices set 𝜓𝑖 i.e. 𝜓𝑖 = {𝑗 | 𝜑𝑗 ⊆ 𝜑𝑖 ; ∀𝑗 =

{1, … , (𝑖 − 1)}}

4. Now for each constraint 𝑖, if the sub-structured constraints indices set 𝜓𝑖 is empty then label

the equation as 𝑆 else 𝐶 i.e. 𝐿𝑎𝑏𝑒𝑙𝑖 = {𝑆: |𝜓𝑖 | = 0 𝑒𝑙𝑠𝑒 𝐶}

5. Now, for all the constraints that are labeled as 𝑆 estimate the constraints using sPCA by

using structural information of individual constraints.

6. Now, for all the constraints that are labeled as 𝐶 estimate the constraints using cPCA,

assuming the estimated constraints set in 𝜓𝑖 as known.

7. Rearrange the equations in the given order and report the final estimated $\hat{A}$. (A minimal sketch of the labeling logic in steps 2-4 is given below.)
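As referenced in step 7, the following is a minimal MATLAB sketch of steps 2-4 (illustrative only, not the thesis code; Astruct is a hypothetical name for the re-arranged structure matrix, with rows assumed already sorted in ascending order of active variables, and non-zero entries marking active positions):

```matlab
% Minimal sketch of steps 2-4: active-variable sets phi_i, sub-structured
% index sets psi_i, and the S/C labels.
m     = size(Astruct, 1);
phi   = cell(m, 1);                        % variables active in each constraint
psi   = cell(m, 1);                        % indices of sub-structured constraints
label = repmat('S', m, 1);                 % default label: estimate with sPCA
for i = 1:m
    phi{i} = find(Astruct(i, :) ~= 0);     % step 2: active variable set
    for j = 1:i-1
        if all(ismember(phi{j}, phi{i}))   % step 3: phi_j is a subset of phi_i
            psi{i} = [psi{i} j];
        end
    end
    if ~isempty(psi{i})
        label(i) = 'C';                    % step 4: estimate with cPCA, psi_i known
    end
end
```

Note that equality of variable sets (repeated constraints) also satisfies the subset test, so repeated structures are labeled C as well, consistent with Table 2.1.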

Steps 1-4 of the above algorithm detect which constraints can be identified using sPCA and which using cPCA. For the case study described in section 2.5.3, steps 1-4 are performed and tabulated in Table 2.1 for a better understanding of the proposed CSPCA algorithm. The efficacy of the proposed algorithm on the case study described in section 2.5.3 is provided in Figure 2.7; it can be observed from Figure 2.7 that CSPCA outperforms sPCA and PCA at various noise levels.

Table 2.1 Understanding steps 1-4 of the CSPCA algorithm for case study 2.5.3

| Rearranged index | Constraint | Variables set $\varphi_i$ | Sub-structured constraints set $\psi_i$ | Label |
|---|---|---|---|---|
| 1 | [1, 1, -1, 0, 0, 0] | {1,2,3} | {} | S |
| 2 | [2, 1, -2, 1, 0, 0] | {1,2,3,4} | {1} | C |
| 3 | [1, -3, 1, 1, 0, 0] | {1,2,3,4} | {1,2} | C |
| 4 | [3, 1, -1, 2, 0, -6] | {1,2,3,4,6} | {1,2,3} | C |

Figure 2.7 Comparison of PCA variants performance at different SNRs for case study 3

2.6.1 ECC Case study

This system is a simplified version of the Eastman Chemical Company benchmark case study for testing process control and monitoring methods [44]. It involves 10 flows and 6 flow constraints; hence, the data is generated by varying 4 flows (F1, F5, F7, and F8) for 1000 time samples. F1 and F2 are mixed streams of reactants A and B with different compositions. F3 is a product stream carrying excess reactants A and B, which are separated using a separator. F4 is a pure product stream, whereas F9 and F10 are pure recycle streams of reactants A and B respectively. The flow network along with the flow constraints can be observed in Figure 2.8.

Figure 2.8 Flow network of simplified ECC benchmark case study

The last flow constraint is a material balance on component A at J1. Assuming the structure of the process is known, the flow constraint matrix is estimated over 1000 runs of MC simulations using PCA, sPCA, and CSPCA for different SNRs. The subspace dependence metric of the constructed constraint matrices with respect to the original matrix for different SNR values is provided in Figure 2.9, along with the number of runs in which each algorithm achieved the better comparison metric.

The flow constraint matrices constructed using the different algorithms are then tested for identifying faults in the flow rates. If the flow rates at a particular time violate the constraint matrix (the sum of the residuals exceeds a tolerance limit), the sample is considered faulty. For each SNR value (10, 20, 50, 100, 200, 500, 1000 and 5000), 50 noise-added data samples are selected at random and, in each sample, one of the variables is randomly modified to make the sample faulty. The flow constraint matrices obtained over the 1000 MC runs at each SNR value are averaged and taken as the final set of flow constraints. This final set, obtained using the proposed approaches, is tested for fault identification with a tolerance limit of 1. The number of original faults, obtained using the original constraint matrix at the same tolerance, is reported in Figure 2.10 along with the number of faults identified using the proposed approaches. It can be observed from the figure that CSPCA performs better than sPCA, which in turn is superior to PCA.

Figure 2.9 Comparison of PCA variants performance at different SNRs for ECC case study

Figure 2.10 Comparison of PCA variants performance for fault detection

2.7 Conclusion

In this study, we have formulated model identification schemes for process models with known structure. To the best of our knowledge, this is the first time such a scheme has been proposed. Implementation of the techniques in a few cases suggests an improvement over conventional PCA. Any additional process knowledge can be incorporated either by sub-selection of variables or by reducing the dimensionality of the variable set using the said process knowledge. We also proposed a model identification algorithm for the case when a few of the linear relations are known a priori; this was termed constrained PCA. We then proposed the combination of cPCA and sPCA, which provided a further improvement in performance compared to vanilla PCA and sPCA. The key idea in the integration of the two algorithms was to use the information provided by previously estimated linear relations for estimating further relations. We have also provided general guidelines about the applicability of the combined algorithm.

In this chapter, we proposed different ways of incorporating process information, such as a subset of constraints or the sparsity structure of the whole constraint matrix, into the PCA framework for better model estimates. Sparsity information for a chemical process can be obtained using first principles models, i.e., mass and energy balances. We also demonstrated that the proposed approaches can identify faults more effectively than traditional PCA. In the next chapter, we generalize a first principles derived model to estimate drug solubility in binary systems using machine learning approaches.

CHAPTER 3

Generalization of first principles derived model using machine

learning approaches to predict drug solubility in binary systems

Drug solubility is a major concern in pharmaceutics, for both drug delivery and discovery.

In drug delivery, it is important to achieve the desired concentration of drug in circulation for

achieving a required pharmacological response. Since the drug can reach the receptors through

aqueous media, aqueous soluble drugs are preferred for clinical purposes due to the concern of

oral bioavailability. Solubility also plays a crucial role in discovery and development

investigations. In the last decade, the number of drugs that fail to reach commercialization has been increasing due to low aqueous solubility [45]. Owing to this fact, several approaches have been proposed to increase drug solubility, such as the use of cosolvents [46], Deep Eutectic Solvents (DES) [47], solubilizing agents [48], pharmaceutical salts and co-crystals [49], and various other techniques [50], [51]. Di et al. [52] and Williams et al. [53] highlighted the major challenges faced by low-solubility drugs, such as in vitro and in vivo assessments during the drug discovery and development phases, and suggested some possible remedies.

3.1 Literature survey

Jorgensen and Duffy [54] made an attempt to model the aqueous solubility of drugs as a

linear function of features such as solvent-accessible surface area, number of hydrogen bonds,

etc., which are obtained using Monte Carlo simulations. In a follow-up paper [55], they outlined three different ideas to predict the aqueous solubility of drugs. The first was a group contribution method, where linear regression is performed on the available data to obtain the contribution of each fragment. The second was a linear regression based QSPR approach, where the features are obtained from the chemical structure. The third was a neural network (NN) based approach, which accounts for the non-linear behavior of the features in the QSPR approach. Though the NN-based approach allows identification of non-linear behavior, its drawback is that the internal processing of the NN is not lucid. Ran and Yalkowsky [56] verified the effectiveness of the general solubility equation (GSE), which merely requires the melting point and octanol-water partition coefficient of a drug to predict its aqueous solubility. Delaney

[57] reviewed several approaches to predict aqueous solubility using structural information of

the solute and highlighted the challenges in their applicability. Lusci et al. [58] demonstrated

the benefits of machine learning techniques such as deep learning networks over other state-

of-the-art methods used to predict the aqueous solubility of drugs.

Cosolvency is one of the most feasible solutions for increasing aqueous solubility of drugs

[46]. Cosolvency of non-aqueous solvents is also important during the drug development phase

[59]. The major advantage of mixed solvent systems is that the solvent-solvent interactions in

some compositions may allow more solute to dissolve than the single solvent systems [60].

One of the earliest models to predict drug solubility in water-cosolvent systems was proposed

by Yalkowsky [61]. Drawing inspiration from the thermodynamic mixing model, Acree [62]

derived a mathematical model to predict solute solubility in binary solvent systems at a fixed

temperature when the solubility values in pure solvents are available. Jouyban and Hanaee [63]

provided a better strategy for regressing the variables specified in the model proposed by Acree

[62] with a no intercept linear model. This mixed solvent model is further extended by Jouyban

and Acree [15] for varying temperatures. Chen and Song [64] proposed a Nonrandom Two-

Liquid Segment Activity Coefficient (NRTL-SAC) model to estimate drug solubility in pure

and mixed solvents using the molecular descriptors such as hydrophobicity, polarity, and

hydrophilicity in a thermodynamic framework. Mullins et al. [65] proposed COSMO-based

(Conductor-like Screening Models) thermodynamic models to predict solubility in both pure

and mixed solvent systems. Sheikholeslamzadeh and Rohani [66] estimated the solubility of

three different solutes in different mixed solvents systems using both experimental and

computational studies and concluded that the NRTL-SAC model performs better than the other

thermodynamic frameworks. Kokitkar and Plocharczyk [67] applied the NRTL-SAC model to

identify optimal solvents to support the crystallization process. Shu and Lin [68] improved the

efficacy of COSMO-SAC (segment activity coefficient) model by minimizing the error in the

prediction of solute-solvent interactions using the solubility data in pure solvents. Valavi et al.

[69] extended the NRTL-SAC model by introducing temperature dependent binary interaction

parameters, which results in better prediction of solubility values than the original NRTL-SAC

model.

Jouyban et al. [70] proposed a cosolvency model by consolidating different cosolvency

models proposed as a power series of the volume fraction of the cosolvent but with different

assumptions. Jouyban [59] compared several cosolvency models based on multiple prediction

accuracy criteria such as root mean square error (RMSE), mean percentage deviation (MPD),

etc., and concluded the Jouyban-Acree model [15] to be the most suitable approach for

pharmaceutical purposes. Jouyban et al. [71] made an attempt to generalize the Jouyban-Acree

model with Abraham’s solvent and solute parameters to predict drug solubility in some water-

cosolvent binary systems. This solute generalization approach considers solvents to be fixed,

i.e., one model for each binary solvent system regardless of the solute used. A detailed review of state-of-the-art methods for estimating drug solubility using both experimental and computational studies has been consolidated by Jouyban and Fakhree [72].

QSPR is a mathematical relationship identified between the physical response of a molecule

and its structural information. Structural information is denoted in the form of

descriptors/features which are numerical values associated with the chemical constitution of a

molecule structure ranging from atom counts to topological surface area. QSPR approaches

proved their ability to predict various physical properties of a molecule from its structural

information [73]. Identifying QSPR involves three major steps, i.e., data preparation, data

processing, and model interpretation. Data preparation involves the conversion of chemical

structures into a suitable form to calculate structural feature values. Structural features can be

obtained from derived mathematical models, experimental analysis and various platforms

designed for this purpose such as PaDEL-Descriptor, DRAGON, OpenBabel [74], etc. Data

processing is used for the removal of intercorrelated features and identifying optimal feature

set using any feature selection algorithm such as genetic algorithm, stepwise algorithm, etc.

Feature selection algorithms are intelligent ways of exploring the possible combinations of the overall feature set to obtain the most suitable subset of features [75]. The identified feature subset is then passed to a linear or non-linear modeling tool to obtain the relationship between the features and the response. Interpretation of the model requires knowledge about the behavior of the features and the response [74]. Roy et al. [76] provided an overview of QSPR/QSAR modeling

by consolidating the details of various structural descriptors along with the QSAR applications.

Yousefinejad and Hemmateenejad [77] reviewed various chemometric approaches used for

feature selection and model development for QSPR studies.

Selecting a suitable mathematical tool for identifying linear or non-linear behavior is not a

trivial task. Multiple linear regression (MLR) and partial least squares are efficient methods for capturing linear behavior, whereas neural networks and support vector machines are efficient for non-linear behavior. The disadvantage of neural networks is that the model will not

be available explicitly. Multiple model identification is one of the alternatives to identify non-

linear behavior, using piecewise linear models [78]. A detailed review of the state-of-the-art of

multiple model framework can be obtained in the recent two-part review [79], [80]. Among

several cosolvency models that have been proposed to predict drug solubility, the Jouyban-Acree model [15] draws the greatest attention due to its efficacy in predicting drug solubility in numerous binary solvent systems at different temperatures. Although the model form is unique, the model constants vary for each combination of solute and binary solvent. Correlating drug solubility

to structural features of the solute and both solvents through the Jouyban-Acree model can lead

to a universal model to predict drug solubility in any binary solvent system at any temperature.

In the current work, the QSPR approach is used to correlate drug solubility in binary solvent systems to structural features such as molar refractivity, molecular weight, McGowan volume, etc., of the solute and both solvents using a modified version of the Jouyban-Acree model. A brief

review of various drug solubility prediction methods is provided in section 1, while data

preparation and processing steps are discussed in section 2. For feature selection, a genetic algorithm is used on the selected features to obtain an independent feature set. In section 3, a linear dependency between drug solubility and the identified feature set is assumed and model coefficients are obtained using MLR as well as a weight-based optimization. In section 4, a

piecewise linear dependency of drug solubility on identified features is assumed and model

coefficients are obtained using a modified prediction error based clustering approach. Finally,

this chapter concludes with comments on the efficiency of the proposed models on drug solubility data collected from the literature.

3.2 Data preparation and processing

Experimental drug solubility data of 63 diverse binary systems with varying solutes and solvents is consolidated from various resources. In twenty-five out of sixty-three systems, the

solubility data is obtained from water-cosolvent systems for different solutes, whereas it is

obtained from non-aqueous solvent systems for the remaining. The data contains twenty-seven

different solutes, a majority of which are based on anthracene (twelve binary systems). For

twelve out of these sixty-three systems, solubility data is obtained at two different temperatures,

whereas the data is obtained at a single temperature for the rest of the systems. Temperatures in the collected data vary from 293 K to 308 K. Based on the above observations, the collected data is well distributed over different combinations of solute, solvents and temperature, and hence is suitable for obtaining an acceptable model to predict drug solubility. The experimental data consists of 766 samples, of which 150 belong to pure-solvent solubility estimations. The Jouyban-Acree model is designed in such a way that pure solubility values are always predicted at their experimental values irrespective of the model parameters used; hence, pure solubility samples are not considered for the mixed solubility prediction case studies. The data is further screened such that no data sample considered in this case study has a solute mole fraction less than 0.0001, reducing the sample count to 585.

PaDEL-Descriptor [81] is an open-source software package for calculating descriptors of different categories, ranging from constitutional to electrostatic descriptors. Structure files of all the 49 compounds involved in the data are generated in SMILES (.smi) format using MarvinSketch [82]. The generated structure files are processed using PaDEL-Descriptor to obtain

the structural features. Molar refractivity (AMR), McGowan characteristic volume

(McG_Vol), van der Waals volume (VABC), Molecular weight (MW), sum of atomic

polarizabilities (Apol) and first ionization potentials (Si) of both solvents and solute along with

topological polar surface area (TopoPSA), solvent accessible surface area (SolvAccSA),

excessive molar refraction (MLFER_E), combined polarizability (MLFER_S), overall solute

hydrogen bond acidity (MLFER_A) and basicity (MLFER_BH ) values of solute are selected

to account for the solvent-solvent and solute-solvent interactions [72]. All the 24 descriptors

are of different magnitudes; hence, the descriptors are scaled using their individual mean and standard deviation values (mean-centered scaling). The Jouyban-Acree model is modified by normalizing the temperature term with the room temperature, and this modified form is the basis for further investigation.

$$\ln X_s^T = f_1 \ln X_1^T + f_2 \ln X_2^T + f_1 f_2 \sum_{i=0}^{2} Q_i \,(f_1 - f_2)^i \left(\frac{T}{298}\right) \quad (2.1)$$

where $X_s^T$ is the solubility (in mole fraction) of the solute in the mixed solvent system at temperature $T$, $f_1$ and $f_2$ are the mole fractions of the solvents in the solvent mixture, $X_1^T$ and $X_2^T$ are the solute solubility values (in mole fraction) in the pure solvents at temperature $T$, arranged in increasing order (i.e., the solvent in which the solute solubility is lower is considered solvent 1), and $Q_0$, $Q_1$ and $Q_2$ are the Jouyban-Acree model constants, which depend on the solute and solvents involved in the system. In this work, these model constants are assumed to be linearly dependent on the selected structural features. Assuming $Q_i$ to be a linear function of the selected features, it can be expressed as

$$Q_i = \sum_{j=1}^{N} c_{i,j} F_j; \qquad i = 0, 1, 2 \quad (2.2)$$

where $F_j$ is the $j^{th}$ structural feature value and $N$ is the number of structural features. This makes it possible to generalize the model so that it can predict solubilities for novel systems, i.e., systems that were not used in obtaining the model. We now perform some algebraic manipulations to render the equations in a standard multivariate linear model form. Substituting Equation (2.2) in Equation (2.1):

 2  N  i  T 
ln X sT   f1 ln X 1T  f 2 ln X 2T   f1 f 2     ci , j Fj   f1  f 2    (2.3)
 i 0 j 1   298 
   

Rearranging Equation (2.3), and expand the right-hand side terms as follows:

$$\ln X_s^T - \left(f_1 \ln X_1^T + f_2 \ln X_2^T\right) = f_1 f_2 \left(\frac{T}{298}\right)\left[\sum_{j=1}^{N} c_{0,j} F_j + \sum_{j=1}^{N} c_{1,j} F_j\,(f_1 - f_2) + \sum_{j=1}^{N} c_{2,j} F_j\,(f_1 - f_2)^2\right] \quad (2.4)$$

Converting Equation (2.4) into the form of a multivariate linear model:

$$Y = \ln X_s^T - \left(f_1 \ln X_1^T + f_2 \ln X_2^T\right) = \sum_{j=1}^{N} c_{0,j}\,\alpha_j + \sum_{j=1}^{N} c_{1,j}\,\beta_j + \sum_{j=1}^{N} c_{2,j}\,\gamma_j \quad (2.5)$$

where $\alpha_j = f_1 f_2 \left(\frac{T}{298}\right) F_j$; $\;\beta_j = \alpha_j\,(f_1 - f_2)$; $\;\gamma_j = \alpha_j\,(f_1 - f_2)^2$.

$Y$ is the difference between the actual log solute solubility (mole fraction) in the solvent mixture and the sum of the products of the pure log solubility values with their respective solvent mole fractions. For the 585 data samples considered in this study, once the values of $Y$, $\alpha_j$, $\beta_j$ and $\gamma_j$ are known, any regression technique can be applied to obtain the regression coefficients $c_{i,j}$.
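As an illustration of this regression setup, the following is a minimal MATLAB sketch (not the thesis code) of forming the regressors of Equation (2.5) for a single data sample; f1, f2, T, F, x1, x2 and xs are hypothetical names for the solvent mole fractions, temperature, scaled feature row vector, pure solubilities and mixed solubility:

```matlab
% Minimal sketch: regressors of Equation (2.5) for one sample; stacking the
% rows phi over all samples gives a design matrix for any regression routine.
alpha = (f1 * f2 * T / 298) .* F;                  % alpha_j, with F a 1 x N feature row
beta  = alpha .* (f1 - f2);                        % beta_j
gamma = alpha .* (f1 - f2)^2;                      % gamma_j
phi   = [alpha, beta, gamma];                      % one 1 x 3N row of the design matrix
y     = log(xs) - (f1 * log(x1) + f2 * log(x2));   % response Y of Equation (2.5)
% With Phi (M x 3N) and Yv (M x 1) stacked over all M samples, ordinary
% least squares would give the coefficients c_{i,j}:
% c = Phi \ Yv;
```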

3.3 Feature selection

Feature selection is the process of identifying the best possible combination of features using a predefined metric. The data is divided into 5 equal-size partitions such that each partition contains a minimum of 10% of the data points from all 63 binary systems. Values of $Y$, $\alpha_j$, $\beta_j$ and $\gamma_j$ are calculated using the scaled structural features and the solubility data based on Equation (2.5). A genetic algorithm (GA) is used to select the optimal feature set; features are selected by combining K-fold (K = 5) validation with GA. The feature selection procedure is executed for five folds such that, each time, the data in four different partitions is used to obtain the regression coefficients $c_{i,j}$. The GA variables are binary ($b_j$), i.e., whether a particular feature is selected or not, and the objective is the weight-based RMSE over the whole data. Since log solubility predictions are biased towards low-solubility data samples and general solubility predictions favor high-solubility samples, a weighted objective of both predictions is considered more effective. The weighted objective for the optimization problem can be expressed as follows:

$$\min_{b_j} \left[\, 10\left(\frac{\sum_{n=1}^{N_D}\left(X_n^{orig} - X_n^{pred}\right)^2}{N_D}\right)^{0.5} + \left(\frac{\sum_{n=1}^{N_D}\left(\ln X_n^{orig} - \ln X_n^{pred}\right)^2}{N_D}\right)^{0.5} \right] \quad (2.6)$$

where $N_D$ is the number of data points and $\ln(X_n^{pred})$ can be estimated as follows:

 2  N  i  T 
ln X sT   f1 ln X 1T  f 2 ln X 2T   f1 f 2     ci , j b j Fj   f1  f 2    (2.7)
 i 0 j 1   298 
   

where $b_j$ is a binary decision variable representing whether the $j^{th}$ structural feature is selected or not. The regression coefficients $c_{i,j}$ are obtained using MLR on the training data based on Equation (2.5), using only the features for which $b_j$ is one (a minimal sketch of the weighted objective is given below). After obtaining the optimal binary variable set for each fold using GA, the final set of features is selected by a 60% majority over the folds, i.e., if a particular feature is active in at least 3 out of the 5 folds, it is selected into the final feature set. The feature selection approach explained above is employed on the consolidated drug solubility data (585 data samples) using the inbuilt ga function in MATLAB. The average and best fitness values at each generation for all the folds can be seen in Figure 3.1. The details of the optimal features obtained in each fold, along with different prediction efficacy metrics, are provided in Table 3.1.
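The following is a minimal MATLAB sketch (not the thesis code) of the weighted fitness used in the GA, assuming, per the reconstructed Equation (2.6), that the factor of 10 weights the solubility RMSE against the log-solubility RMSE; x_orig and x_pred are hypothetical vectors of actual and predicted solute mole fractions:

```matlab
% Minimal sketch of the weighted fitness: 10 x RMSE of the solubilities plus
% RMSE of the log solubilities, balancing high- and low-solubility samples.
function f = weightedObjective(x_orig, x_pred)
    rmse_lin = sqrt(mean((x_orig - x_pred).^2));            % favours high-solubility fit
    rmse_log = sqrt(mean((log(x_orig) - log(x_pred)).^2));  % favours low-solubility fit
    f = 10 * rmse_lin + rmse_log;                           % weighted combination
end
```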

In fold 1 and fold 5, the optimal solutions indicate that all features are essential for drug solubility prediction. Fold 2 has the fewest active variables, whereas variable 22 and variable 6 are inactive in fold 3 and fold 4 respectively. It is interesting to note that, in the case of fold 3 and fold 4 (bolded values), leaving out these variables does not change the optimal objective significantly. It can be observed from Figure 3.1 that for fold 1 and fold 5 the best objective value did not change throughout the generations.

Figure 3.1 Generation-wise best and average fitness values for all the folds

Table 3.1 Details of feature selection process using GA

| Fold | Features not active in optimal solution | Weighted objective (Opt. / All) | MPD, solubility (Opt. / All) | R2, solubility (Opt. / All) | R2, log solubility (Opt. / All) |
|---|---|---|---|---|---|
| 1 | - | 1.4963 / 1.4963 | 49.028 / 49.028 | 0.311 / 0.311 | 0.601 / 0.601 |
| 2 | 1, 6, 9, 12, 19, 22, 23, 24 | 1.5549 / 1.8546 | 51.166 / 49.613 | 0.345 / 0.171 | 0.545 / 0.322 |
| 3 | 22 | 1.5272 / 1.5281 | 51.610 / 51.483 | 0.289 / 0.289 | 0.582 / 0.581 |
| 4 | 6 | 1.5094 / 1.5096 | 50.223 / 50.773 | 0.326 / 0.297 | 0.586 / 0.594 |
| 5 | - | 1.4883 / 1.4883 | 50.899 / 50.899 | 0.336 / 0.336 | 0.600 / 0.600 |

* Tuned parameters: ConstraintTolerance and FunctionTolerance set to 0, and an initial guess with all variables set to 1 (i.e., active) was provided. The remaining parameters were left at the MATLAB defaults.

Since a solution with all variables active (the solution coding corresponding to all variables equal to 1) is supplied as the initial guess at the zeroth generation, this means that all variables are important for these two folds. Though GA terminated at different generations for different folds, the reason for termination in all folds is that the optimal solution did not change for 50 consecutive generations. The R2 values of the general and log solubility predictions show that the optimal solutions are more efficient than the models consisting of all features. On the other hand, the MPD values for the models obtained by considering all features are better than the MPD values obtained for the optimal solutions. This is because the MPD metric is sensitive to the predictions of data samples with low solubility values. From the results obtained, it is evident that all features can be retained for further investigation of solubility prediction using the modified Jouyban-Acree model.

3.4 Single model approximations

In this section, a single linear model, i.e., the modified version of the Jouyban-Acree model given in Equation (2.3), is investigated for predicting drug solubility in binary solvent systems through two approaches. The first approach is to obtain model coefficients using ordinary least

squares (OLS) and the second approach is to use a weight-based optimization approach (WBO)

to obtain model coefficients. For this case study, the data is divided into five equal size

partitions such that all partitions contain a minimum of 10% data points from all 63 binary

systems. The efficacy of single model assumptions is validated using K-fold validation, for a

K value of 5. This validation procedure repeats for five (K) times, each time the data in four

(K-1) different partitions are used to obtain model coefficients whereas the data in the

remaining partition is used to test the efficacy of obtained coefficients. Once this procedure is

completed, the average of coefficients over K folds are obtained and reported as final model

coefficients. The weight-based optimization is carried out for the objective specified in

48
Equation (2.8) using quasi-newton algorithm (fminunc solver in MATLAB) with an initial

guess of zero for all variables. The objective of weighted optimization is as follows:

  
 
0.5

  NTrD  X n  X n   n   n 
 NTrD ln X orig  ln X pred 
0.5

2
pred 2

orig

min  10     
 (2.8)
cij  NTrD   n 1 NTrD 
  n 1    
 

where $c_{i,j}$ are the variables of the optimization problem and $N_{TrD}$ is the number of data points in the training dataset. The logarithmic solubility of the solute is estimated using Equation (2.3). Various metrics for testing the efficacy of the model coefficients obtained using both approaches are provided in Table 3.2. The drug solubility values obtained using the model coefficients averaged over all folds in both approaches are plotted along with the log solubility predictions; parity plots for both types of predictions can be seen in Figure 3.2.

Figure 3.2 Parity plots for general and log solubility predictions using both single model
approaches

3.5 Multiple model approximation

The reason for the poor predictions of the linear models might be that a single model cannot capture the behavior of a wide variety of systems. There might be characteristics that group systems together, in which case it might not be possible to identify a single global model for all systems. To test this hypothesis and develop a model of higher fidelity, in this

section, logarithmic solubility of solute is assumed to be piecewise linearly dependent on

structural features i.e. the operating model will be different but linear in different regions of

feature space. Identification of models of this form is popularly referred to as multiple model

learning (MML). Consider the example depicted in Figure 3.3. From the figure, it can be seen

that the output can be characterized using linear relationships that are different in different

regions of input space. Identifying a single linear model throughout the input space will result

in poor approximation. MML has the possibility of improving the prediction accuracy if the

approach automatically identifies the different linearly operating models (four in this case) and

their operating regimes.

Figure 3.3 Multiple linear models underlying different input partitions of the data

A recently proposed MML approach based on prediction error based clustering [78], [83] is explored in this work. To examine the robustness of the piecewise linear models, a

two-layer testing approach is used. The data is divided into five equal size partitions such that

all partitions contain a minimum of 10% data points from all 63 binary systems. The data in

four partitions are used to identify linear models using K-fold validation approach. The leftover

data is set aside as the global test set and used in the second layer testing. In each fold, data in

three (K-1) different partitions are used to obtain multiple linear models and data in the

remaining partition (K-fold test set) is used to test the obtained models in the first layer testing.

Unlike in a single model scenario, predicting the output of a new data point using multiple

models is not trivial. K nearest neighbors (KNN) is the most frequently used testing approach

to find which model should be used to predict the output of a new data point. Since a prediction

error (PE) based clustering approach is used to identify multiple models, a new strategy is

proposed to identify a suitable model to predict the output of a new data sample. The details of

the clustering approach along with the proposed testing strategy are provided below.

Kuppuraj and Rengaswamy [78] proposed a prediction error based fuzzy clustering approach to identify the underlying multiple linear models in any given data. The advantage of prediction error based approaches over traditional Euclidean distance based clustering approaches is that samples are grouped based on their response in the output variable, thus reducing misclassifications at the boundaries. For the sake of brevity, only the steps of the algorithm are provided below; more insights into the PE based algorithm can be obtained from our previous work [78]. The objective of the PE based clustering algorithm for grouping M data samples into N models is as follows:

1 N M q 2
f     ij y j  Ci x j 2 
2 i 1  j 1
(2.9)

51
Where y j , x j are the response (output) and features vector (input variables) of sample j

respectively. Ci represents the model parameters vector of cluster i such that yˆ j  Ci x j .

1. Initialize N models with different parameter values. Each cluster represents a model

with different parameter values.

2. Calculate the prediction error of sample j with respect to cluster i as follows:

PEij  y j  Ci x j (2.10)
2

3. Compute membership of sample j to cluster i as follows:

1 i  1,2, N ; j  1,2, M ;
ij  ; (2.11)
N  PEij 
2
q 1 N  clusters, M  samples
 
k 1  PEkj


4. Update cluster centers (model parameters) as follows:

Cir 1  Cir   r g r (2.12)

It is a line search optimization, where search direction can be estimated as follows:

f  M q 
g r     ij  y j  Cir x j  xTj  (2.13)
Ci  j 1 

The step length of the search can be estimated as follows:

   g x   y  Cir x j 
M N T
q r
ij j j
j 1 i 1
r  (2.14)
   g x   g x 
M N T
q r r
ij j j
j 1 i 1

5. Calculate the new prediction errors.

6. Calculate the root mean square error (RMSE), where $b$ denotes the best-fitting model for sample $j$:

$$RMSE = \sqrt{\frac{\sum_{j=1}^{M} PE_{bj}^2}{M}} \quad (2.15)$$

7. If a termination criterion is met (RMSE below a predefined limit, or the number of iterations exceeds the limit), go to the next step; else go to step 3.

8. Merge like models based on a cosine angle metric and obtain the new model parameters using OLS. Finally, report the final models and input partitions. The cosine angle metric between clusters $i$ and $k$ is

$$\theta_{ik} = \cos^{-1}\left(\frac{C_i C_k^T}{\left\|C_i\right\|\left\|C_k\right\|}\right) \quad (2.16)$$
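As referenced in step 3, the following is a minimal MATLAB sketch (not the thesis code) of the prediction-error and membership computations in steps 2-3; X, y, C and q are hypothetical names for the input matrix (M x p), output vector (M x 1), current model parameters (N x p) and fuzzifier:

```matlab
% Minimal sketch of steps 2-3: prediction errors and fuzzy memberships.
M  = size(X, 1);                        % number of samples
N  = size(C, 1);                        % number of candidate models
PE = zeros(N, M);
for i = 1:N
    PE(i, :) = abs(y' - C(i, :) * X');  % PE_ij = |y_j - C_i x_j|, Equation (2.10)
end
U = zeros(N, M);
for j = 1:M
    r = PE(:, j) .^ (2 / (q - 1));      % Equation (2.11)
    U(:, j) = (1 ./ r) / sum(1 ./ r);   % memberships of sample j to all clusters
end
```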

Testing strategy:

Traditional clustering algorithms operate based on Euclidean distance; hence, KNN is a suitable approach to identify an appropriate model for a new (test) data sample. In KNN, first, the Euclidean distances from the test sample to all samples in the training data are evaluated. Then, the K samples nearest to the test sample and their corresponding models are identified, and the model containing the highest number of samples among these K nearest samples is considered the suitable model for the test sample. The motivation behind prediction error based clustering, however, is to accurately classify data samples that are close in variable space but are characterized by different models; in such cases, finding a suitable model for a test sample using traditional KNN may not be effective. In the proposed strategy, we incorporate the prediction error along with KNN. First, the Euclidean distances from the test sample to all samples in the training data are evaluated. Next, the K nearest samples to the test sample are identified. Then, the prediction errors of these K nearest samples with respect to each model are computed and averaged per model over the K samples. The model with the least averaged prediction error (over the K nearest samples) is considered the most suitable model for the test sample, as sketched below.
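The following is a minimal MATLAB sketch (not the thesis code) of this proposed testing strategy; Xtr, ytr, C, xnew and K are hypothetical names, and implicit expansion (MATLAB R2016b or later) is assumed:

```matlab
% Minimal sketch: pick the model with the least prediction error averaged
% over the K training samples nearest to the test sample.
d         = sqrt(sum((Xtr - xnew).^2, 2));            % Euclidean distances (Mtr x 1)
[~, ord]  = sort(d);
nn        = ord(1:K);                                 % K nearest neighbours
avgPE     = mean(abs(ytr(nn) - Xtr(nn, :) * C'), 1);  % mean PE per model (1 x N)
[~, best] = min(avgPE);                               % index of the suitable model
```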

The proposed testing strategy is illustrated with an example of a single-input single-output system, which comprises two different linear models in the operating region. It can be observed from Figure 3.4 that there exist two models in the plotted data. All the blue data points belong to model 1, which is represented by the blue line, whereas the red data points and red line represent model 2. If the traditional KNN approach with K = 3 is used to obtain the suitable model for a new test sample with input value 5.3 (shown as a black diamond), model 2 (red line) will be selected as the suitable model, since two of the three nearest samples (S1, S2 and S3) belong to model 2. If the proposed testing strategy is used, the prediction errors for the three nearest samples are first calculated using both models. The average prediction error for each model is then evaluated, and the model with the least average prediction error is identified as the suitable model. From Figure 3.4, it can be observed that model 1 has the least mean prediction error; hence, it is the suitable model for the new test sample with input value 5.3.

Drug solubility estimation using multiple models:

The prediction error based clustering algorithm works on the principle that piecewise linear models of the form $\hat{y} = C_i x$ characterize the data in different input regions. Hence, once the partitions are identified, the final model coefficients are obtained using OLS. As discussed earlier, in the case of single model approximations, the linear model obtained using OLS favors low-solubility samples; hence, a weighted objective was incorporated into the PE based clustering to obtain models without bias towards any particular range of solubility samples. In the case of single model approximations, it is also observed that replacing 20% of the data for each new fold results in a significant deviation in the model parameters. In the case of multiple models, these deviations can result in ambiguity when the final set of models is reported. To address these two issues, the PE based clustering approach is modified such that the final models are obtained using a weight-based optimization.

Figure 3.4 Prediction error based KNN strategy to identify a suitable model for a test molecule
Step 1 of the PE algorithm is modified such that the models are initialized with the final model parameters obtained in the previous fold. To incorporate the weighted objective, the final models in step 8 are obtained using optimization instead of OLS. This optimization is carried out using a quasi-Newton algorithm (the fminunc solver in MATLAB) for the objective specified in Equation (2.8). The final parameters obtained in the previous fold are used as the initial guess in the current fold so that the models obtained in all the folds are consistent. This clustering procedure is repeated until the models over all folds converge within a predefined similarity metric. The similarity metric is explained in detail with a simple example. Assume that there exist N piecewise linear models in the data, and that model parameters are obtained in all K folds. First, compute the cosine angle metric given in Equation (2.16) for each model with respect to the models in the remaining folds. For model i in the k-th fold, find the minimum angle with respect to the N models of any other fold, and repeat this until the minimum angles with respect to all the other folds are obtained for that model. Then identify the maximum angle $\theta_{\langle i,k \rangle}$ among the $K-1$ minimum angles obtained for model i of fold k. Repeating the above procedure for all the models in all the folds, the maximum angle among the angles $\theta_{\langle i,k \rangle}$ is defined as the similarity metric ($\theta$), denoted as follows:

$$\theta = \max\, \theta_{\langle i,k \rangle}; \qquad i \in \{1,\ldots,N\};\; k \in \{1,\ldots,K\} \quad (2.17)$$

$$\text{where } \theta_{\langle i,k \rangle} = \max_{k'} \left( \min_{j}\, \theta_{k'}^{\langle i,j \rangle} \right); \qquad j \in \{1,\ldots,N\};\; k' \in \{1,\ldots,K\},\; k' \neq k \quad (2.18)$$
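The following is a minimal MATLAB sketch (not the thesis code) of computing the similarity metric of Equations (2.17)-(2.18); Cfold is a hypothetical cell array holding the N x p model parameter matrix of each fold:

```matlab
% Minimal sketch of Equations (2.17)-(2.18): for each model i of fold k, take
% the minimum cosine angle to the models of every other fold, then the maximum
% of these minima over all models, folds and fold pairs.
angleOf = @(u, v) acosd((u * v') / (norm(u) * norm(v)));   % Equation (2.16), degrees
K = numel(Cfold);  N = size(Cfold{1}, 1);
theta = 0;
for k = 1:K
    for i = 1:N
        for kp = [1:k-1, k+1:K]                            % every other fold
            ang   = arrayfun(@(j) angleOf(Cfold{k}(i,:), Cfold{kp}(j,:)), 1:N);
            theta = max(theta, min(ang));
        end
    end
end
% Convergence is declared when theta falls below the preset tolerance (10 degrees).
```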

The flow chart of the modified PE based algorithm is provided in Figure 3.5. For fold 1 in phase 1, since no previous fold exists, the models are initialized with random parameters, and in step 8 the optimization is initiated with all variables set to zero as the initial guess. The drug solubility prediction is carried out assuming that two piecewise linear models exist in the data, and the fuzzifier q is set to 1.5. The maximum number of iterations for the PE based clustering algorithm is set to 1000, whereas the tolerance for the similarity metric ($\theta$) is set to 10°. The models in each fold are tested on two test data sets, i.e., the K-fold test set and the global test set. The suitable model for a data sample in the test set is identified using the proposed testing strategy with K = 5, i.e., the model with the least averaged prediction error over the five samples nearest to the test sample is considered the most suitable model for that test sample. Similar models are identified across all the folds and are averaged for testing on the global test set.

The data separated out for K-fold validation is associated with the averaged models based

on the prediction errors, so that any new sample can use these data samples as neighbors for

selecting a suitable model. This final pair of models can be considered as the best models to

predict drug solubility in binary solvent systems irrespective of the temperature and

components involved in the system. The drug solubility profiles estimated using these model

coefficients for two different systems at different temperatures are plotted in Figure 3.7.

Solubility profiles estimated using the final models obtained from the single model

approximations are also included in this figure to show the efficacy of multiple models

explicitly. The details of prediction accuracy using various metrics for the models obtained in

the final phase are reported in Table 3.3. It can be observed from the table that the R2 values

for both general and log solubility predictions in all the folds are significantly improved using

multiple model approximations. It is interesting to note that the R2 values for both general and log solubility predictions are similar for all data sets in all the folds, thus showing the effectiveness of the weighted objective approach, i.e., there is no bias towards any category of samples, unlike the OLS approach.

The two-layer testing statistics prove the robustness of the obtained models. The solubility prediction MPD values of the K-fold data and the global test data set using the averaged-parameter (AP) models are significantly low, underlining the fact that this final pair of models can be used for drug solubility prediction in any binary solvent system. It is evident from all three efficacy metrics for the K-fold test data (K-test) that the multiple models underperform in fold 3 (bolded values), though they are still much better than any single-model metric. This can be attributed to a few misclassifications of low-solubility samples in the testing phase; removing these outliers improves the performance of the multiple models.

To benchmark the performance of the multiple model approach with other popular

techniques, we trained a neural network (NN) to identify non-linear behavior. The neural

network was trained using a Levenberg-Marquardt backpropagation training algorithm for 3

different configurations, i.e., hidden layer sizes of 1, 2 and 5. The data was

divided into five equal size partitions such that all partitions contain a minimum of 10% data

points from all 63 binary systems. The data in four partitions are used to train the neural

network, whereas the remaining data used for testing. The network is trained using 'trainlm'

function available in MATLAB neural network toolbox. The efficacy metrics of three different

NN configurations are as follows – MPD values are 35.495, 36.296, and 44.001; R2 metric

values for solubility are 0.597, 0.900, and 0.838; R2 metric values for log solubility are 0.888,

0.814, and 0.884 correspondingly for hidden layer size of 1, 2 and 5. The superior performance

of the NN approach compared to single linear model approaches can be attributed to the ability

of neural networks to identify non-linear behavior. Though the performance of the NN approach is better than the single model approaches, it still significantly underperforms compared to the multiple linear model approach. To test the efficacy of the proposed approach, the

MPD values obtained are compared to the MPD values reported in the existing approaches

[61], [84]–[86] in Table 3.4.

Table 3.2 Various efficacy metrics of obtained models using both single model approaches

OLS model:

| Fold | MPD, solubility (Train / Test) | R2, solubility (Train / Test) | R2, log s (Train / Test) |
|---|---|---|---|
| 1 | 48.800 / 49.937 | 0.346 / 0.129 | 0.615 / 0.538 |
| 2 | 48.518 / 53.991 | 0.199 / 0.083 | 0.651 / -1.07 |
| 3 | 50.210 / 56.573 | 0.288 / 0.286 | 0.603 / 0.496 |
| 4 | 49.685 / 55.125 | 0.291 / 0.309 | 0.603 / 0.556 |
| 5 | 50.659 / 51.860 | 0.350 / 0.229 | 0.621 / 0.518 |
| AP | 49.3347 | 0.3291 | 0.5963 |

Optimization model:

| Fold | MPD, solubility (Train / Test) | R2, solubility (Train / Test) | R2, log s (Train / Test) |
|---|---|---|---|
| 1 | 50.721 / 52.115 | 0.590 / 0.372 | 0.584 / 0.522 |
| 2 | 56.886 / 60.708 | 0.754 / 0.628 | 0.575 / -1.17 |
| 3 | 56.714 / 64.440 | 0.721 / 0.471 | 0.533 / 0.380 |
| 4 | 55.331 / 59.399 | 0.700 / 0.680 | 0.517 / 0.444 |
| 5 | 54.164 / 56.417 | 0.656 / 0.442 | 0.568 / 0.455 |
| AP | 54.3691 | 0.6472 | 0.5384 |

Table 3.3 Various efficacy metrics of multiple models obtained using the modified PE approach

| Fold | MPD, solubility (Train / K-test / G-test) | R2, solubility (Train / K-test / G-test) | R2, log s (Train / K-test / G-test) |
|---|---|---|---|
| 1 | 7.426 / 22.255 / 17.327 | 0.996 / 0.985 / 0.969 | 0.993 / 0.904 / 0.927 |
| 2 | 7.440 / 76.034 / 18.354 | 0.995 / 0.960 / 0.978 | 0.990 / 0.908 / 0.940 |
| 3 | 7.590 / 161.670 / 19.086 | 0.998 / 0.905 / 0.987 | 0.991 / 0.872 / 0.924 |
| 4 | 6.966 / 31.395 / 48.825 | 0.997 / 0.953 / 0.958 | 0.993 / 0.941 / 0.890 |
| AP | 8.070 / 18.113 | 0.9947 / 0.980 | 0.991 / 0.938 |

(The AP rows report the metrics of the averaged-parameter models; in Table 3.3 the two AP values per metric correspond to the K-fold data and the global test set respectively.)

Figure 3.5 Modified PE based clustering algorithm for drug solubility predictions

The reported MPD values of the individual systems include the pure solubility samples for a rational comparison with the literature models. In the case of naproxen in ethanol-water at 298 K, 3 out of the 11 samples have solubility values less than 0.0001, and hence these samples are excluded while reporting the MPD values using the proposed approach. A similar policy was used while reporting the MPD values for naproxen in ethanol-water at 303 K and for propyl p-hydroxybenzoate in PG-water at 300 K. The Yalkowsky equation (model a) is a zero-parameter model, which is a linear combination (in terms of compositions) of the pure solubility values of a drug in both solvents. Models b and c correlate solubility parameters to drug solubility, whereas models d and e (the proposed multiple models) are two QSPR based approaches with different structural features. Systems with high MPD values (>1000 in the case of the Yalkowsky equation) indicate either non-linear solute interactions or the presence of low solubility values (0.0001 to 0.001) in the data of that binary system. It is interesting to note that the MPD values of sulfanilamide in ethanol and water corresponding to models b and c are poorer than those of the Yalkowsky equation. We can conclude from Table 3.4 that, except for acetaminophen (paracetamol) in the ethanol-water system, the MPD values obtained using the proposed approach are significantly better than those of the other models.

Figure 3.6 MPD values of all 63 binary systems obtained using multiple models approach

Table 3.4 MPD metrics of various water + cosolvent systems using several approaches

Solute                     Cosolvent   T(K)   Nd   MPDa     MPDb   MPDc    MPDd   MPDe
Acetaminophen              PG          293    11   115.9    66.9   -       -      22.85
Acetaminophen              PG          303    11   98.37    56.3   -       5.4    2.80
Acetaminophen              Ethanol     298    13   44.78    7.2    18.73   26.8   16.90
Caffeine                   Ethanol     298    11   60.43    41.6   -       21.1   8.22
Caffeine                   Ethanol     308    11   64.27    50.7   -       -      6.06
Naproxen                   Ethanol     298    11   1803.2   23.4   -       -      7.34(8)
Naproxen                   Ethanol     303    11   1788.2   20.5   -       -      7.40(9)
Sulfanilamide              Ethanol     298    12   31.24    43.8   31.95   18.2   6.27
Caffeine                   Dioxane     298    16   60.19    -      21.85   -      3.77
Methyl p-hydroxybenzoate   PG          300    11   920.94   -      18.16   14.4   14.07
Methyl p-aminobenzoate     PG          300    11   669.15   -      15.78   16.5   9.02
Ethyl p-hydroxybenzoate    PG          300    11   1991.0   -      18.71   9.4    9.17
Ethyl p-aminobenzoate      PG          300    11   1416.5   -      9.60    18.2   8.71
Propyl p-hydroxybenzoate   PG          300    11   4739.0   -      19.31   14.1   8.04(10)
Propyl p-aminobenzoate     PG          300    11   4021.8   -      9.62    22.6   4.89

(a) Yalkowsky equation [61]; (b) solubility prediction using partial solubility parameters [84]; (c) solubility prediction using an artificial neural network [85]; (d) Jouyban-Acree model, effect of solute structure [86]; (e) multiple models approach.

Figure 3.7 Solubility profiles of two distinct binary systems at various temperatures

The MPD values of all 63 individual systems (if the data for a system is obtained at two different temperatures, the MPD value is calculated over all the data samples together) are reported in Figure 3.6. The MPD values obtained range from 0.72% to 31.03% with a standard deviation of 6.45%. The minimum MPD value corresponds to system 24 (benzoic acid in CCl4 and n-heptane), whereas the maximum corresponds to system 61 (sulfamethazine in water and ethanol). Only 4 out of 63 systems have MPD values greater than 20%, whereas 51 systems have MPD values less than 10%. These observations demonstrate the ability of the proposed approach to obtain a generalized model for various binary solvent systems.

The solubility profiles shown in Figure 3.7 belong to two different systems obtained at various temperatures. The experimental solubility curves for system 1 (Acetanilide – Dioxane – Water) are irregular owing to the noise in the solubility estimations, whereas the predicted solubility curves are smooth, reflecting the theoretical behavior of the solvent interactions in the system. The solubility profiles for system 2 (Caffeine – Ethyl acetate – Ethanol) show a clear dependency of solubility on temperature, and the proposed multiple model (MM) approach captures this temperature dependency effectively. It is interesting to note that the predictions of the MM approach are very accurate even though the magnitudes of the experimental solubility fractions differ by orders of magnitude between the two systems. It can be concluded from the solubility profiles that the multiple models are far superior to the models available in the literature for the prediction of drug solubilities. This can also be seen in the substantial improvement in the R2 values between these approaches (Tables 3.2 and 3.3).

3.6 Conclusion

In this study, a QSPR based approach using multiple linear models is examined to predict

drug solubility. Drug solubility is assumed to behave in a piecewise linear fashion in different partitions of the structural features. The temperature term in the Jouyban-Acree model is normalized with room temperature to avoid the influence of the temperature magnitude on the model coefficients. For this QSPR approach, various structural features of the solute and both solvents are selected and processed through feature selection using GA. The log solubility values are initially assumed to be linearly dependent on the structural features, and the model coefficients are obtained using OLS and a weight-based optimization approach. The weight-based optimization approach is shown to be better than the OLS approach; however, neither model is of high enough fidelity. The log solubility values are then assumed to be piecewise linearly dependent on the structural features, and the model coefficients are identified using a modified PE based clustering algorithm. A new testing strategy is also proposed to identify a suitable model for test samples in the case of PE based clustering approaches. The prediction efficacy of the final pair of models is tested on a global test set. The MPD and R2 values demonstrate that the final set of models can be used to predict the solubility of drugs irrespective of the solute, the solvents involved and the temperature of the system.

In this chapter, we tested the efficacy of multiple model learning in identifying the non-linear behavior between the feature set and the property of interest, i.e. solubility. Initially, GA is used to identify the significant feature set. It is observed in the feature selection phase, carried out using K-fold validation, that not all the features are significant in all the partitions of the data. A single set of features can be selected as important in a single-model case, whereas doing so in a multiple modeling framework might be suboptimal. In the case of multiple models, where each partition contains a different linear model, it is beneficial to identify the appropriate significant features for each partition. To address this issue, we propose, in the next chapter, modifications to the existing prediction error based clustering approach so as to obtain the number of underlying models and their orders in a single framework.

CHAPTER 4

Prediction error based fuzzy clustering approach using statistical

analysis for piecewise linear model identification

Several engineering systems can be modeled using a multiple model framework, where specific models describe the system in different regions that are defined by input partitions. Though this multiple model framework has been applied under different names in different fields, i.e., linear parameter varying systems, operating regime based models, multiple model estimation, piecewise models, local regression models, etc., the working principles are generally the same [87]. Multiple model learning (MML) is a procedure for estimating the input partitions and the corresponding model parameters in each partition. MML has been applied in numerous areas such as image segmentation in computer vision [88], image processing, pattern recognition [89], financial investing and home insurance [90]. In chemical engineering, MML has been applied to the control of distillation columns [91], fermenters [92], solar power plants [93] and fluid catalytic cracking units [94]. MML problems can be either static or dynamic. Further, models can be segregated based on linearity or non-linearity in the parameters. A recent two-part review [79], [80] provides comprehensive coverage of the multiple model approaches for the modeling and identification of complex systems. The review addresses the different approaches used to identify the input partitions, the internal model structure, and the parameter estimation; it also highlights the challenges in the development of multiple model identification approaches and their application in different fields.

There exist three varieties of linear MML problems, as shown in Figure 4.1. Primary level MML problems are those where the number of true models and the input partitions are known. These problems can be solved by applying ordinary least squares (OLS) to the data points belonging to each partition. Secondary level MML problems are those where the number of true models is known but the input partitions are unknown [95], [96]. These problems can be solved by a two-step approach: first obtaining the input partitions by a suitable clustering technique and then applying OLS to compute the model parameters in each partition. Tertiary and advanced level MML problems are particularly challenging as neither the number of true models nor the input partitions are known [97], [98]. There is, however, some work attempting to solve this class of problems, starting with a sufficiently high number of models and merging them subsequently [97], [98].

[Figure: classification of MML problems into Primary (number of models and input partitions known), Secondary (number of models known, input partitions unknown) and Tertiary (number of models and input partitions unknown).]

Figure 4.1 Multiple Model Learning Problem Classification

In the case of static multiple linear models, since the input variables might significantly influence the output variables only in some partitions of the input space, true models of different orders can exist. While solving static MML problems, the orders of the models are, in general, considered to be known and equal. In the case of dynamic MML problems, predicting the true model orders (ny, nu) is not a trivial task. Hence, to solve such problems, both the number of models and their orders are assumed to be known [97], [98]. In the absence of this information, as would generally be the case, the estimation of both the number of models and their orders becomes a challenging task. In this chapter, we focus on tertiary problems, which we solve using an iterative approach consisting of prediction error based clustering and statistical significance testing, without any assumptions on the number of underlying models or their orders.

4.1 Literature review

In this section, we briefly explain the evolution of clusterwise regression and the various algorithms proposed to solve static and dynamic (PWARX) multiple model problems. Multiple model learning (MML) problems have been in focus for the past few decades. Early investigations of MML were carried out in the form of clusterwise regression [99], [95], [96], [100], [101], [102]. Spath [96] introduced clusterwise regression and proposed an algorithm to obtain input partitions and calculate the corresponding model parameters when the number of models is known. The time complexity of this algorithm was improved in a subsequent article [100]. DeSarbo and Cron [99] proposed a conditional mixture maximum likelihood methodology to solve clusterwise regression problems. Since clusterwise linear regression problems can be treated as combinatorial optimization problems, DeSarbo et al. [95] proposed a simulated annealing based approach to solve them. Hennig [101] developed three types of models for clusterwise linear regression, namely the fixed partition model and finite mixture models with fixed and random regressors. In a follow-up paper [102], the identifiability of the clusterwise parameters was studied. Wedel and Kistemaker [103] applied a generalized clusterwise linear regression method to solve a benefit segmentation problem, using a Monte-Carlo approach to test the significance of the obtained cluster parameters. Frigui and Krishnapuram [88] proposed a robust competitive agglomeration (RCA) based clustering approach to solve multiple model general linear regression (MMGLR) problems; they used the least squares prediction error as the measure for clustering instead of the standard c-means distance. Clusterwise linear regression for a continuous stochastic process was studied by Preda and Saporta [104].

Cherkassky and Ma [105] proposed an iterative framework to solve MML problems assuming that the majority of the data samples are generated by one dominant model. In the first step, a dominant model is estimated from the entire data and the corresponding data points are separated out. Another dominant model is then estimated from the residual data points, and its corresponding data points are separated. This procedure is repeated until a predefined stopping criterion is satisfied. The model parameters are estimated using support vector machine (SVM) based regression. Bezdek et al. [106] generalized fuzzy c-means clustering algorithms to linear models and proposed an iterative approach (fuzzy c-lines) to identify multiple linear models, along with a theoretical proof of convergence. Dufrenois and Hamad [107] proposed a new approach to simultaneously estimate multiple linear models based on support vector regression (SVR); in their formulation, fuzzy weights are assigned to all data points and the weight updates mimic the standard fuzzy c-means algorithm. Elfelly et al. [108] proposed a two-step procedure for the identification of multiple models: in the first step, the number of underlying models is predicted using neural networks with rival penalized competitive learning; in the second step, the model orders and their parameters are estimated using both K-means and fuzzy K-means clustering algorithms. Their approach was applied to two nonlinear data sets as validation exercises.

MML for dynamic systems is of interest in various real-life applications [109], [110]. Most studies in the literature consider piece-wise autoregressive exogenous (PWARX) models as benchmark problems for dynamic MML. Ferrari-Trecate et al. [97] proposed a clustering based algorithm to solve PWARX problems assuming the number of models and their orders are known. The algorithm initially identifies local data sets (LDs), and the parameter vectors of all LDs are then calculated using least squares. K-means clustering is used to group the parameter vectors into as many distinct models as assumed, and bijective maps are used to obtain the input partitions. Nakada et al. [98] proposed a Gaussian mixture model to recognize PWARX models, again assuming the number of models and their orders; the parameters of the identified models are calculated using least squares. Kuppuraj and Rengaswamy [78] proposed a prediction error (PE) based approach to obtain the input partitions and model parameters simultaneously. Support vector classifiers are used in the above studies [78], [98] to obtain the boundary hyperplanes between adjacent clusters. PWARX models can also be identified using lifting techniques [111]–[113]. Rodolfo et al. [114] proposed an approach to identify non-linear dynamics using multiple heterogeneous models (i.e., models of different orders).

The procedure of evaluating clustering results, such as the partitioning of the input variable space and the number of models obtained, is known as cluster validation. For example, while solving well-known classified data sets, if the algorithm can find the number of underlying clusters and classify the data points exactly as per the pre-specified boundaries, then the proposed algorithm is considered robust. Clustering results depend upon the input parameters, the random initialization of the clusters and the clustering approach. Cluster validation is classified into three categories: external, internal and relative validation [115]. External validation compares clustering results with known models, whereas internal validation tests the results against predefined indices. Relative validation compares the results with those of a different clustering scheme. Rendon et al. [116] reviewed different external and internal validity procedures. Wu et al. [117] used a cluster validity index as a fitness measure for offspring evaluation in a genetic algorithm framework to solve the feature selection problem.

Several studies examine clustering performance with statistical testing, but no attempt has been made to use statistical testing in combination with a clustering approach to find the true model orders. In this chapter, the f-test [118] is incorporated into a clustering approach to obtain the true model orders, in an iterative manner, by testing the significance of the parameters of each model. This chapter is organized as follows. While a brief literature review on MML was provided in section 1, in section 2 we propose an iterative clustering method with statistical testing that removes insignificant input variables to obtain the true model orders. In sections 3 and 4, we report the clustering results obtained using the proposed approach on different benchmark problems. In section 5, we demonstrate the efficacy of the proposed approach on two relevant engineering problems. We conclude this chapter with a balanced discussion of the merits and demerits in the last section.

4.2 PE based fuzzy clustering with statistical significance testing

The objective of this work is to estimate the input partitions, true model orders and model parameters for data generated using multiple linear models of different orders. In the first phase, the original data is classified into clusters using the FMC approach [78]. The key difference between standard clustering approaches and the FMC approach is that, while standard clustering approaches use a Euclidean distance metric, the FMC approach uses the prediction error as the distance metric. This makes it possible to adapt clustering approaches to the multiple model learning problem. Viewed another way, while standard clustering approaches work in the data space, the FMC approach works in an abstract parameter space, where the number of clusters equals the number of models that describe the system of interest.

In the second phase of the algorithm proposed in this chapter, a statistical significance test is carried out on the parameters of each model. Insignificant variables are removed, thus reducing the order of that model. This two-step procedure is iterated upon until the clustering procedure with the modified model orders contains only significant variables. Using this approach, it becomes possible to identify multiple models with different orders efficiently through the integration of clustering and statistical testing. This makes it possible to solve MML problems where the number of partitions, the input space partitions, and the models in each of these partitions are not known. We describe the proposed algorithm in detail below.

Phase 1: FMC Algorithm - Clustering for the identification of input partitions and draft models

Summary: FMC estimates different models of the form $y = Cx$ in the input space, where $C$ contains the model parameters in an input partition. In the initialization phase of the algorithm, the dataset is randomly segregated into a pre-specified number (N) of clusters. The initial model parameters are calculated using OLS regression on the data points corresponding to each model. In the clustering phase, the prediction error and the membership value of each data sample with respect to each of the clusters are calculated, and the model parameters of each cluster are updated using an update algorithm. This procedure is repeated until any one of the specified convergence criteria is satisfied. In the model rationalization phase, the similarity between each pair of estimated models is calculated using a cosine-similarity measure. If the angle between any two models is less than a specified threshold (5°), the data points used for the estimation of the two models are combined and new model parameters are estimated. Models with few data points are discarded and their points reassigned to the clusters that suit them best.

1. Initialize N random models with different parameter values.

2. Compute the initial prediction errors using Equation (2.10).

3. Compute the membership values using Equation (2.11).

4. Update the cluster centers using any one of the following update algorithms:

Algorithm I: $C_i^{r+1} = \left( \sum_{j=1}^{M} \mu_{ij}^{q} y_j x_j^T \right) \left( \sum_{j=1}^{M} \mu_{ij}^{q} x_j x_j^T \right)^{-1}$  (3.1)

Algorithm II: $C_i^{r+1} = C_i^r + \lambda^r g^r$  (3.2)

5. Compute the new prediction errors.

6. Calculate the root mean square error using Equation (2.15).

7. Terminate based on a criterion (RMSE less than a predefined limit, or the number of iterations exceeding the limit) and go to the next step; else go to step 3.

8. Merge like models based on the cosine angle metric defined in Equation (2.16) and obtain new model parameters using OLS. Finally, report the final models and input partitions. (A sketch of one clustering iteration is given below.)
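The following MATLAB sketch illustrates one clustering pass (steps 2 to 5) under stated assumptions: the membership update of Equation (2.11) is taken to be the standard fuzzy c-means form with the squared prediction error in place of the Euclidean distance, and Algorithm I (Equation 3.1) is used for the center update; all variable names are illustrative.

% Hedged sketch: one FMC iteration with the Algorithm I update.
% X - M x d regressor matrix (row j is x_j'), y - M x 1 output vector
% C - N x d matrix whose row i holds the parameters C_i, q - fuzzifier
[M, d] = size(X);  N = size(C, 1);
E = (repmat(y', N, 1) - C * X').^2 + eps;     % N x M squared prediction errors
U = zeros(N, M);
for j = 1:M                                   % assumed fuzzy c-means memberships
    U(:, j) = 1 ./ sum((E(:, j) ./ E(:, j)').^(1/(q-1)), 2);
end
for i = 1:N                                   % Algorithm I update, Eq. (3.1)
    w = U(i, :)'.^q;                          % weights mu_ij^q
    C(i, :) = (((X .* w)' * X) \ ((X .* w)' * y))';  % weighted least squares
end
rmse = sqrt(mean(min(E, [], 1)));             % RMSE w.r.t. the best-fit model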

Phase 2: Statistical significance testing of variables (F-test):

The significance of a single variable or of a set of variables can be tested by computing the increase in the regression sum of squares that results from the addition of that set of variables to the existing variables. The test works on the hypothesis that, if the computed $f_0$ value is less than the distribution value $f_{\alpha, r, n-p}$, then that variable set is insignificant. To test this hypothesis, we need to calculate the increment in the regression sum of squares due to the addition of the set of variables to the model, and the residual mean square error of the model.

The increment in the sum of squares added by variable $j$ to the model $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_j X_j + \dots + \beta_k X_k$ is

$SS_R\left(\beta_j \mid \beta_0, \beta_1, \dots, \beta_{j-1}, \beta_{j+1}, \dots, \beta_k\right) = SS_R\left(\beta_0, \dots, \beta_j, \dots, \beta_k\right) - SS_R\left(\beta_0, \dots, \beta_{j-1}, \beta_{j+1}, \dots, \beta_k\right)$  (3.3)

The residual mean square error of the model is

$MS_E = \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{n - p}$  (3.4)

The regression sum of squares with variable $j$ is

$SS_R\left(\beta_0, \beta_1, \dots, \beta_j, \dots, \beta_k\right) = \hat{\beta}^T X^T y - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$  (3.5)

The regression sum of squares without variable $j$ is

$SS_R\left(\beta_0, \dots, \beta_{j-1}, \beta_{j+1}, \dots, \beta_k\right) = \sum_{r=0,\, r \neq j}^{k} \hat{\beta}_r S_{x_r y}$  (3.6)

where $S_{x_r y} = \sum_{i=1}^{n} x_{i,r} y_i - \frac{\left(\sum_{i=1}^{n} x_{i,r}\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$ and $\hat{\beta}$ can be estimated by applying ordinary least squares (OLS) to the model $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_{j-1} X_{j-1} + \beta_{j+1} X_{j+1} + \dots + \beta_k X_k$.

The $f_0$ value can now be estimated as

$f_0 = \frac{SS_R\left(\beta_j \mid \beta_0, \beta_1, \dots, \beta_{j-1}, \beta_{j+1}, \dots, \beta_k\right)}{MS_E}$  (3.7)

If the computed $f_0$ value is more than $f_{\alpha, r, n-p}$, then variable set $j$ is considered significant [118].
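The same test can be carried out by fitting the full and reduced models and comparing their residual sums of squares, since the increment in the regression sum of squares equals the decrease in the residual sum of squares. A MATLAB sketch for a single candidate variable (r = 1) follows; it uses finv from the Statistics and Machine Learning Toolbox, and the function name and interface are illustrative.

% Hedged sketch: partial F-test for variable j via the extra sum of squares.
% Xf - n x p full regressor matrix (intercept column included), y - n x 1
% j  - column index of the variable under test, alpha - significance level
function significant = partialFTest(Xf, y, j, alpha)
    [n, p] = size(Xf);
    bFull = Xf \ y;                           % OLS fit of the full model
    sseFull = sum((y - Xf * bFull).^2);
    Xr = Xf(:, [1:j-1, j+1:p]);               % reduced model without variable j
    bRed = Xr \ y;
    sseRed = sum((y - Xr * bRed).^2);
    ssrInc = sseRed - sseFull;                % SS_R(beta_j | others), Eq. (3.3)
    msE = sseFull / (n - p);                  % residual mean square, Eq. (3.4)
    f0 = ssrInc / msE;                        % Eq. (3.7) with r = 1
    significant = f0 > finv(1 - alpha, 1, n - p);
end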

Iteration:

Phase 1 is re-run with the initial models being those identified after the phase 2 statistical testing. After phase 1, the statistical significance of the new model parameters is tested using the phase 2 computations. If no further revisions are made to the model orders, the MML algorithm stops, and the partitions, the corresponding models and their parameters are the output of the algorithm. If there are further revisions, the iteration continues until model revisions are no longer necessary. A flow chart of the proposed approach, where these two phases are iterated upon, is provided in Figure 4.2.

Further technical details of the phase 1 update algorithms:

Algorithm I is a traditional optimization update of the cluster centers based on the first-order necessary conditions of optimality. The objective of the optimization problem is the fuzzy weighted sum of squared prediction errors, and the first-order necessary condition is that its first derivatives at the optimum are zero:

$\frac{\partial f}{\partial C_i} = \frac{\partial}{\partial C_i}\left( \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} \mu_{ij}^{q} \left\| y_j - C_i x_j \right\|^2 \right) = 0$  (3.8)

After differentiating with respect to $C_i$ and equating the derivative to zero, the optimal solution obtained is:

$C_i = \left( \sum_{j=1}^{M} \mu_{ij}^{q} y_j x_j^T \right) \left( \sum_{j=1}^{M} \mu_{ij}^{q} x_j x_j^T \right)^{-1}$  (3.9)

Algorithm II is a line-search based optimization update that minimizes the sum of squared prediction errors assuming linear models. Line search is a gradient-based optimization in which, at each step, the direction and the magnitude of the next step are evaluated until an optimal solution is reached.

[Flow chart summary: define the algorithmic parameters (number of assumed models, maximum iterations, fuzzifier) and distribute the data randomly into groups of equal size; estimate the model parameters using OLS regression; compute the prediction errors and membership values; update the model parameters using either FMC algorithm and iterate until the RMSE tolerance or the iteration limit is reached; merge clusters of the same order whose mutual angle is less than 5°; discard clusters holding less than 5% of the total data and reassign their points to suitable clusters; conduct an F-test on the model parameters with α = 0.01; if any model has insignificant variables, reduce its order and repeat the procedure; otherwise report the converged model parameters and their respective data members.]

Figure 4.2 Flow chart of PE based clustering using variable significance testing for MML

The gradient at the rth iteration is:

$g^r = \frac{\partial f}{\partial C_i} = -\sum_{j=1}^{M} \mu_{ij}^{q} \left( y_j - C_i^r x_j \right) x_j^T$  (3.10)

The magnitude of the step, i.e. the step length, at the rth iteration can be obtained using:

$\lambda^r = \frac{\sum_{j=1}^{M} \sum_{i=1}^{N} \mu_{ij}^{q} \left( g^r x_j \right)^T \left( y_j - C_i^r x_j \right)}{\sum_{j=1}^{M} \sum_{i=1}^{N} \mu_{ij}^{q} \left( g^r x_j \right)^T \left( g^r x_j \right)}$  (3.11)

The parameters are then updated using $C_i^{r+1} = C_i^r + \lambda^r g^r$.

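A sketch of the Algorithm II update, applied cluster by cluster and following Equations (3.10) and (3.11), is given below; the variable names are illustrative.

% Hedged sketch: Algorithm II line-search update for each cluster.
% X - M x d regressor matrix (row j is x_j'), y - M x 1 output vector
% C - N x d matrix of model parameters, U - N x M memberships, q - fuzzifier
for i = 1:N
    res = y' - C(i, :) * X';                  % residuals y_j - C_i x_j (1 x M)
    w = U(i, :).^q;                           % weights mu_ij^q
    g = -(w .* res) * X;                      % gradient g^r, Eq. (3.10)
    gx = X * g';                              % scalars g^r x_j (M x 1)
    lambda = sum(w' .* gx .* res') / sum(w' .* gx.^2);  % step length, Eq. (3.11)
    C(i, :) = C(i, :) + lambda * g;           % parameter update, Eq. (3.2)
end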
4.3 Efficacy of proposed approach to estimate static multiple linear regression

(SMLR) models

In this section, we demonstrate the performance of the proposed approach on four example problems. In all four examples, models of different orders are considered. The key contribution of the proposed approach is the prediction of different-order models without any assumption about the underlying models except linearity. The following examples shed light on the efficacy of the proposed approach in estimating static multiple linear regression models. The first three examples are multi-input single-output (MISO) systems and the fourth is a multi-input multi-output (MIMO) system. The first example shows the need for statistical testing by comparing the clustering results obtained with and without statistical significance testing; it also tests the ability of the proposed approach to remove insignificant variables, thus reducing the models from a predefined (large) order to the true model order. The second example examines the performance of the current approach as the size of the data sets changes while the input partitions remain the same; the data sets are generated uniformly within the partitions. The third example tests the current approach on data sets with altered data sampling. The fourth example tests the ability of the proposed approach to identify multiple models in the case of MIMO systems. All case studies are solved using the two FMC algorithms with a fuzzifier value of 2. Models of similar order are merged at the end of the FMC algorithm if the angle between the models is less than the threshold (5°).
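The merging test can be computed directly from the converged coefficient vectors; a short sketch, assuming the cosine-similarity measure of Equation (2.16) is the angle between the parameter vectors c1 and c2 (illustrative names):

% Hedged sketch: cosine-angle merging criterion for two converged models.
angleDeg = acosd(dot(c1, c2) / (norm(c1) * norm(c2)));
if angleDeg < 5     % threshold of 5 degrees
    % pool the data points of the two clusters and refit a single model by OLS
end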

4.3.1 SMLR example 1

This problem is synthesized to show the efficacy of the proposed approach in reducing the dimensionality of the models from a high value to the true model orders. In this example, we consider a data set of 1000 samples of 10 variables each. The data consists of three underlying models of true orders 5, 5 and 3 (6, 6 and 4 including intercepts). The model information is provided in Table 4.1. None of the models depends on 'Variable 2', which is added to the data to check the ability of the proposed approach to remove insignificant variables. In this example, most of the data points are generated by a single model, following the assumption of Cherkassky and Ma [105]. Noise generated with a uniform distribution in the range of [-0.1, 0.1] is added to the output variable Y.

Six models of order 10 (11 variables including the intercept) are initialized to solve the current problem with both FMC algorithms. The converged model orders, parameters and the number of data points in each model, with and without statistical testing, are shown in Table 4.1. All significant variables in the true models are boldfaced in Table 4.1; these can be filtered out of the converged models if statistical analysis is used. Without statistical analysis, though the input partitions are accurately obtained using the FMC I algorithm, the converged models contain insignificant variables. Using the FMC II algorithm, the data samples belonging to one true model (C2) converge to two models, M2 and M3. Due to the presence of insignificant variables in models M2 and M3, the similarity measure (the angle between the models) remains higher than the tolerance (5°). If the coefficients corresponding to the insignificant variables became zero, the similarity measure would decrease, resulting in the merging of the models. It can be observed from Table 4.1 that the proposed approach completely removes the insignificant variable (Variable 2) and obtains the true model orders of 5, 5 and 3 from the initialized models of order 10.

4.3.2 SMLR example 2

This example consists of four models of different orders. Since this problem is generated to test the current approach on different data sizes, three cases are tested, with 200, 500 and 1000 total data points respectively. An equal number of data points is generated in each partition (using the model for the chosen partition). Noise generated with a uniform distribution in the range of [-0.1, 0.1] is added to the output variable Y. Six models of order 5 (6 with intercept) are initialized, and both FMC algorithms are tested to identify the input partitions, true model orders and their parameters. The model information and clustering results are reported in Table 4.2. With the increase in data size, the estimated parameters of the models in each partition can be seen to converge towards the true model parameters, which ensures the consistency of the proposed algorithm.

4.3.3 SMLR example 3

This example shows the efficacy of the proposed approach under different data sampling. Though the data has five input variables, for better visualization of the partitions, the data is plotted using only two variables. The example includes three case studies. In the first case study, the data is uniformly distributed throughout the variable space and an equal number of data points is considered for each partition. In the other two case studies, two partitions contain an equal number of data points and the remaining two contain unequal numbers, with one holding almost half of the total data points. In the second case study, the data is randomly generated throughout the partitions, while in the third, the data is concentrated towards the boundaries of the partitions. Euclidean distance-based clustering algorithms will fail in these types of situations. The data partitions for all three case studies are shown in Figure 4.3.

Table 4.1 Original and converged model details of SMLR example 1

Original models:
C1: y = [1.8 0.75 0.4 0.7 0.7 0.6] [X3 X4 X5 X9 X10 1]'  (600 points)
C2: y = [-0.3 2.5 1.9 3 0.9 -2.5] [X5 X6 X7 X8 X10 1]'  (300 points)
C3: y = [-1.2 0.2 2.5 0.1] [X1 X9 X10 1]'  (100 points)

Without statistical analysis:
FMC I:
M1 = [-0.0002 0.0003 1.8003 0.7496 0.4007 -0.0003 0.0005 -0.0002 0.7003 0.6999 0.5997]  (598)
M2 = [0.000 0.000 0.0005 0.0001 -0.3001 2.5001 1.8996 3.0002 -0.0005 0.8999 -2.4991]  (301)
M3 = [-1.1986 0.0011 -0.0006 -0.0003 0.0014 -0.000 -0.0006 0.000 0.1996 2.4741 0.3250]  (101)
FMC II:
M1 = [-0.0007 0.0004 1.8005 0.7499 0.4005 -0.0005 0.0009 0.0001 0.7005 0.6977 0.5887]  (603)
M2 = [-0.0601 0.0587 -0.0293 0.0087 -0.2365 2.4463 1.8407 3.0162 -0.0432 0.5382 -0.3176]  (94)
M3 = [-0.0004 0.000 0.0010 -0.0007 -0.2999 2.5005 1.8996 3.0001 -0.0008 0.9019 -2.5104]  (198)
M4 = [-1.1946 0.0040 -0.0034 -0.0026 0.0023 0.0023 0.0007 -0.0006 0.1999 2.3173 1.7436]  (105)

With statistical analysis:
FMC I:
M1 = [1.8003 0.7496 0.4007 0.7002 0.6999 0.5999]  (598)
M2 = [-0.3001 2.5001 1.8997 3.0002 0.8996 -2.4974]  (301)
M3 = [-1.1985 0.1993 2.4748 0.3208]  (101)
FMC II:
M1 = [1.8005 0.7495 0.4006 0.7003 0.7000 0.6000]  (599)
M2 = [-0.3001 2.5001 1.8997 3.0002 0.8996 -2.4974]  (301)
M3 = [-1.1949 0.1980 2.3295 1.6316]  (100)

(Numbers in parentheses are the number of data points in each model.)

Table 4.2 Original and converged model details of SMLR example 2

Original models:
C1: y = [1.8 -2.5 0.2 -0.7] [X1 X4 X5 1]'
C2: y = [-2.2 0.9 -0.9 0.6 -1.5] [X1 X2 X3 X5 1]'
C3: y = [0.6 -1.3 0.8 0.3 0.5] [X2 X3 X4 X5 1]'
C4: y = [0.6 1.5 0.3 -0.5] [X1 X2 X5 1]'

Converged models (identical with FMC I and II):
Case 1 (200 points):
M1 = [1.8012 -2.4993 0.2165 -0.6429]  (49)
M2 = [-2.2003 0.9002 -0.8978 0.6184 -1.4806]  (48)
M3 = [0.6033 -1.2992 0.7973 0.2768 0.5192]  (52)
M4 = [0.5991 1.4988 0.3102 -0.5410]  (51)
Case 2 (500 points):
M1 = [1.8007 -2.5010 0.2097 -0.6666]  (124)
M2 = [-2.1984 0.8987 -0.8991 0.6121 -1.4828]  (125)
M3 = [0.5996 -1.2995 0.8012 0.2935 0.5125]  (126)
M4 = [0.5975 1.4997 0.3011 -0.5065]  (125)
Case 3 (1000 points):
M1 = [1.7996 -2.5003 0.2074 -0.6728]  (249)
M2 = [-2.1994 0.9004 -0.9001 0.5981 -1.4976]  (253)
M3 = [0.6007 -1.3004 0.7994 0.3070 0.4910]  (250)
M4 = [0.5990 1.5000 0.2971 -0.4912]  (248)

(Numbers in parentheses are the number of data points in each model.)

Table 4.3 Model information of SMLR example 3

C1: y = [1.5 0.7 0.3 -1.5 0.3 -0.9] [X1 X2 X3 X4 X5 1]'
C2: y = [-0.8 0.2 -0.5 2.5 1] [X2 X3 X4 X5 1]'
C3: y = [0.7 1.7 1.9 0.8] [X1 X2 X5 1]'
C4: y = [-3 2.3 -1.5] [X1 X4 1]'

Table 4.4 Converged model details of SMLR example 3 without statistical analysis

Case 1:
FMC I:
M1 = [1.4995 0.6993 0.3000 -1.4980 0.2984 -0.8955]  (248)
M2 = [-0.0012 -0.8005 0.1995 -0.5037 2.4969 1.0468]  (252)
M3 = [0.7007 1.6969 0.0002 0.0015 1.8996 0.7994]  (250)
M4 = [-2.9994 0.0002 -0.0003 2.2978 -0.0061 -1.4424]  (250)
FMC II:
M1 = [1.4993 0.7094 0.3000 -1.4894 0.3418 -1.0579]  (262)
M2 = [-0.0119 -0.8267 0.1741 -0.4679 2.2743 3.0365]  (234)
M3 = [0.6968 1.7010 -0.0024 0.0195 1.8764 0.7242]  (260)
M4 = [-2.9996 0.0006 -0.0002 2.2981 -0.0057 -1.4477]  (244)

Case 2:
FMC I:
M1 = [1.4980 0.6996 0.2995 -1.5045 0.2987 -0.8681]  (210)
M2 = [-0.0001 -0.8020 0.2009 -0.5018 2.5000 1.0127]  (487)
M3 = [0.7007 1.7019 -0.0007 -0.0018 1.8957 0.8181]  (153)
M4 = [-2.9985 -0.0014 -0.0020 2.2898 -0.0029 -1.3822]  (150)
FMC II:
M1 = [1.4964 0.7017 0.3006 -1.5055 0.2828 -0.8476]  (213)
M2 = [-0.0016 -0.8032 0.2011 -0.4981 2.5012 1.0074]  (500)
M3 = [0.7026 1.7030 0.0009 -0.0015 1.8905 0.7996]  (149)
M4 = [-3.0010 -0.0017 -0.0259 2.0221 -0.0889 1.7319]  (138)

Case 3:
FMC I:
M1 = [1.4980 0.6996 0.2991 -1.4992 0.2977 -0.8834]  (211)
M2 = [-0.0002 -0.8018 0.2010 -0.5000 2.4989 1.0126]  (490)
M3 = [0.7007 1.7014 -0.0001 -0.0059 1.8968 0.8514]  (149)
M4 = [-2.9984 -0.0003 -0.0011 2.3039 0.0035 -1.5630]  (150)
FMC II:
M1 = [1.4279 0.6473 0.2462 -1.5946 0.1478 0.5445]  (103)
M2 = [-0.0017 -0.8044 0.2066 -0.5048 2.4858 1.1096]  (518)
M3 = [-3.0459 -0.0662 -0.0429 1.7290 -0.2304 5.8630]  (147)
M4 = [1.4918 0.6652 0.2796 -1.5296 0.3263 -0.3758]  (81)
M5 = [0.7093 1.7055 0.0056 0.0147 1.8751 0.6297]  (151)

(Numbers in parentheses are the number of data points in each model.)

[Figure: three scatter plots of the input data in the X4–X5 plane for Case 1, Case 2 and Case 3.]

Figure 4.3 Partition of input data for SMLR example 3 (* - C1, o - C2, △ - C3, + - C4)

Table 4.5 Converged model details of SMLR example 3 using the proposed approach (with statistical analysis)

Case study FMC algorithm Model No of points


M1 = [1.4995 0.6994 0.3001 -1.4981 0.2983 -0.8969] 247
M2 =[-0.8004 0.1994 -0.5037 2.4966 1.0433] 252
1 I, II
M3 = [0.7007 1.6969 1.8996 0.8121] 250
M4 = [-2.9990 2.2981 -1.4929] 251
M1 = [1.4980 0.6996 0.299 -1.5045 0.2984 -0.8674] 209
M2 = [-0.8020 0.2009 -0.5017 2.5001 1.0114] 489
2 I, II
M3 = [0.7008 1.7019 1.8957 0.7991] 152
M4 = [-2.9985 2.2909 -1.4316] 150
M1 = [1.4980 0.6996 0.2991 -1.4992 0.2977 -0.8834] 211
M2 = [-0.8018 0.2010 -0.5000 2.4989 1.0115] 490
3 I, II
M3 = [0.7006 1.7014 1.8964 0.8026] 149
M4 = [-2.9985 2.3047 -1.5505] 150

In case 1, each partition has 250 data points, whereas in cases 2 and 3, the four partitions have 210, 490, 150 and 150 data points respectively. The model information is provided in Table 4.3. Noise generated with a uniform distribution in the range of [-0.1, 0.1] is added to the output variable Y. All three case studies are first solved using both FMC algorithms without statistical analysis, and the clustering results obtained are reported in Table 4.4. The problem is then solved using the proposed approach by initializing six models of order 5 (6 variables including the intercept), and the corresponding results are reported in Table 4.5. The benefits of the proposed approach are evident from the clustering results. Without statistical analysis, the FMC II algorithm is unable to identify the exact number of underlying models in case 3 and the input partitions in all three cases. Although the input partitions obtained using FMC I are satisfactory, the converged models contain insignificant variables. The proposed approach adequately predicts the input partitions and the respective model parameters using both FMC algorithms, though there is minor misclassification of data points.

4.3.4 SMLR example 4

The next step is to test the performance of the proposed approach on MIMO systems. The chosen example contains multiple outputs in different partitions with models of varying orders, as can be seen in Table 4.6. The differences in model orders for each output within a partition make this a challenging problem. Data points are generated from 3 different models involving 5 input variables, with each model contributing 100 data points. Noise generated using a uniform distribution in the range of [-0.1, 0.1] is added to the output variable Y. This problem is solved using the proposed approach by initializing six models of order 5 (6 including the intercept). Information about the original and converged models is provided in Table 4.6. It is interesting to note that, in the case of MIMO systems, the multiple outputs help improve the clustering efficiency, since each model has a prediction error for each data point that is consolidated from the multiple output predictions. The proposed approach estimates the exact input partitions and model parameters; the minor variations in the model parameters are due to the noise added to the data.

Table 4.6 Original and converged models of SMLR example 4 using both FMC algorithms

Model 1 (100 points):
Original:  y1 = [1.5 -1 0.75 -0.5] [X1 X4 X5 1]';  y2 = [1 0.65 -0.1 1 0.5] [X1 X2 X4 X5 1]'
Converged: y1 = [1.4980 -1.0010 0.7579 -0.4708];  y2 = [0.9980 0.6505 -0.1011 1.0079 0.5294]

Model 2 (100 points):
Original:  y1 = [1.5 -0.6 0.9 1.3] [X2 X3 X5 1]';  y2 = [1 0.75 -0.5 0.75 1] [X1 X2 X3 X5 1]'
Converged: y1 = [1.5018 -0.6017 0.8990 1.2972];  y2 = [1.0008 0.7518 -0.5018 0.7492 0.9975]

Model 3 (100 points):
Original:  y1 = [0.2 2 0.7 0.3 -1.25] [X1 X3 X4 X5 1]';  y2 = [-1.5 2 0.75 1 1.5] [X2 X3 X4 X5 1]'
Converged: y1 = [0.1999 2.0001 0.6979 0.2979 -1.2299];  y2 = [-1.5000 2.0001 0.7479 0.9979 1.5202]

4.4 Efficacy of proposed approach to identify PWARX models

PWARX models are like SMLR models except that the problem is realized in a dynamic setting: the output at a given time depends on the outputs and exogenous inputs at previous times. Thus, the regressors here are the time-lagged outputs and inputs, and the order depends on the number of lagged outputs (ny) and inputs (nu) affecting the current output. In this section, we demonstrate the performance of the proposed approach in identifying piece-wise autoregressive exogenous (PWARX) models. The section is divided into two subsections. In the first subsection, we test the proposed approach on a PWARX problem from the literature [78]. In the second, we test the approach on an example that shows its effectiveness in reducing high assumed model orders to the true orders in the dynamic case. Both examples are SISO systems and contain models with different (ny, nu) orders. These case studies are solved with both FMC algorithms using a fuzzifier (q) value of 2. Models of similar order are merged at the end of each convergent iteration if the angle between the models is less than 5°.

4.4.1 PWARX example 1

This example is adapted from the literature [78] and solved using two initial guesses (two cases). In the first case, the initial guess for both ny and nu is two. In the second case, a higher model order is assumed (both ny and nu as 4). It is shown that both guesses converge to the true model orders. Input data is generated with a uniform distribution in the range of [-10, 10] and is partitioned into four different models. Noise generated with a uniform distribution in [-0.1, 0.1] is added to the output variable. Since the input data is randomized, we could not regenerate the exact number of data points in each model as reported in the literature [78]. The numbers of data points generated in case 1 by models C1, C2, C3, and C4 are 150, 85, 154, and 111 respectively; in case 2, they are 150, 86, 153, and 111 respectively.

As mentioned earlier, in case 1, six models of order five (ny, nu assumed to be 2, plus an intercept) are initialized to solve the problem, while in case 2, six models of order nine (ny, nu assumed to be 4, plus an intercept) are initialized. The details of the true and converged models are given in Table 4.7. Except for a few misclassifications, the proposed approach is able to identify the input partitions, the number of underlying models and their true orders. In the second case, despite the assumption of higher order models, the proposed approach identifies the insignificant variables and converges to the true models. It is evident from this example that, for dynamic case studies, the proposed approach can identify true model orders through initialization with reasonably high model orders.

4.4.2 PWARX example 2

This example consists of four models with different orders, with ny and nu varying from one to three. Input data is generated from a uniform distribution in the range of [-10, 10], and the noise added to the output variable is generated using a uniform distribution in the range of [-0.1, 0.1]. The numbers of data points generated from models C1, C2, C3, and C4 are 128, 108, 127 and 137. This problem is solved using both FMC algorithms by initializing six models of order 7 (ny and nu as 3, plus an intercept). The proposed approach accurately predicts the number of true models and their orders. The original models, the number of data points in each model, and the model parameters obtained using both FMC algorithms are shown in Table 4.7. It can be observed from the results that the proposed approach converges to the true model orders (3, 7, 5 and 4 respectively) from the assumed higher orders (all 7). A close look at the data suggests misclassification of a few data points, a result of the noise added to the output variables at each time instance.

4.4.3 PWARX example with non-linear dynamics

In this study, we consider the piece-wise non-linear dynamic model from Gegundez et al. [119]. The original models contain the input and output variables of the previous time step and their squared terms. The problem is initially solved using OLS, and the single linear model obtained has statistically poor performance (R2 = 0.3529) in identifying the non-linear behavior. Hence, we solve the problem using both FMC algorithms by initializing four linear models (involving the non-linear variables). The original model is simulated for 2000 time samples, of which the first 1600 (240 of M1, 695 of M2 and 665 of M3) are used to identify the multiple models and the remaining samples are used to validate the obtained models. The input (U) and noise (E) are generated using uniform distributions in the ranges of [-4, 4] and [-0.1, 0.1] respectively. For each sample in the test data, a suitable cluster is predicted using the k-nearest neighbors (KNN) approach with a k value of 5. Information about the original and converged models is given in Table 4.8.
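The test-phase cluster assignment mentioned above can be sketched in MATLAB using fitcknn from the Statistics and Machine Learning Toolbox; Xtrain, labels (the converged cluster index of each training sample) and Xtest are illustrative names, and C is taken to be the matrix of converged model parameters.

% Hedged sketch: KNN assignment of test samples to converged models.
knnModel = fitcknn(Xtrain, labels, 'NumNeighbors', 5);
testLabels = predict(knnModel, Xtest);         % model index for each test row
yPred = sum(Xtest .* C(testLabels, :), 2);     % evaluate the selected model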

Table 4.7 Original and converged models of PWARX examples 1 and 2 using both FMC algorithms

PWARX example 1, original model:
Yk = [-0.7 0.7] [Uk-1 1]' + Ek                       if Uk-1 ∈ [-10, -4]
Yk = [-0.4 -0.7 -0.3] [Yk-1 Uk-1 1]' + Ek            if Uk-1 ∈ [-4, 0]
Yk = [0.6 -0.2 -0.3 0.1] [Yk-1 Uk-1 Uk-2 1]' + Ek    if Uk-1 ∈ [0, 6]
Yk = [0.3 -0.7 0.5 0.5] [Yk-1 Yk-2 Uk-1 1]' + Ek     if Uk-1 ∈ [6, 10]

Converged models (case 1, initial orders ny = nu = 2):
M1 = [-0.6958 0.7319]  (148)
M2 = [-0.4015 -0.6986 -0.2963]  (87)
M3 = [0.5997 -0.2029 -0.3008 0.1077]  (151)
M4 = [0.3009 -0.7003 0.4934 0.5505]  (114)

Converged models (case 2, initial orders ny = nu = 4):
M1 = [-0.6958 0.7330]  (147)
M2 = [-0.4002 -0.7027 -0.3095]  (88)
M3 = [0.5994 -0.2023 -0.3008 0.1077]  (151)
M4 = [0.3009 -0.7003 0.4934 0.5505]  (114)

PWARX example 2, original model (φk = [Yk-1 Yk-2 Yk-3 Uk-1 Uk-2 Uk-3 1]'):
Yk = [0.8 0 0 1.5 0 0 -0.7] φk + Ek                  if Uk-1 ∈ [-10, -5]
Yk = [0.7 -0.2 0.4 -0.9 -0.5 0.8 -0.5] φk + Ek       if Uk-1 ∈ [-5, 0]
Yk = [-0.3 0.8 0 -0.4 0.3 0 0.5] φk + Ek             if Uk-1 ∈ [0, 5]
Yk = [0.6 0 0 -0.7 0.8 0 -0.9] φk + Ek               if Uk-1 ∈ [5, 10]

Converged models:
M1 = [0.7999 1.5075 -0.6428]  (128)
M2 = [0.7003 -0.2007 0.4006 -0.9050 -0.4986 0.800 -0.5137]  (110)
M3 = [-0.2998 0.8003 -0.4052 0.2997 0.5179]  (127)
M4 = [0.5995 -0.7023 0.7998 -0.8901]  (135)

(Numbers in parentheses are the number of data points in each converged model.)

The bold parameters (zeros) in the table are identified as insignificant by the proposed approach. The R2 values for the test data set using the two FMC algorithms are 0.949 and 0.924 respectively. The high-magnitude residual errors for a few data points in the test data set are due to the wrong choice of KNN model, which is caused by the misclassification of data points at the boundary. This can be avoided by using alternative validation methods such as the weight-based nearest neighbors approach, the condensed nearest neighbor approach, etc. It can be observed from Table 4.8 that both FMC algorithms are able to identify the non-linear dynamics of the variables adequately using multiple linear models.

4.5 Efficacy of the proposed approach on two real-life case studies

4.5.1 Identification of energy performance of residential buildings

In this case study, eight input parameters, relative compactness (x1), surface area (x2), wall area (x3), roof area (x4), overall height (x5), orientation (x6), glazing area (x7) and glazing area distribution (x8), are used to predict the energy performance of residential buildings characterized by two output variables: heating load (y1) and cooling load (y2). Tsanas and Xifara [120] studied the effect of these input parameters on the output variables using statistical machine learning tools. Glazing area, followed by relative compactness, was identified as the most significant input variable based on importance metrics estimated using random forest modeling. Linear models for the two outputs were built independently using iteratively reweighted least squares.

In our work, 768 samples collected from the UCI machine learning repository [120] are used for this case study. The data is randomly divided into training and testing sets: 614 samples are used to obtain the multiple models, whereas the remaining samples are used to test the performance of the developed models. Initially, the output variables are assumed to be linearly dependent on the input variables and the model parameters are identified using ordinary least squares (OLS). Both the RMSE and R-squared values of the training and testing sets suggest that the prediction accuracy can be further improved using either a suitable non-linear model or piecewise linear models.
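A sketch of this single linear baseline, with illustrative variable names (X is the 768 x 8 feature matrix and Y the two-column response), is given below.

% Hedged sketch: OLS baseline for the energy performance data set.
idx = randperm(768);
tr = idx(1:614);  te = idx(615:end);          % 614 training, 154 test samples
Xa = [X, ones(768, 1)];                       % append an intercept column
theta = Xa(tr, :) \ Y(tr, :);                 % OLS for both outputs at once
Yhat = Xa(te, :) * theta;                     % predictions on the test set
rmseTest = sqrt(mean((Y(te, :) - Yhat).^2));  % per-output test RMSE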

Piecewise linear models are obtained using the proposed algorithm, with the FMC II algorithm, a fuzzifier value of 2 and a maximum of 500 iterations. Models of similar order are merged at the end of the FMC algorithm if the maximum angle between the models is less than the threshold (5°). The model parameters obtained with and without statistical testing, along with the R-squared values, are tabulated in Table 4.9. The K-nearest neighbors approach, with a k value of 5, is used to identify a suitable model for a new sample. It can be observed from Table 4.9 that the piecewise linear models perform better than the single linear models. Though the models obtained without statistical testing perform slightly better than the models obtained with statistical testing, they contain insignificant variables. The bolded values in the case of the multiple models obtained using statistical testing are identified as insignificant; thus, the true orders are identified. It is interesting to note that both the glazing area (x7) and the relative compactness (x1) appear in the models obtained using statistical testing, validating the claim made by Tsanas and Xifara [120]. None of the models obtained using statistical testing contains variables 2, 3 and 4, indicating that these variables do not affect the output variables at any operating condition. However, these variables appear in the models obtained without statistical testing. This case study illustrates the effectiveness of the proposed approach in obtaining models with only significant variables, thus avoiding redundant variables.

4.5.2 Identification of non-isothermal CSTR model dynamics for control

In this case study, the proposed clustering approach is used to identify the dynamics of a non-isothermal continuous stirred tank reactor (CSTR) with an irreversible reaction using PWARX models. The effluent concentration (y1) and temperature (y2) are controlled using the coolant flow rate (u). The process model consists of two nonlinear ordinary differential equations [121]. The differential equations are simulated for 1000 time samples using MATLAB Simulink with the model parameters and initial conditions provided in Nikravesh et al. [121] for 1000 coolant flow rates (ui). The input values are generated at three different operating regions; the input values and the corresponding output values are shown in Figure 4.4. The data is randomly divided into training and testing (local) sets in a ratio of 80:20. The underlying linear models are identified assuming dynamic models of order 2 (ny1 = ny2 = nu = 2).
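The order-2 regressor vectors for this identification can be constructed as in the following sketch; u, y1 and y2 are illustrative names for the simulated coolant flow rate and the two output sequences.

% Hedged sketch: building the order-2 PWARX regressors (ny1 = ny2 = nu = 2).
% y1, y2, u - T x 1 vectors of concentration, temperature and coolant flow
T = numel(u);
k = (3:T)';                                   % first usable time index is 3
Phi = [y1(k-1) y1(k-2) y2(k-1) y2(k-2) u(k-1) u(k-2) ones(numel(k), 1)];
targets = [y1(k) y2(k)];                      % outputs to be predicted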

To compare the performance of the proposed clustering procedure, multiple models are also identified using two recently published clustering approaches proposed by Wang et al. [122], namely local gravitation clustering (LGC) and communication with local agents (CLA). Wang et al. concluded, by testing on different benchmark studies, that both the LGC and CLA algorithms perform better than some well-established clustering approaches such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Along with these clustering approaches, a neural network (NN) approach was also tested to identify the non-linear dynamics. The neural network is trained using the Resilient Backpropagation training algorithm for three configurations, with 1, 5 and 10 hidden layers.

In our approach, the FMC II algorithm is initialized with five models. The LGC and CLA algorithm codes are downloaded from the MATHWORKS file exchange, and cluster identification is performed on the concatenated space of inputs and outputs. Once the clusters are identified, the model parameters are calculated using the OLS approach. In the case of the NN approach, the network is trained using the 'trainrp' function available in the MATLAB neural network toolbox. It can be observed from the plots of yk vs yk-1 in Figure 4.5 that the simulated data inherently consists of three clusters.

Table 4.8 Information of the original and converged non-linear PWARX models for the training data set

Original model, with xk = [Yk-1^2 Uk-1^2 Yk-1 Uk-1 1]':
Yk = M1 = [-0.4 1 0 0 1.5] xk + Ek        if 4Yk-1 - Uk-1 + 10 ≤ 0
Yk = M2 = [-0.3 0.5 0 0.5 -1.7] xk + Ek   if 5Yk-1 - Uk-1 + 6 ≤ 0
Yk = M3 = [0.5 -1 0.3 -0.1 -0.5] xk + Ek  otherwise

Converged models (number of samples in parentheses; R2/RMSE on training data):
FMC I:  [-0.256 0.996 0.018 0 1.759] (245); [-0.311 0.50 0.001 0.50 -1.687] (696); [0.501 -0.999 0.299 -0.1 -0.497] (659);  R2/RMSE = 0.9996 / 0.0574
FMC II: [-0.825 1.0129 -0.053 0 0.702] (251); [-0.321 0.50 0.003 0.50 -1.673] (700); [0.503 -0.999 0.301 -0.1 -0.499] (649);  R2/RMSE = 0.9996 / 0.0615
Single linear model: [-0.391 0.125 0.040 0.284 -1.023] (1600);  R2/RMSE = 0.3529 / 2.3409

Table 4.9 Information of the converged models and the corresponding prediction accuracy metrics. The parameters θ are such that y = θx', with x = [x1, x2, ..., x8]; RMSE and R2 values are reported as [y1; y2] pairs.

Single linear model:
θ = [-20.595 -0.012 0.039 0 5.361 0.017 20.225 0.179;
     -19.437 -0.001 0.020 0 5.654 0.198 14.575 0.017]
RMSE: Train [2.970; 3.214], Test [2.907; 3.306]
R2:   Train [0.912; 0.885], Test [0.922; 0.884]

Multiple models (without statistical testing):
θ1 = [-10.855 -0.005 0.015 0 4.211 -0.031 17.241 0.233;
      -6.371 0.002 0.001 0 4.112 0.118 11.864 0.0176]
θ2 = [-35.683 0.013 -0.006 0 7.583 0.045 20.871 0.007;
      -37.056 0.028 -0.034 0 8.187 0.261 15.681 -0.0818]
RMSE: Train [1.758; 1.800], Test [1.754; 1.921]
R2:   Train [0.969; 0.963], Test [0.971; 0.960]

Multiple models (with statistical testing):
θ1 = [-12.929 0 0 0 4.874 0 17.751 0.187;
      -4.448 0 0 0 4.332 0 13.523 0]
θ2 = [-20.017 0 0 0 6.717 0 22.170 0;
      0 0 0 0 4.837 0.173 12.635 0]
RMSE: Train [1.892; 2.146], Test [1.812; 2.253]
R2:   Train [0.964; 0.949], Test [0.970; 0.946]

Though the identification procedure for the FMC II algorithm is initialized with five models, the proposed approach converges to the true number of underlying models, i.e. three, due to the iterative statistical testing procedure. In contrast, FMC without statistical testing converges to five models. While the LGC and CLA algorithms [122] were able to converge to three models, they were not successful in identifying the true model orders. It is interesting to note that all the models converged using statistical testing for output variable 2 are of the same order, representing a unique dependency over the operating range. It can be observed from the converged models that both output variables are independent of uk-1 throughout the operating range but dependent on uk-2, representing the delayed effect of the input on the output.
To validate the models obtained using different approaches a different set of 400 data

samples (global test) are generated using the same model equations but for a different input

samples set. The inputs and corresponding output values are shown in Figure 4.4. Input is

generated using uniform distribution in the range of [93 113]. K-nearest neighbors approach is

used to identify a suitable model for a new test sample with a K value of 5. Various metrics of

prediction efficiency for training, local test, and global test sets are provided in Table 4.10. It

can be observed from Table 4.10 (bolded values) that the proposed strategy (FMC II with

statistical testing) performs better than the other techniques. Further, the proposed technique

also provides very interesting insights into the process (delayed response, unique dependency,

identification of redundant variables in a multiple learning framework) that the other

techniques do not provide.

The superior performance of the proposed approach is due to the true-order models identified in the training phase. Our technique performs better than the NN in our simulation studies and, further, yields interpretable models. The residual errors of the global test set using the several approaches can be seen in Figure 4.6. It is interesting to note that the NN approach predicts higher values than the original in the case of y1 and lower values in the case of y2, whereas the CLA predictions follow the opposite trend.

Figure 4.4 Simulated data - (a) 1000 data samples (training and testing) (b) Global test set

Figure 4.5 Plots of yk vs yk-1 signifying the 3 inherent clusters in the simulated data for both outputs

Figure 4.6 Residual errors of the global test set using the different approaches

Table 4.10 Details of the prediction accuracy of the different model identification approaches (RMSE and R2 values reported as [y1; y2] pairs)

FMC II (without statistical testing)
RMSE: Train [8.8E-4; 0.087], Local test [9.2E-4; 0.119], Global test [0.005; 1.236]
R2:   Train [0.997; 0.999], Local test [0.997; 0.999], Global test [0.941; 0.940]

FMC II (with statistical testing)
RMSE: Train [0.001; 0.122], Local test [0.001; 0.138], Global test [0.0043; 0.943]
R2:   Train [0.996; 0.999], Local test [0.996; 0.999], Global test [0.953; 0.965]

LGC (Z. Wang et al.)
RMSE: Train [3.4E-4; 0.0652], Local test [4.1E-4; 0.071], Global test [0.0047; 0.9714]
R2:   Train [0.999; 0.999], Local test [0.999; 0.999], Global test [0.946; 0.963]

CLA (Z. Wang et al.)
RMSE: Train [3.3E-4; 0.0695], Local test [3.7E-4; 0.071], Global test [0.0045; 1.003]
R2:   Train [0.999; 0.999], Local test [0.999; 0.999], Global test [0.949; 0.961]

Neural network (Resilient Backpropagation), 1 hidden layer
RMSE: Train [0.002; 0.425], Local test [0.002; 0.416], Global test [0.005; 1.022]
R2:   Train [0.982; 0.991], Local test [0.981; 0.991], Global test [0.938; 0.959]

Neural network (Resilient Backpropagation), 5 hidden layers
RMSE: Train [0.004; 0.328], Local test [0.004; 0.264], Global test [0.010; 1.793]
R2:   Train [0.945; 0.994], Local test [0.946; 0.996], Global test [0.739; 0.874]

Neural network (Resilient Backpropagation), 10 hidden layers
RMSE: Train [0.002; 0.216], Local test [0.002; 0.143], Global test [0.011; 2.416]
R2:   Train [0.991; 0.997], Local test [0.991; 0.998], Global test [0.724; 0.772]

4.6 Conclusion

In this work, we have evaluated prediction error-based FMC approach with statistical

significance testing for both static and dynamic MML problems. We show that statistical

significance testing has an effect in predicting both input partitions and true model orders. In

the case of static MML studies, all models were initialized to a sufficiently high order and showed

convergence to true model orders. A similar observation is made for dynamic MML problems

as well. A key factor to be noted is that the number of models and initial guesses for orders

need to be greater than that of the underlying model. Though it is difficult to confirm this

assumption from data alone, an iterative increase of the model order and the number of models can

ensure convergence. The proposed approach can predict true model information for SMLR

problems of different data sizes and data sampling. It can remove duplicate variables and

identify true model orders. The proposed approach is also shown to be useful in identifying

PWARX models with non-linear dynamics using multiple linear models. The proposed

approach also provides very interesting insights into the process such as identification of true

model orders, delayed response, unique dependency, and redundant variables in a multiple

model learning framework.

In this chapter, we proposed a prediction error based clustering approach with statistical

analysis that can identify underlying models and corresponding significant features set in a

single framework. It is observed that the proposed approach improves the clustering efficiency

and also provides interesting insights into the process. In the next chapter, we use the proposed

approach to obtain a non-linear relationship between structural features derived from first

principles and solvation free energy of Quinone derivatives in a QSPR framework without

using any additional feature selection algorithm.

CHAPTER 5

Prediction of solvation free energy of Quinone derivatives using

machine learning approaches in a QSPR framework

A flow battery (FB) is an electrochemical device in which the electrical energy is derived

from the chemical energy stored in the electrolytes. This chemical energy is converted to

electrical energy during discharge. The electrolytes are circulated through the cell during both

charge and discharge. In general, FBs contain two electrolytes, one to store the active materials

for negative electrode reactions and the other to store active materials for positive electrode

reactions [123]. Electrolyte solutions contain both reduced and oxidized form of reactants in

the same phase, where the relative concentrations of oxidized and reduced forms vary over the

course of charge or discharge. Due to their scalable nature, easy decoupling, and refueling, flow batteries are more efficient than other storage devices [124]. The low energy density values

possessed by various redox chemistries are the major impediments for flow battery

commercialization. Identifying new electrolyte chemistries with reasonable energy densities

can make flow batteries economically viable. Energy densities of existing electrolytes can be

improved by increasing their solubility by selecting suitable solvents. Soloveichik [125]

reviewed various flow battery technologies in detail along with the technical and economic

challenges and possible remedies for the same.

5.1 Literature survey

Quinones have been gaining interest as electrolytes for flow batteries in the past few years due to their ability to transfer two electrons per molecule and their impressive solubility

characteristics, which results in relatively high energy densities compared to other flow battery
chemistries. Quinones also exhibit minimal membrane crossover due to their large molecule

size and can be produced on a large scale with less expenditure compared to vanadium.

Suleyman et al. [126] showed that solubility of Quinones and the reduction potential of

Quinone redox couples can be tuned by substituting various functional groups. They used

Perdew–Burke–Ernzerhof (PBE) based DFT calculations to obtain reduction potential and

solvation free energy values. Quinones can be used as electrolytes on both sides of a flow

battery, hence a rigorous exploration of Quinone derivatives space to identify more efficient

electrolytes is of interest. Quick and robust structure-property relationships are required for

further exploration of Quinones either by substituting with a new set of functional groups or

by substituting with two or more functional groups on a single molecule. These relationships

can be useful for the computationally tractable search of potential molecules in the derivatives

space avoiding computationally expensive DFT simulations.

Group contribution (GC) approaches are well established to estimate a wide variety of

physical and chemical properties ranging from melting point to toxicity of organic molecules

[127]–[131]. These approaches assume that the organic molecules are constructed using

fragments from a predefined set of fragments and the properties of these compounds are

linearly dependent on the occurrences of each fragment. Marrero and Gani[132] proposed an

efficient multilevel group contribution approach, in which the property of interest is initially

regressed with the occurrences of first-order groups. Then the residuals are regressed with the

occurrences of second order groups. Finally, the remaining residuals are related to the

occurrences of third order groups. First order groups are simple functional groups that can form

a molecule structure such that no atom will be counted twice. Second order groups are used to

distinguish between isomers effectively. Third order groups are usually fused and non-fused

rings. Using this multilevel group contribution approach, Marrero and Gani[133] estimated

octanol/water partition coefficient and aqueous solubility of a broad range of compounds

ranging from C3 to C70. Correa et al. [134] proposed Analytical Solutions of Groups (ASOG)

group contribution approach to predict water activities in aqueous electrolytes.

Quantitative structure-activity or structure-property relationships (QSAR/QSPRs) are the

mathematical representations of the functional behavior between the biological activity or

chemical response of a component and its quantifiable structural information. This structural

information is denoted in the form of structural descriptors/features such as atom counts,

surface area, refractivity, etc. QSPR/QSARs are widely used in the fields of molecule design,

predictive toxicity and drug design[74] for identifying various properties of organic molecules

such as flash point[135], vapor pressure, water-air partition coefficients[136], water-octanol

partition coefficients[137], solubility[138] and toxicity[139] etc. Any QSPR/QSAR study

involves three major steps, i.e. calculating structural features (descriptors) for the predefined

molecules set, identifying suitable descriptors and obtaining an efficient association between

structural features and the property of interest. Structural features can be obtained using first

principles, theoretical models and platforms like PaDEL-Descriptor, DRAGON, OpenBabel,

etc.[74], which are specifically designed for the calculation of structural features. Selecting

suitable descriptors and obtaining robust models involve a wide range of chemometrics such

as PCA, regression tools and neural networks, etc. Yousefinejad and Hemmateenejad [77]

consolidated various chemometric methods used in both feature selection and model

development phases of QSPR studies. Once a robust QSPR is identified, the structural features

for a specified objective can be obtained in an inverse QSPR framework [140].

Various kinds of feature selection algorithms have been proposed in the literature, i.e. classical methods

such as forward selection and backward selection [141], artificial intelligence based methods

such as genetic algorithm (GA) [117], particle swarm optimization (PSO) based approaches

[142] and dimensionality reduction based approaches such as principal component analysis etc.

Forward selection approach starts with zero descriptors and in each step, one new descriptor is

added based on predefined criteria until a stopping criterion is satisfied. Backward selection starts with the complete set of descriptors and, in each step, one descriptor is removed based on predefined criteria until a stopping criterion is satisfied. In stepwise

selection, a combination of both forward and backward selection at each step is shown to be

more robust. GA and PSO frameworks formulate feature selection as an optimization problem

with binary variables i.e. each variable corresponds to the decision of whether a feature should

be considered or not. Principal component based approaches obtain a few linear combinations of

original descriptors, which can explain maximum variability in the data.

In the era of machine learning, due to the availability of a wide range of modeling

techniques, selecting a suitable modeling method is also crucial to obtain a robust structure

property relationship. Each modeling technique has its own advantages and disadvantages.

Multivariate linear regression can be used if the dependency of the property of interest on

selected features is anticipated to be linear [143]. Principal component regression [144] can be explored when there are correlations among the inputs, while polynomial regression and artificial neural networks (ANN) [145] can be explored when a nonlinear relationship exists between the selected features and the property of interest. Though ANN models can fit very complex nonlinear behavior,

interpretability and overfitting are major issues. Piecewise linear models have been shown to

mimic nonlinear behavior using piecewise linear assumptions [146]. In our earlier work [147],

piecewise linear models were used to fit the non-linear behavior in order to obtain a robust

QSPR to predict drug solubility in binary systems. We proposed a prediction error based

clustering approach in our previous work [83], which can identify the significant features as

well as operating models in a single framework.

In this work, initially, a group contribution based approach is employed to correlate the

structure of Quinone derivatives to their solvation free energy values provided in the

literature[126]. Later, various QSPR based approaches are used to correlate structural features

of Quinone derivatives to their solvation free energy values. A brief overview of group

contribution approaches, QSPR approaches, and useful chemometric approaches is provided

in this section. In the following section, a problem specific group contribution approach is

described to obtain the solvation free energy. In section 3, three different QSPR approaches i.e.

linear, neural network and piecewise linear models are employed to obtain a robust QSPR.

Finally, this chapter concludes with comments and discussions on the efficiency of the

proposed approaches.

5.2 Group contribution approach

Group contribution (GC) approaches assume that the property of interest of a compound is

a function of a predefined set of structural fragments and it is computed by summing the

frequency of each group occurring in the molecule times its contribution [132]. The group

contribution framework to obtain the properties of interest of organic molecules is shown in

Figure 5.1. GC approach involves two steps, initially, the occurrences of each fragment are

counted and then a linear relationship is obtained between the property of interest and the

occurrences of each fragment to obtain the contribution of each fragment. The contributions

(𝐶) of all fragments are calculated using equation 1, where 𝑓(𝑋) is the property of interest of

molecule 𝑋, 𝑁𝑖 is the number of times fragment 𝑖 occurred in molecule 𝑋 and 𝐶𝑖 is the

contribution of fragment 𝑖. Now, for any new test molecule, the occurrences of each fragment

are evaluated and substituted in the linear relationship (in equation 1) to obtain the property of

interest of the test molecule.

f(X) = \sum_{i=1}^{n} N_i C_i \qquad (4.1)

In this case study, data set [126] for the solvation free energy estimation includes three

variants of Quinones i.e. benzoquinone, naphthoquinone, and anthraquinone substituted with

18 functional groups. To differentiate the three variants of Quinones and the 18 functional

groups, we considered 41 different types of fragments, which are specific for this case study.

38 out of these 41 fragments are first-order groups, whereas the remaining 3 groups belong to

the second-order, which are useful to differentiate between the Quinone types. The data set

contains 407 data samples. Initially, the data is randomly divided into ‘model’ and ‘global test’

data sets with 80% and 20% of data samples respectively. Model data set is used to obtain

contributions of each group (i.e. model parameters) in association with K-fold (K as 5)

validation approach. In each run, the model data set is again randomly divided into K-equal

partitions and each time data in K-1 folds are used to train the model and the remaining to test.

This procedure is repeated for 100 random runs and the model parameters (contributions of all

groups) are averaged and reported as the final set of parameters in Table 5.1 along with the

performance metrics.

Figure 5.1 Group contribution approach framework

Table 5.1 All 41 groups that are considered for the case study along with contributions

Group          Contribution   Group          Contribution   Group                          Contribution
aC-H           -0.8263        aC-COOH        -21.1883       C-CHO                          -6.7621
aC-N(CH3)2     -1.127         aC-PO3H2       -28.4856       C-COOCH3                       -6.3738
aC-NH2         -14.5622       aC-SO3H        -14.5847       C-CF3                           2.7074
aC-OCH3         0.8025        aC-NO2         -0.9337        C-CN                           -9.4303
aC-OH          -11.5993       C-H            -2.7929        C-COOH                         -22.5123
aC-SH           0.0749        C-N(CH3)2      -6.6628        C-PO3H2                        -32.2944
aC-CH3          0.6223        C-NH2          -13.747        C-SO3H                         -20.3163
aC-SiH3         3.7311        C-OCH3          3.3567        C-NO2                           0.0017
aC-F            3.7044        C-OH           -10.9846       both C=O on different rings    -22.2017
aC-Cl           2.7168        C-SH            0.5817        both C=O side by side
aC-C2H3        -0.5199        C-CH3          -4.2883          of the same ring             -27.1104
aC-CHO         -7.066         C-SiH3          1.5231        both C=O on opposite
aC-COOCH3      -5.1764        C-F             1.4416          sides of the same ring       -16.4325
aC-CF3          2.3128        C-Cl           -10.7412
aC-CN          -11.117        C-C2H3         -1.6885

It can be observed from the contribution values in Table 5.1 that substituting with PO3H2 can increase the solubility (low solvation free energy), followed by COOH, SO3H, and NH2, as suggested in the literature [126]. It is also interesting to note from the second-order functional group contributions in Table 5.1 that having two C=O groups side by side in a ring can increase the solubility more than having two C=O groups opposite to each other. This can be validated by comparing the solvation free energy values of the 1,2-BQ, 1,2-NQ and 1,2-AQ variants (i.e. substituted with the functional groups) with the 1,4-BQ, 1,4-NQ and 1,4-AQ variants respectively. The performance metrics of the GC approach for estimating solvation free energy can be obtained from Table 5.4. Though considering more fragments to differentiate isomers effectively can improve the performance of the GC approach, the size of the data is an impediment in this case study. The major setback of GC approaches is that the property of interest cannot be evaluated for a new molecule that contains fragments not included in the training set.

5.3 QSPR based approaches

Identifying QSPR consists of three phases, i.e., data generation, feature selection, and model

prediction. In the data generation phase, chemical structures are converted into an accessible

form such as .mol, .smi, etc. to calculate structural feature values. Structural features or

descriptors can be obtained from first principles models[148], experimental methods and

several platforms designed for estimating structural features such as MOPAC, OpenBabel, and

PaDEL-Descriptor [74], etc. Feature selection involves both domain and mathematical

knowledge to identify the significant and independent features set that affects the property of

interest. Feature selection algorithms such as forward selection, backward selection, stepwise

regression, and evolutionary optimization approaches are mathematical ways of exploring the

most suitable feature subset to reduce the model complexity, thus avoiding overfitting of

models[75]. The model prediction is the process of identifying a robust model between the

significant features set and the property of interest. QSPR framework to estimate any end

property of organic molecules is depicted in Figure 5.2.

[Figure 5.2 flowchart: Start -> Calculate the descriptors using available tools (MOPAC, CODESSA PRO, PaDEL-Descriptor) -> Select independent significant features (genetic algorithm, stepwise algorithm, PCA) -> Obtain the relationship using available tools (linear regression, non-linear regression, neural networks) -> Stop]

Figure 5.2 QSPR framework to estimate the property of interest of organic molecules

PaDEL-Descriptor [81] is an openly available software to compute various kinds of

structural features of molecules varying from topological parameters to chemical fingerprints.

In this case study, for QSPR estimation, the solvation free energy data of 407 Quinone variants

provided in the literature [126] is used. Structure files of all 407 Quinone variants are generated

in smiles (.smi) format and processed to obtain the structural features using PaDEL-Descriptor.

McGowan characteristic volume (McG_Vol), Molecular weight (MW), Van der Waals volume

(VABC), first ionization potentials (Si), sum of atomic polarizabilities (Apol), solvent

accessible surface area (TSA), topological polar surface area (TopoPSA), combined

polarizability (MLFER_S), excessive molar refraction (MLFER_E), Molar refractivity

(AMR), overall hydrogen bond basicity (MLFER_BH) and acidity (MLFER_A) values of the

molecule are found to affect solvation free energy values [72], [147], [149]; hence these are

considered as structural features for this case study. Since the structural features can be of

different magnitudes, to avoid the influence of any particular variable on the model parameters,

features are scaled individually by mean centric scaling using the mean and standard deviation

of a particular feature.
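For concreteness, this scaling amounts to the following lines (a sketch; the feature matrix here is a placeholder):

```python
import numpy as np

X = np.random.rand(407, 12)  # placeholder: 407 molecules x 12 structural features
# Mean centric scaling: each feature is centred by its mean and divided by its
# standard deviation, so that no single feature dominates the model parameters.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```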

5.3.1 Single linear model based QSPR

In this case study, initially, we identify the significant features using K-fold validation (K value

as 10) in conjunction with F-test. For feature selection, a linear relationship is assumed between

the 12 descriptors set and solvation free energy. The model parameters obtained in each fold

are averaged and each parameter is tested with F-test to check if it is significant or not. It is

identified that out of 12 variables, 7 variables i.e. McG_Vol, MLFER_A, MLFER_BH, MW,

MLFER_S, MLFER_E, and TopoPSA are significant. Later, a linear relationship is obtained

between the above identified significant features set and solvation free energy using ordinary

least squares associated with K-fold validation (K value as 5).

Initially, the data is randomized and divided into ‘model’ and ‘global test’ data sets with

80% and 20% of data samples respectively. Data in ‘model’ data set is randomized and equally

divided into K-partitions and each time data in K-1 partitions are used to train the model and

the trained model is tested on remaining data. Model coefficients obtained in all K-folds are

averaged for 100 random runs and reported as final model parameters. Performance metrics of

linear relationship obtained on the model data set, global test set and overall data set can be

obtained from Table 5.4. The poor performance (R2 value of 0.6395) of the single linear model suggests that the linear behavior assumption may not be valid; hence, a non-linear model can be anticipated to increase the prediction accuracy. In the following subsections, neural network and

piecewise linear based models are tested to obtain robust non-linear models.
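A minimal sketch of the significance screening used above is given below; the exact F-test statistic used in the thesis is not reproduced in this section, so a standard partial F-test (extra sum of squares) on each descriptor is assumed, with hypothetical array names.

```python
import numpy as np
from scipy import stats

def f_test_significance(X, y, alpha=0.05):
    """Flag each column of X as significant or not via a partial F-test."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss_full = np.sum((y - X @ beta) ** 2)
    significant = []
    for j in range(p):
        X_red = np.delete(X, j, axis=1)          # refit without descriptor j
        beta_red, *_ = np.linalg.lstsq(X_red, y, rcond=None)
        rss_red = np.sum((y - X_red @ beta_red) ** 2)
        F = (rss_red - rss_full) / (rss_full / (n - p))
        significant.append(F > stats.f.ppf(1 - alpha, 1, n - p))
    return np.array(significant)
```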

5.3.2 Neural network based QSPR

Neural networks are highly recommended to mimic non-linear behavior due to their ability

to capture complex functions. Jalali-Heravi et al.[150] concluded that the Levenberg-

Marquardt algorithm is more suitable for QSPR prediction compared to other training

approaches such as backpropagation and conjugate gradient algorithms. In this case study, a

Levenberg-Marquardt algorithm based neural network is trained to identify the relationship

between structural features and solvation free energy. In this case study, to identify significant

features, a backward stepwise approach[151] is used. In this approach, if n variables have to

be ranked, initially n networks with n-1 different variables have to be trained on training data

set. The nth missed out variable for which the network results in the largest error on test data

set is considered to be the most important. To identify the next most significant variable, the

current important variable is removed and the above procedure is repeated. This procedure is

continued until all n variables or the first m (<n) important variables get individual rankings.

In this case study, feature selection is carried on 12 input variables (structural features) with 1

hidden layer architecture. The ranking procedure described above is repeated for 100 random

runs and the consolidated rankings are reported in Table 5.2.
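The ranking loop can be sketched as follows; fit_predict is a hypothetical callable that trains a fresh network on the given columns and returns predictions on the test inputs (any regressor with the same interface would work).

```python
import numpy as np

def backward_stepwise_rank(X_tr, y_tr, X_te, y_te, fit_predict):
    """Rank features, most important first, by the leave-one-out scheme above."""
    remaining = list(range(X_tr.shape[1]))
    ranking = []
    while len(remaining) > 1:
        errors = {}
        for j in remaining:                      # train with variable j left out
            cols = [c for c in remaining if c != j]
            pred = fit_predict(X_tr[:, cols], y_tr, X_te[:, cols])
            errors[j] = np.sqrt(np.mean((y_te - pred) ** 2))
        worst = max(errors, key=errors.get)      # largest test error when missing
        ranking.append(worst)                    # => currently most important
        remaining.remove(worst)
    ranking.extend(remaining)
    return ranking
```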

The first seven important variables are used to obtain the final network to estimate solvation

free energy. Data is randomly divided into ‘model’ and ‘global test’ data sets with 80% and

20% of data samples respectively. Initially, to obtain optimal network architecture, networks

with hidden layer sizes ranging from 1 to 7 are tested for 10 random runs. In each run, a neural

network is trained for all 7 configurations on the model data set and tested on testing data set.

The adjusted root mean squared error values of all 7 networks are averaged for all 10 runs and

the network architecture with the least averaged error on the test data set is assumed to be the

optimal network. The mean adjusted RMSE values of networks with 1 to 7 hidden layers on

test data set are 30.2237, 30.8945, 29.0444, 30.1465, 33.3646, 36.7107 and 44.5868

respectively, hence the network with 3 hidden layers is considered to be optimal. Once the

optimal network architecture is obtained, then a neural network with a similar architecture is

trained on model data set for 100 random runs. The network with the least root mean squared

error on the test data set is considered to be the final network to predict solvation free energy.

Prediction accuracy of the final neural network model on model data set, test data set and on

overall data is given in Table 5.4. It can be observed from Table 5.4 that neural network based

QSPR performs better than the single linear model based QSPR due to the ability of NN to

mimic the non-linear behavior.

Table 5.2 Features ranking obtained using a stepwise approach for NN-QSPR

Variable Rank Variable Rank Variable Rank Variable Rank

MLFER_A 1 MLFER_E 4 AMR 7 Apol 10

TopoPSA 2 TSA 5 MW 8 Si 11

MLFER_BH 3 MLFER_S 6 McG_Vol 9 VABC 12

5.3.3 Multiple model based QSPR

Piecewise linear models have been shown to identify non-linear behavior[79], [80] and are also

easily interpretable. Identifying piecewise linear models and their operating regions is known

as multiple model learning. In this case study, we assume solvation free energy values are

linearly dependent on the structural features with different hyperplanes in different regions. A

fuzzy clustering approach based on prediction error is used to obtain the operating models [83].

The major advantage of this approach is that both feature selection and model identification

are included in a single framework. It is interesting to note that in different operating regions

different structural features can be significant. The data is randomly divided into ‘model’ and

‘global test’ data sets with 80% and 20% of data samples respectively. In this approach,

initially, the number of underlying models and their true orders (i.e. significant features in each

operating region) are identified using the clustering approach [83] on model data set. Later, the

clustering procedure is initiated with true models and their orders associated with K-fold (K

value as 4) validation on model data set to obtain the final model parameters.

Details of the PE based multiple model clustering approach using statistical analysis [83] are given below (a brief sketch of the core update steps follows the list):

1. Initially, randomly generate N vectors of predefined orders with different parameter

values. Each vector with a different set of parameter values denotes a different cluster.

2. Obtain the prediction error of sample j with respect to each randomly generated cluster

i using Equation (2.10)

3. Calculate fuzzy membership of sample j with respect to cluster i using Equation (2.11)

4. Update the model parameters (cluster centers) using the gradient descent algorithm

given in Equation (2.12)

5. Calculate the prediction errors for all samples with respect to the updated models

6. Compute root mean square error using Equation (2.15)

7. Terminate if any pre-specified criterion is satisfied (the number of iterations exceeds the limit or the RMSE is less than the predefined limit) and go to the next step; else go to step 3

8. Assign each data sample to respective clusters based on prediction errors

9. Calculate the cosine angle between each model to the others using Equation (2.16) and

merge like models

10. The models that have fewer data samples (<0.05M) are discarded and the data points

are reassigned to models that fit them best

11. Calculate the final model parameters using ordinary least squares (OLS).

Once the final set of models is obtained at the end of an iteration (steps 1 to 11), each model is tested with the F-test [118] to identify whether a particular variable is significant or not.

12. Each model is tested using F-test, and if any variable in a particular model is identified

as insignificant then that variable will be removed thus reducing the model order.

13. If any of the models contain insignificant variables then the whole clustering approach

is restarted with a new number of models and their orders i.e. go to step 1 with modified

‘N’ and their individual orders, else report the final model orders and parameters.
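Since Equations (2.10)-(2.12) and (2.15)-(2.16) are defined in an earlier chapter and are not reproduced here, the sketch below only conveys the shape of steps 2-4: it assumes an FCM-style membership computed from squared prediction errors and a plain gradient step on the membership-weighted least squares objective; all names are illustrative.

```python
import numpy as np

def pe_fuzzy_update(X, y, theta, m=2.0, lr=1e-3):
    """One pass of steps 2-4: errors, memberships, gradient update."""
    N = theta.shape[0]
    # Squared prediction error of every sample against every model (step 2).
    err = np.stack([(y - X @ theta[i]) ** 2 for i in range(N)])
    err = np.maximum(err, 1e-12)
    # FCM-style fuzzy membership: small error => large membership (step 3).
    inv = err ** (-1.0 / (m - 1.0))
    mu = inv / inv.sum(axis=0, keepdims=True)
    # Gradient descent on the membership-weighted least squares loss (step 4).
    for i in range(N):
        grad = -2.0 * X.T @ ((mu[i] ** m) * (y - X @ theta[i]))
        theta[i] -= lr * grad
    return theta, mu
```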

In this study, the multiple model learning approach contains two stages. In the first stage, to

obtain the true number of models and their individual orders, the clustering approach described

above is used on the model data set. In the first iteration, in step 1, five models are initiated

with 13 input variables (12 scaled structural features and intercept) with random model

parameters. From the second iteration onwards, the models are initiated with the final set of

models that are obtained in the previous iteration. Once the true models and their corresponding

model orders are obtained, in the second stage, to obtain robust model parameters we use a K-

fold (K value as 5) validation based approach. Model data set is equally divided into K equal

partitions. In each fold, data in K-1 partitions are used to build the model and remaining data

to test obtained models. In K-fold validation, switching 20% of data samples each time for a

new fold results in a substantial deviation in model parameters. Hence, to obtain robust multiple

models, an iterative weight based optimization approach is used [147], in which the models

information in the previous fold is included in form of weight-based objective for the next fold.

In this approach, the final set of models in step 11 are obtained using a weight based

optimization approach for the objective specified in Equation (4.2) with a λ value of 10 and the final model parameters (C_prev) obtained in the previous fold as an initial guess. In the case of the

first fold of the first phase, both in step 1 (initiation of the models) and in step 11 (for the

optimization problem) the model parameters obtained from the first stage (i.e. identification of

the true number of models and model orders) are used. This procedure is repeated until the

respective models in all the folds are relatively close, which can be measured using the

similarity metric (φ) proposed in our earlier work [147]. In the second stage, models are neither merged nor discarded (i.e. steps 9 and 10) and statistical testing (i.e. steps 12 and 13) is also avoided, since the clustering procedure in this stage is assumed to start with the true number of models and their orders.

Weight based objective:   \min_{C} \left[ \sum_{i=1}^{N_{tr}} \left( y_i - y_i^{pred} \right)^2 + \lambda \left\| C - C_{prev} \right\|^2 \right]    (4.2)

Similarity metric:   \phi = \max\left( \phi_{i,k} \right); \; i \in N; \; k \in K    (4.3)

where   \phi_{i,k} = \max\left( \min\left( \delta(i,j)_{k'} \right) \right); \; j \in N; \; k' \in K; \; k' \neq k    (4.4)

The pseudo code for the identification of multiple model parameters (second stage) is as follows:

Do while:
    For fold in 1 to K:
        Divide the whole data into the respective training and testing data sets
        Initialize the clusters with the final model parameters provided in the previous fold
        Follow the clustering procedure provided above from step 2 to step 8
        Obtain the final model parameters using the weight based optimization approach for the
        objective specified in Equation (4.2), with the final model parameters obtained in the
        previous fold as the initial guess
    End
    Obtain the similarity metric of the multiple models obtained in all the folds
    If the similarity metric is smaller than the tolerance, or larger than the similarity metric in the
    previous iteration, then terminate and report the model parameters averaged over all the
    K-folds as the final set of model parameters; else, continue
End
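A sketch of the weight based update of Equation (4.2) for a single model is shown below; scipy's general purpose minimizer stands in for whatever solver is actually used, and λ is set to 10 as in the text.

```python
import numpy as np
from scipy.optimize import minimize

def fit_with_weight_penalty(X, y, C_prev, lam=10.0):
    """Minimize Equation (4.2): squared prediction error plus a penalty that
    keeps the new parameters close to the previous fold's parameters C_prev."""
    def objective(C):
        resid = y - X @ C
        return resid @ resid + lam * np.sum((C - C_prev) ** 2)
    return minimize(objective, x0=C_prev, method="BFGS").x
```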

In our earlier work [147], a prediction error based K-nearest neighbors testing approach is

proposed in order to identify an appropriate model for a new test sample. In this case study, we

used a weighted prediction error based K-nearest neighbors method. The weights provided are

inversely proportional to the distance from the test sample to the neighbor i.e. a neighbor, which

is nearest to the test sample will have more impact than a neighbor which is far from the test

sample. The clustering procedure is initiated assuming five linear models of order 13 (12 scaled

structural features and intercept) and it converges to three models of different orders.

It is interesting to note that in different operating regions of feature space, different features

are found to be significant. The details of the converged models along with the number of data

points (of model data set) that belong to each model is given in Table 5.3. It is interesting to

note that VABC, MW, and AMR are found to be insignificant in the whole feature space, which

is also validated by the neural network. It can be observed from Table 5.3 that though

MLFER_A, MLFER_BH, MLFER_S, and MLFER_E are found to be more significant in the

features set, no feature is found to be significant in the complete feature space. It is interesting

to note from the coefficients reported in Table 5.3 that solubility is directly proportional to

hydrogen bond acidity and combined polarizability and inversely proportional to the excessive

molar refraction in the complete feature space but with different proportionality constants.

All the data samples are associated with the final set of models obtained in the above

iterative procedure based on the final prediction error. Association of these data samples is

further useful to select a suitable model for any new test sample (i.e. for global test set or for

any novel molecule). Prediction accuracy of the final set of multiple models on model data set,

test data set and on overall data is given in Table 5.4. It can be seen from the table that multiple

models perform better than any other approach. The adjusted RMSE and R2 values demonstrate

that the final set of multiple linear models can be used to predict the solvation free energy of

Quinone molecules. Due to the capability of neural networks in identifying the non-linear

dynamics, the neural network (NN) based QSPR approach is shown to be better than the OLS

approach; however, both approaches were not robust to estimate solvation free energy for a

wide range of Quinone derivatives. It is interesting to note that though the estimates of the GC

method on the model data set are slightly better than those of the NN based approach, NN estimates are better

on the global test set. This can be attributed to the overfitting of contributions in the group

contribution approach.

The solvation free energy values obtained using several approaches are plotted against their

original values in Figure 5.3. It can be verified from the figure that multiple linear models have

better estimates compared to other approaches throughout the range of solvation free energy

values, especially for high solubility molecules (samples having solvation free energy values

ranging between -150 to -310 kJ/mol).

Table 5.3 Details of the final set of multiple models converged

Model   Active features and their coefficients in the final averaged models      No. of samples
1       Apol, Si, McG_Vol, MLFER_BH, MLFER_S, MLFER_E                             127
        [-153.670  51.156  101.901  -30.166  -8.982  24.902]
2       MLFER_A, MLFER_BH, MLFER_S, TSA                                           22
        [-22.249  35.523  -56.263  57.925]
3       Si, MLFER_A, MLFER_E, TopoPSA                                             176
        [-9.876  -5.696  16.634  -56.429]

Table 5.4 Performance metrics of several approaches for solvation free energy estimation
Approach             Data set   RMSE      Adj. RMSE   R2 value   Adj. R2
Group Contribution   Model      21.5371   23.0393     0.7806     0.7488
                     G test     17.2844   24.4439     0.8728     0.7424
                     Over all   20.7505   21.8820     0.8019     0.7796
Single linear        Model      28.5975   28.9561     0.6132     0.6034
                     G test     25.4517   26.7921     0.7242     0.6940
                     Over all   27.9921   28.2714     0.6395     0.6322
Neural network       Model      22.3767   23.4077     0.7632     0.7408
                     G test     16.2358   20.0070     0.8878     0.8285
                     Over all   21.2825   22.0546     0.7916     0.7762
Multiple linear      Model      12.1045   12.4340     0.9307     0.9269
                     G test     15.4585   17.3628     0.8983     0.8712
                     Over all   12.8508   13.1279     0.9240     0.9207

Figure 5.3 Solvation free energy original vs predicted using several approaches

5.4 Conclusion

In this study, various machine learning based QSPR approaches along with the GC approach

have been tested to predict solvation free energy of Quinone derivatives for further exploration

of Quinones in an inverse optimization framework. In this study, we used adjusted root mean

squared error and adjusted R2 values for an unbiased comparison of various approaches since

these approaches are parameter sensitive i.e. a neural network with more hidden layers can

capture highly complex behavior. It is observed from the reported metrics that multiple models

based QSPR approach performs better than the other approaches. It is identified from the GC

approach that substituting hydrogen atoms with groups like PO3H2, COOH, etc. can increase

the solubility of Quinones. Group contribution approach estimates are restricted by the number

of groups considered for this case study. These estimates can be further improved by assuming

non-linear contributions of groups. It can be observed from the QSPR case studies, that

structural features like overall hydrogen bond basicity (MLFER_BH) and acidity (MLFER_A),

combined polarizability (MLFER_S) and excessive molar refraction (MLFER_E) of the solute

affect the aqueous solubility more compared to the other structural features. This work can be

further extended to obtain a robust model to estimate reduction potential of Quinone molecules

and by using these two models other Quinone variants can be explored in an inverse multi-

objective optimization framework to obtain potential molecules for flow battery applications.

In this chapter, we have tested the efficacy of the clustering approach proposed in Chapter

4 to obtain the underlying partitions and significant features in each partition in a single

framework to improve the performance of prediction error based clustering approaches. Taking

note of the superior performance of piecewise linear models compared to the single linear and

neural network based approaches for model identification, in the next chapter, we propose a

piecewise SVM classifier based on prediction error to obtain nonlinear boundaries for binary

classification problems.

CHAPTER 6

An adaptive prediction error based multiple model SVM

classifier for binary classification problems

In recent years, the usage of machine learning has been increasing exponentially in several fields.

Use of machine learning for binary classification problems is of particular interest in multiple

areas. For example, predicting whether a patient is suffering from a disease using image

recognition or whether a loan can be issued to an applicant based on his/her past credit history

are examples of binary classification problems. Some of the widely used classifiers for general

classification problems (including the binary variety) can be seen in Figure 6.1. Support vector

machines (SVMs) [152] are useful for modeling both linear and non-linear boundaries between

the classes using kernels but the performance of SVM depends on the kernels chosen [153] and

there are no heuristics for selecting a suitable kernel for a given problem. Neural networks

have the flexibility to mimic highly non-linear functional behavior but the performance is

highly dependent on the selected network architecture and algorithmic parameters [154].

Interpretability is another drawback for neural network based classifiers.

6.1 Literature survey

Decision trees [155] and Random forests [156] are logic based classifiers, where the data is

divided into classes based on certain conditions at each node. Though logic based classifiers

have a simple design and good interpretability, they are very sensitive to the data. Piecewise classifiers [157] assume that the data in different regions of the feature space behave differently

and hence a classifier is derived for each subregion. Most of the piecewise classifiers assume

the number of regions as known, which may not be true in most real-life case studies. Logistic

regression [158] is a statistical classifier, which provides the probability for a data sample to

belong to a certain class. Though it takes less training time and provides good interpretability,

it underperforms in case of complex behavior between features and output. Kotsiantis et al.

[159] provided a detailed review of various classification approaches.

Multiple classifier systems (ensemble of classifiers) have also been proposed for various

applications based on the motivation that a pool of classifiers together can give better

predictions than a standalone classifier. Any multiple classifier system such as bagging,

random subspace, and class switching, etc. initially generate modified training data sets based

on certain criteria. Different classifiers are then built based on these data sets and the results are combined into a final decision, usually based on majority voting [160]. Bagging [161]

randomly generates a predefined number of training sets and each data set is used to build a

classifier. Random subspace [162] trains a predefined number of classifiers with a different

subset of features each time. It is anticipated that the classification boundaries are combined in

some way to mimic the original classification boundary. Nanni and Lumini [160], [163]

examined various ensemble classifier approaches to predict credit score and biometric

verification and concluded that the random subspace approach is more efficient than the other approaches. Several attempts have been made to combine SVM with various techniques such as

decision trees [164], [165], Markov models [166] and particle swarm optimization [167].

Ghodselahi [168] proposed a hybrid classifier combining SVM classification approach and

fuzzy C-means clustering algorithm to predict credit score. Initially, the data is clustered into

a predefined number of clusters using FCM approach and an SVM classifier is trained on the

data in each cluster and these clusters are then combined based on a weighted fusion agents

approach. Weights for each agent (classifier) are computed based on the membership of the

point to the corresponding cluster. Rahman and Tasnim [169] provided a comprehensive

review of various commonly used ensemble classifiers along with some application driven

classifiers.

Sklansky and Michelotti [157] introduced the idea of the piecewise linear classifier using a

two-stage approach, in which they first identify clusters of close-opposed pairs of data samples

and then a decision surface is constructed using adjacency matrix and switching theory. In the

past few decades, piecewise linear (PWL) classifiers have been applied for a broad range of

applications such as intelligent cameras, autonomous mobile robots, portable devices,

automated visual surveillance systems, monitoring systems, and industrial vision systems

[170], etc. to approximate the non-linear decision boundary. The advantages of PWL classifiers

are: easy implementation, low memory requirements and real-time classification [171].

Currently existing PWL classifiers can be categorized into two types. The first category of

methods [157], [172]–[174] follow a two-stage procedure, in which, in the initial stage, they

obtain a classifier in each segment and then in the final stage, they combine the identified

hyperplanes to obtain the final decision boundary. Of these approaches, Bagirov et al. [174]

approach of applying a max-min separability algorithm only on indeterminate regions has

drawn considerable attention due to its simple design and reduced computational complexity.

Initially, a hyper-box is identified for each class and then in the regions where classes overlap

a max-min separability is used to separate these data into classes.

The second category of methods [175]–[177] solves a single optimization problem to obtain the whole piecewise linear boundary. These methods assume that the number of models is known a priori. The optimization problem formulated in such methods is very complex, thus resulting in locally optimal solutions, a major drawback of these methods. Astorino and Gaudioso

[175] proposed a polyhedral based binary classifier, in which data from a particular class is

enclosed using a predefined number of piecewise classifiers and data samples outside the

polyhedral are assumed to belong to the other classes. Bagirov [176] proposed a max-min

separability approach, which is theoretically proven to obtain the global optimal solution for

classification provided the classes are completely separable, i.e., there exists no overlap in the

feature space of any two classes. Huang et al. [177] theoretically proved that any PWL

boundary can be identified using PWL feature mapping. They proposed two PWL classifier

approaches by combining the idea of PWL mapping with the traditional SVM [152] and least

squares SVM [178] and compared the performance with the existing literature approaches on

various synthetic and real data sets. The performance of these algorithms depend on the non-

linear parameters that are selected. Prior knowledge will allow these parameters to be selected

appropriately, tuning the parameters randomly when no prior information is available is usually

difficult. Most of the existing PWL classifiers are designed for binary classification problems.

Classifiers designed for binary classification problems can be further extended to multi-class

by decomposing the multi-class problems into a series of binary classification problems [179].

Kostin [170] proposed a binary proportion based tree approach to extend binary piecewise

classifiers to solve multi-class problems.

In the case of function approximation, several algorithms have been proposed based on both

Euclidian and prediction error based approaches to identify non-linear behavior using

piecewise linear assumptions [79], [80]. In this chapter, we propose a new multiple model

piecewise SVM approach for binary classification problems and test this approach on various

synthetic and real data sets. The chapter is organized as follows: initially, a brief

introduction to classification and various techniques for solving classification problems with

linear, non-linear and piecewise linear boundaries are provided. In section 2, the mathematical

representation of traditional SVM for a binary classification problem is provided. In section 3,

the details of proposed piecewise SVM and a testing strategy are provided. In section 4, the

performance of the proposed algorithm on both synthetic and real data sets are provided.

Finally, this chapter is concluded with comments on the efficiency of the proposed approach.

[Figure 6.1 schematic: a taxonomy of widely used classifiers (SVM, neural networks, statistical classifiers, decision trees, and piecewise classifiers), each illustrated with a representative decision boundary]

Figure 6.1 Some of the widely used machine learning techniques for classification problems

6.2 Support vector machines

Support vector machine (SVM) identifies a hyperplane (linear boundary) such that the distance from the plane to the nearest data samples in both classes is maximum. This maximization problem is scaled and converted to a minimization problem as follows:

\min_{w,b} \; \frac{1}{2} \| w \|^2 \quad \text{s.t.} \quad y_i (w x_i + b) \geq 1, \; i = 1 \ldots N    (5.1)

where x_i is the input vector of sample i and y_i is the output of that sample, which is either 1 or -1, w is the coefficient vector representing the separating hyperplane, and b is the bias. This problem is reformulated using the Lagrange multipliers approach:

\min_{w,b} \; \frac{1}{2} \| w \|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w x_i + b) - 1 \right] \quad \text{s.t.} \quad \alpha_i \geq 0, \; i = 1 \ldots N    (5.2)

At the stationary point, equating the gradient of the primal problem with respect to both w and b to zero gives:

\frac{\partial f}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i    (5.3)

\frac{\partial f}{\partial b} = \sum_{i=1}^{N} \alpha_i y_i = 0    (5.4)

Substituting these values in Equation (5.2), the dual problem is solved as follows:

\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i' x_j \quad \text{s.t.} \quad \alpha_i \geq 0 \;\;\&\;\; \sum_{i=1}^{N} \alpha_i y_i = 0, \; i = 1 \ldots N    (5.5)

The data samples with non-zero α_i in the optimal solution are called support vectors, and the support vector plane w is estimated using the support vectors. In the case of linearly non-separable data, a soft margin SVM is identified such that the number of misclassifications is minimum.

The optimization problem for the soft margin SVM is formulated as follows:

\min_{w,b,\xi} \; \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w x_i + b) \geq 1 - \xi_i \;\;\&\;\; \xi_i \geq 0, \; i = 1 \ldots N    (5.6)

To relax the constraints in Equation (5.1), positive slack variables (ξ_i) are used and the minimization of the slack variables is introduced into the objective function using a weight C. If C is considered as infinite, then the problem tends to the traditional SVM solved earlier in Equation (5.1). As the C value tends to zero, the width of the soft margin increases and the misclassifications are neglected. This problem is reformulated using the Lagrange multipliers approach as follows:

\min_{w,b} \; \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i (w x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i \quad \text{s.t.} \quad \alpha_i, \mu_i, \xi_i \geq 0, \; i = 1 \ldots N    (5.7)

At the stationary point, equating the gradient of the primal with respect to w, b, and ξ to zero results in the following:

w = \sum_{i=1}^{N} \alpha_i y_i x_i; \quad \sum_{i=1}^{N} \alpha_i y_i = 0; \quad \alpha_i = C - \mu_i; \quad \text{s.t.} \quad \alpha_i, \mu_i, \xi_i \geq 0, \; i = 1 \ldots N    (5.8)

Substituting these values in Equation (5.7), the dual problem is solved as follows:

\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i' x_j \quad \text{s.t.} \quad C \geq \alpha_i \geq 0 \;\;\&\;\; \sum_{i=1}^{N} \alpha_i y_i = 0, \; i = 1 \ldots N    (5.9)

The data samples with non-zero α_i in the optimal solution are called support vectors, and the upper bounds C on the α_i are termed box constraints.
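In practice, a soft margin SVM of this form can be fit with an off-the-shelf solver. The snippet below is purely illustrative (the synthetic data and scikit-learn's SVC are assumptions, not part of this work); C is the box constraint of Equation (5.6).

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, nearly linearly separable data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # C: box constraint weight
w, b = clf.coef_[0], clf.intercept_[0]        # separating hyperplane
n_sv = len(clf.support_vectors_)              # samples with non-zero alpha
```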

6.3 Multiple model based SVM

In our earlier work [78], a prediction error based soft clustering approach is proposed to

identify the non-linear behavior using piecewise linear assumption for function approximation

problems. In this clustering algorithm [78], a least squares objective for update of multiple

models is formulated such that the weights for the data samples are proportional to prediction

errors with respect to each model. It is interesting to note that in the case of function

approximation, all the data samples in training set will be considered to obtain the models,

whereas in the case of classification the data samples at boundaries alone (implicitly) will be

considered to obtain the classification boundary. This fact constrains us from modeling the

multiple model classifier in a complete fuzzy framework. In this section, a new hybrid multiple

model based SVM approach is proposed by combining the ideas of soft and hard clustering

procedures. In this approach during update of clusters, it is assumed that a data sample belongs

to all the models that have classified the data sample correctly and will be used for further

update of those models i.e. membership for that data sample for all such models will be

considered as one and for remaining models as zero (fuzzy clustering idea). For the data

samples that are wrongly classified by all the models, these data samples are assigned to the

models that contain the nearest support vector to the data sample i.e. for all the wrongly

classified data samples, membership will be one only for one model and zero for the remaining

models (hard clustering). Once the cluster update is terminated, data samples are assigned to

final SVM models based on hard clustering procedure.

In this approach, the true number of models is not assumed beforehand but identified in an

iterative fashion using the accuracy of the classifier based on the test data set. To avoid local

optimal convergence with respect to the number of models, for each 𝑁 value we test the

accuracy for the next two values i.e. 𝑁 + 1 and 𝑁 + 2 before termination. The pseudo code of

the proposed approach is as follows:

Start: Define the algorithmic parameters, such as the maximum number of multiple models (N_max) and the maximum number of cluster updates (M_iter).

Initialize N with 1

While (N ≤ N_max):

    Divide the training data randomly into N equal partitions

    For iter in 1 to M_iter:

        Obtain SVM models on the data samples that are associated with each model (partition).

        Calculate the prediction error for each data sample with respect to all N SVM models:

        PE_{i,j} = Err_{i,j} \cdot \max(\varepsilon, D_{i,j}), \quad i = 1 \ldots n, \; j = 1 \ldots N    (5.10)

        Err_{i,j} = \begin{cases} 0 & \text{if } Y_{i,j}^{p} = Y_i \\ 1 & \text{if } Y_{i,j}^{p} \neq Y_i \end{cases}; \quad D_{i,j} = \min\left( \bar{D}_{i,j}^{sv} \right); \quad \bar{D}_{i,j}^{sv} = \left\| x_i - \bar{x}_{j,sv} \right\|^2    (5.11)

        where ε is a small value that restricts the error from being zero for wrongly classified data points, \bar{D}_{i,j}^{sv} is the distance vector from data sample i to all the support vectors of model j, and \bar{x}_{j,sv} is the input features corresponding to all the support vectors of model j.

        Calculate the membership for each data sample with respect to all N SVM models:

        \mu_{i,j} = \begin{cases} \begin{cases} 1 & j: PE_{i,j} = 0 \\ 0 & \text{else} \end{cases} & \text{if } \min(PE_i) = 0 \\ \begin{cases} 1 & j: PE_{i,j} = \min(PE_i) \\ 0 & \text{else} \end{cases} & \text{if } \min(PE_i) \neq 0 \end{cases}    (5.12)

        Assign the data samples to each model for which the membership value is 1.
    End

    To assign the data samples to the final SVM models, modify the membership values of all the correctly classified data points as follows:

    \mu_{i,j} = \begin{cases} 1 & j: D_{i,j} = \min(D_i) \\ 0 & \text{else} \end{cases} \quad \text{if } \min(PE_i) = 0, \; i = 1 \ldots n    (5.13)

    Reassign the data samples to each model for which the membership value is 1.

    Obtain the accuracy of the given models on the testing data set using any testing strategy.

    If the accuracy values for two consecutive numbers of models are less than the earlier ones, then break; else, increase N by 1.
End

Report the true number of models and their corresponding SVM models. Stop
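The hybrid soft/hard assignment of Equation (5.12) reduces to a few lines; a sketch, assuming PE is the n-by-N prediction error matrix of Equation (5.10):

```python
import numpy as np

def memberships(PE):
    """Hybrid membership of Equation (5.12) for an (n, N) error matrix PE."""
    n, N = PE.shape
    mu = np.zeros((n, N))
    for i in range(n):
        if PE[i].min() == 0:
            mu[i, PE[i] == 0] = 1.0        # soft: every model that classifies
        else:                              # sample i correctly keeps it
            mu[i, np.argmin(PE[i])] = 1.0  # hard: model with nearest support vector
    return mu
```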

In the proposed algorithm, the boundaries between the final hyperplanes (SVM models) are

not explicitly identified due to the complexity involved; instead, we use a non-parametric

testing approach to identify a suitable hyperplane. We propose a weighted 𝐾 nearest neighbours

(WKNN) approach to identify the model for classifying any new data sample. The weight for

a neighbour is inversely proportional to the distance from the data sample to the neighbour.

Initially, calculate distances from the test sample to all the samples in the training data and

obtain the 𝐾 nearest samples along with the distances. Next, calculate the prediction errors for

all 𝐾 nearest samples with respect to each model. Finally, obtain the overall error for each

model using the following expression and select the model with the least overall error for

classifying the test sample.

OE_j = \sum_{k=1}^{K} W_k \, PE_{k,j}, \quad j = 1 \ldots N_{true}; \quad \text{where} \quad W_k = \frac{1}{\left\| x_k - x_{test} \right\|^2}    (5.14)

6.4 Evaluation of proposed binary classifier

In this section, we evaluate the performance of the proposed approach on various synthetic

and real data sets. Algorithmic parameters for all the case studies are fixed as follows – the

maximum number of piecewise linear models (𝑁𝑚𝑎𝑥 ) is fixed to 10 (interestingly all the case

studies terminated earlier i.e. the true number of piecewise linear models are less than 10), the

maximum number of iterations for cluster update is set to 100 and 𝐾 value for the proposed

WKNN testing approach is fixed to 15.

6.4.1 Synthetic case studies

In this section, we report computational studies that test the efficacy of the proposed

approach using a two-layer testing approach on two synthetic case studies in which the data is

divided using a polynomial boundary. The objective for these tests is to verify if the proposed

piecewise SVM can identify an appropriate number of hyperplanes that can mimic the non-

linear boundary and corresponding model parameters without any tuning parameters and prior

assumptions about the system. Initially, data is randomly divided into ‘model’ and ‘global test’

data sets with 80% and 20% of data samples respectively. Model data set is used to obtain the

piecewise SVM models in association with K-fold (K as 4) validation approach. In each run,

the model data set is again randomly divided into K-equal partitions and each time data in K-1

folds are used to train the model and the remaining to test. The efficacy of obtained models is

tested on both K-fold test data set and global test set using the proposed testing approach and

accuracy values averaged over all the K-folds are reported.

6.4.1.1 Case study 1 (second-order polynomial)

This synthetic case study contains two input variables and the boundary between the two

classes is a second order polynomial. We obtain the non-linear boundary using piecewise SVM

without any kernel assumptions using the proposed algorithm. 1000 samples of the input features x_1 and x_2 are randomly generated using a uniform distribution in the range of [-5, 5]. If a particular data sample is in the positive half space of the polynomial x_1^2 - x_2 = 5, then it belongs to class 1, otherwise to class 2. Out of the 1000 random samples, 566 belong to class 1 and the remaining 434 to class 2.
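For reference, this data set can be regenerated as follows (the random seed is arbitrary):

```python
import numpy as np

# Case study 1: 1000 uniform samples in [-5, 5]^2, labelled by the sign of
# x1^2 - x2 - 5 (class 1 in the positive half space, class 2 otherwise).
rng = np.random.default_rng(42)
X = rng.uniform(-5, 5, size=(1000, 2))
y = np.where(X[:, 0] ** 2 - X[:, 1] > 5, 1, 2)
```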

The distribution of data and the averaged accuracy on K-fold train, test, and global tests are

provided in Figure 6.2. It can be observed from the bar chart that the averaged accuracies for

K-fold test set are 0.6288, 0.9175, 0.9125, 0.9388, 0.8875, and 0.8988 for the number of models being 1 to 6 respectively; hence it is concluded that the non-linear boundary obtained using four piecewise linear models is the optimal solution for the given data. It can be observed from the averaged accuracies of the K-fold test set and the global test set (0.9262 and 0.9162 for the number of models as 2 and 3 respectively) that the second order polynomial curvature can be better expressed using two models compared to three for the given data.

Figure 6.2 Data distribution and averaged accuracy versus number of models for case study 1

6.4.1.2 Case study 2 (third-order polynomial)

This synthetic case study contains three input variables, and the boundary between the two classes is a third-order polynomial. We obtain the non-linear boundary using the proposed piecewise SVM algorithm without any kernel assumptions. 2000 samples of the input features x1, x2, and x3 are randomly generated using a uniform distribution in the range [-5, 5]. If a particular data sample lies in the positive half space of the polynomial x1^3 + x2^2 - 2x3^2 + 2x1x2x3 + 4x1x3 - 3x1 + 2x2 = 0, then it belongs to class 1, otherwise to class 2. Out of the 2000 random samples, 1160 belong to class 1 and the remaining 840 to class 2. The distribution of the data and the averaged accuracy of the K-fold train, test, and global test sets are provided in Figure 6.3. It can be observed from the bar chart that the averaged accuracies for the K-fold test set are 0.6594, 0.9175, 0.9200, 0.9156, and 0.9200 for the number of models ranging from 1 to 5, respectively; hence, it is concluded that three piecewise linear models are sufficient to mimic the non-linear boundary between the two classes (five models achieve the same test accuracy at a higher computational cost).

6.4.2 Real-life case studies

The efficacy of the proposed approach is tested on several real-life data sets obtained from the UCI machine learning repository [180]. We compare the performance of the proposed approach with accuracies reported in the literature. Some of the data sets are already divided into train and test sets; for the remaining data sets, to be consistent with the literature methods, we randomly divide the data into train and test sets with the same number of data samples as in the literature. Accuracy values for the test data of the different data sets using the various approaches are reported in Table 6.1. For the sake of brevity, we only report the accuracy values with the true number of models identified using the proposed approach. The accuracy values reported for the existing approaches are taken from Huang et al. [177]. It can be observed from the table that the proposed approach compares favourably with the other approaches on most of the real-life data sets.

Table 6.1 Accuracy of various approaches on the test data of real-life data sets

                       No. of samples                       Huang et al. [177]     Proposed
Data set        n      Train    Test    kNN     Adaboost   pwl-svm   pwl-csvm   approach   N_true
Monk1           6      124      432     0.828   0.692      0.750     0.736      0.833      7
Monk2           6      169      432     0.815   0.604      0.765     0.769      0.771      3
Monk3           6      122      432     0.824   0.940      0.972     0.972      0.889      2
Haberman        3      153      153     0.673   0.765      0.765     0.758      0.791      4
Ionosphere      33     176      175     0.857   0.867      0.857     0.829      0.880      4

6.5 Conclusions

In this work, a prediction-error based piecewise SVM approach is proposed to identify the non-linear boundary between the two classes without any assumptions about the system. The proposed algorithm uses a combination of soft and hard membership in its update rule. The efficacy of the proposed approach is tested on various synthetic and real-life case studies. A two-layer testing approach along with K-fold validation allows the algorithm to identify robust models. The accuracy values on the real-life case studies show that the proposed algorithm compares favourably with the existing approaches. It is interesting to note that, for the second-order polynomial case study, increasing the number of piecewise models from two to three decreased the accuracy, but further increasing it to four resulted in a better solution. It should be noted that increasing the number of piecewise models increases the training and testing time; hence, there is a trade-off between increased accuracy and computational effort.

Figure 6.3 Data distribution and averaged accuracy versus number of models for case study 2

CHAPTER 7

Conclusions

In this chapter, we provide key observations summarizing the work done in this thesis.

7.1 Incorporation of process information in the PCA framework

In chapter 2, we proposed model identification schemes to incorporate available process

information in the PCA framework. The efficacy of the proposed approaches are tested on

several case studies and results suggest an improvement over conventional PCA in terms of

better identification of the true underlying models. Further, models obtained using proposed

approaches are found to perform better than traditional PCA models for fault identification.

7.2 Prediction of drug solubility in binary solvent systems

In chapter 3, a generalized Jouyban-Acree model [15] is used to predict the solubility of a solute (i.e., a drug molecule) in a binary solvent system when the pure solubility values in both solvents are known. The original model cannot be used to predict solubility for systems whose model parameters have not been estimated earlier. In this work, we generalize the model parameters as functions of the structural features of the compounds involved in the system. Once these generalized models are estimated, information about the structural features and pure solubility values can be used to predict the solubility for any new solute and mixed solvent system at a given temperature. In essence, generalizing the model parameters as functions of structural features provides the flexibility to extrapolate the functional behavior of drug solubility with respect to those structural features. The framework for the generalization of the Jouyban-Acree model [15] using

machine learning approaches is provided in Figure 7.1.

Figure 7.1 Generalization of the first-principles Jouyban-Acree model using machine learning approaches to predict drug solubility

A genetic algorithm is used to identify

significant features. It is assumed that the solubility values are piecewise linearly dependent on the structural features, and the model coefficients are identified using a modified PE-based clustering algorithm. A two-layer testing approach is used to test the robustness of the obtained models. A comparison of the MPD values obtained using the final set of multiple models for various binary systems with the existing approaches suggests that the final set of models can be used to predict the solubility of new drugs in a wide variety of binary solvent systems.
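For reference, the Jouyban-Acree model referred to above has the standard published form below (reproduced here, not a new result), where f1 and f2 are the solvent fractions, x_{1,T} and x_{2,T} are the solubilities in the pure solvents at temperature T, and the J_i are the interaction constants that chapter 3 generalizes as functions of structural features:

```latex
\ln x_{m,T} = f_1 \ln x_{1,T} + f_2 \ln x_{2,T}
            + \frac{f_1 f_2}{T} \sum_{i=0}^{2} J_i \,(f_1 - f_2)^i
```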

7.3 Prediction error based clustering approach with statistical analysis

In chapter 4, we propose a clustering approach to identify the input partitions and the significant features in each partition, along with the individual model parameters. The proposed clustering approach does not assume any information regarding the number of models or the model orders. The proposed approach is examined on various benchmark case studies to demonstrate that the true model orders can be identified in each of the partitions. It is also noted that the efficacy of the clustering approach increases due to the removal of insignificant variables in each phase. The proposed approach also provides very interesting insights about the process, such as delayed responses and redundant variables.
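A minimal sketch of the prediction-error based reassignment at the core of this approach is given below, assuming linear models fit by least squares; the model-merging step and the statistical tests of chapter 4 are omitted, and all names are illustrative.

```python
import numpy as np

def pe_cluster(X, y, n_models, n_iter=100, seed=0):
    """Alternate least-squares fits with hard, prediction-error based
    reassignment of samples to models (merging and statistical tests omitted)."""
    rng = np.random.default_rng(seed)
    Xa = np.column_stack([X, np.ones(len(y))])      # append an intercept column
    labels = rng.integers(n_models, size=len(y))    # random initial membership
    for _ in range(n_iter):
        thetas = []
        for j in range(n_models):
            idx = labels == j
            if idx.sum() >= Xa.shape[1]:            # enough samples to fit
                thetas.append(np.linalg.lstsq(Xa[idx], y[idx], rcond=None)[0])
            else:                                   # degenerate cluster: dummy model
                thetas.append(np.zeros(Xa.shape[1]))
        PE = np.column_stack([(y - Xa @ t) ** 2 for t in thetas])
        new = PE.argmin(axis=1)                     # reassign to least-error model
        if np.array_equal(new, labels):             # converged
            break
        labels = new
    return labels, thetas
```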

7.4 Prediction of solvation free energy of Quinone derivatives

Quinone derivatives are in demand as electrolytes for flow battery technology due to their ability to transfer two electrons, which results in high energy densities compared to the well-established vanadium redox flow battery technology. In order to explore Quinone derivatives with multiple functional groups in an inverse optimization framework, we need to establish robust and quick structure-property relationships to predict solubility, reduction potential, etc. In chapter 5, we demonstrated the efficacy of several structure-property based relationships, such as group contribution (GC) and QSPR approaches, for predicting the solvation free energy of Quinone derivatives. Though the group contribution approach proved to be adequate when compared to single linear and neural network based QSPR, the applicability of the GC approach is limited to compounds that consist only of groups from a predefined set. Multiple-model based QSPR proved to be the most efficient of all the approaches due to its ability to operate with different model structures (i.e., different significant features) in different partitions. The structure-property relationship frameworks used in this work are depicted in Figure 7.2. Though these frameworks are used to estimate the solvation free energy of Quinones in this work, they are general in nature and can be useful in other problems dealing with structure-property predictions.
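The GC estimate in Figure 7.2 is a simple weighted sum of group counts, f(X) = C0 + Σ Ni Ci; a minimal sketch follows, with purely hypothetical group coefficients (not fitted values from this work).

```python
# Group-contribution estimate f(X) = C0 + sum_i N_i * C_i.
# All coefficient values below are hypothetical placeholders, not fitted values.
C0 = -1.20
contrib = {"-F": -0.35, "-Cl": -0.10, "-C2H3": 0.20, "-CHO": -0.55}
counts = {"-F": 2, "-CHO": 1}          # group occurrences in a candidate molecule

f_X = C0 + sum(n * contrib[g] for g, n in counts.items())
print(f_X)                             # estimated property, e.g. solvation free energy
```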

7.5 Piecewise linear SVM

SVMs are among the most widely used binary classifiers due to their capability of identifying complex boundaries between the classes using kernels. In chapter 6, we proposed a piecewise SVM to identify non-linear boundaries in binary classification problems. The proposed approach is tested on both synthetic and real-life case studies to show the efficacy of piecewise SVM models in mimicking boundaries characterized by polynomial functions of second and third order as well as other realistic non-linear functions. The prediction accuracies suggest that the proposed PE-based piecewise SVM compares favourably with the existing literature approaches. Further, the proposed approach does not require any prior knowledge about the data and is hence useful for a wide range of applications.

Figure 7.2 Various structure-property relationship based frameworks to estimate the properties of chemical compounds: the group contribution (GC) approach, f(X) = C0 + Σ(i=1..n) Ni Ci, built on group counts (e.g., -F, -Cl, -C2H3, -CHO), and QSPR approaches (linear, nonlinear, and piecewise linear) built on descriptors such as MW, TSA, HOMO, IP, and atom and bond counts

7.6 Future scope

In this section, we highlight some of the future prospects based on the work done in this thesis.

 Initially, we discussed various approaches to incorporate process information in the case of linear model identification using PCA. Incorporating such information in a suitable framework for non-linear model identification is of future interest.

 In the case of the prediction of drug solubility, the Jouyban-Acree model [15] is designed assuming that the solubility of the solute in the pure solvents is known. This information will not be available for novel drugs, which are yet to be synthesized. So, obtaining a generalized QSPR to estimate pure solubility values will be useful for designing novel drugs in a computational framework.

 In the case of the prediction-error based clustering approach with statistical analysis, we initialize with a sufficiently high number of models and merge like models to identify the true number of models. An alternative to this approach is to identify the true number of models in an incremental fashion, based on the prediction errors of the obtained models on a test data set in each iteration.

 A robust QSPR has been identified to predict solvation free energy of Quinone

derivatives. Obtaining structure-property relationship to identify other properties of

interest such as reduction potential will be beneficial to explore Quinone derivatives

in an inverse multi-objective optimization framework.

 Incorporation of domain knowledge in classification problems is a promising future

area for research.

REFERENCES

[1] J. H. Lee, J. Shin, and M. J. Realff, “Machine learning: Overview of the recent
progresses and implications for the process systems engineering field,” Comput. Chem.
Eng., vol. 114, pp. 111–121, 2018.
[2] Y. Han, Q. Zeng, Z. Geng, and Q. Zhu, “Energy management and optimization modeling
based on a novel fuzzy extreme learning machine: Case study of complex petrochemical
industries,” Energy Convers. Manag., vol. 165, pp. 163–171, 2018.
[3] R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, and C. Kim, “Machine
learning in materials informatics: recent applications and prospects,” npj Comput.
Mater., vol. 3, no. 1, p. 54, 2017.
[4] J. Lee, H. Davari, J. Singh, and V. Pandhare, “Industrial Artificial Intelligence for
industry 4.0-based manufacturing systems,” Manuf. Lett., vol. 18, pp. 20–23, 2018.
[5] T. B. Trafalis and H. Ince, “Support vector machine for regression and applications to
financial forecasting,” in Proceedings of the IEEE-INNS-ENNS International Joint
Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and
Perspectives for the New Millennium, 2000, vol. 6, pp. 348–353 vol.6.
[6] A. Lavecchia, “Machine-learning approaches in drug discovery: methods and
applications,” Drug Discov. Today, vol. 20, no. 3, pp. 318–331, 2015.
[7] M. W. Libbrecht and W. S. Noble, “Machine learning applications in genetics and
genomics,” Nat. Rev. Genet., vol. 16, p. 321, May 2015.
[8] P. Czop, G. Kost, D. Sławik, and G. Wszołek, “Formulation and identification of first-
principle data-driven models,” J. Achiev. Mater. Manuf. Eng., vol. 44, no. 2, pp. 179–
186, 2011.
[9] M. von Stosch, R. Oliveira, J. Peres, and S. Feyo de Azevedo, “Hybrid semi-parametric
modeling in process systems engineering: Past, present and future,” Comput. Chem.
Eng., vol. 60, pp. 86–101, 2014.
[10] W. H. Joerding and J. L. Meador, “Encoding a priori information in feedforward
networks,” Neural Networks, vol. 4, no. 6, pp. 847–856, 1991.
[11] D. C. Psichogios and L. H. Ungar, “A hybrid neural network‐first principles approach
to process modeling,” AIChE J., vol. 38, no. 10, pp. 1499–1511, 1992.
[12] H.-T. Su, N. Bhat, P. A. Minderman, and T. J. McAvoy, “Integrating neural networks
with first principles models for dynamic modeling,” in Dynamics and Control of
Chemical Reactors, Distillation Columns and Batch Processes, Elsevier, 1993, pp. 327–
332.
[13] S. Milanic, S. Strmcnik, D. Sel, N. Hvala, and R. Karba, “Incorporating prior knowledge
into artificial neural networks—an industrial case study,” Neurocomputing, vol. 62, pp.
131–151, 2004.
[14] O. Kahrs and W. Marquardt, “The validity domain of hybrid models and its application
in process optimization,” Chem. Eng. Process. Process Intensif., vol. 46, no. 11, pp.
1054–1066, 2007.

[15] A. Jouyban-Gharamaleki and W. E. Acree Jr, “Comparison of models for describing
multiple peaks in solubility profiles,” Int. J. Pharm., vol. 167, no. 1, pp. 177–182, 1998.
[16] S. Chen, S. A. Billings, and P. M. Grant, “Non-linear system identification using neural
networks,” Int. J. Control, vol. 51, no. 6, pp. 1191–1214, 1990.
[17] K. S. Narendra and K. Parthasarathy, “Identification and control of dynamical systems
using neural networks,” IEEE Trans. Neural Networks, vol. 1, no. 1, pp. 4–27, 1990.
[18] W. Xiong, L. Chen, F. Liu, and B. Xu, “Multiple model identification for a high purity
distillation column process based on EM algorithm,” Math. Probl. Eng., vol. 2014, 2014.
[19] B. Zhang and Z. Mao, “Modeling and control of Wiener systems using multiple models
and neural networks: application to a simulated pH process,” Ind. Eng. Chem. Res., vol.
55, no. 38, pp. 10147–10159, 2016.
[20] S. W. Choi, C. Lee, J.-M. Lee, J. H. Park, and I.-B. Lee, “Fault detection and
identification of nonlinear processes based on kernel PCA,” Chemom. Intell. Lab. Syst.,
vol. 75, no. 1, pp. 55–67, 2005.
[21] U. Kruger, Y. Zhou, and G. W. Irwin, “Improved principal component monitoring of
large-scale processes,” J. Process Control, vol. 14, no. 8, pp. 879–888, 2004.
[22] J.-M. Lee, C. Yoo, S. W. Choi, P. A. Vanrolleghem, and I.-B. Lee, “Nonlinear process
monitoring using kernel principal component analysis,” Chem. Eng. Sci., vol. 59, no. 1,
pp. 223–234, 2004.
[23] M. R. Maurya, R. Rengaswamy, and V. Venkatasubramanian, “Fault diagnosis by
qualitative trend analysis of the principal components,” Chem. Eng. Res. Des., vol. 83,
no. 9, pp. 1122–1132, 2005.
[24] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” J. Comput.
Graph. Stat., vol. 15, no. 2, pp. 265–286, 2006.
[25] J. Shi and W. Song, “Sparse principal component analysis with measurement errors,” J.
Stat. Plan. Inference, vol. 175, pp. 87–99, 2016.
[26] D. Shen, H. Shen, and J. S. Marron, “Consistency of sparse PCA in high dimension, low
sample size contexts,” J. Multivar. Anal., vol. 115, pp. 317–333, 2013.
[27] I. Jolliffe, Principal component analysis. Wiley Online Library, 2005.
[28] G. Chen and S.-E. Qian, “Denoising of hyperspectral imagery using principal
component analysis and wavelet shrinkage,” IEEE Trans. Geosci. Remote Sens., vol. 49,
no. 3, pp. 973–980, 2011.
[29] L. Zhang, W. Dong, D. Zhang, and G. Shi, “Two-stage image denoising by principal
component analysis with local pixel grouping,” Pattern Recognit., vol. 43, no. 4, pp.
1531–1549, 2010.
[30] W. Ku, R. H. Storer, and C. Georgakis, “Disturbance detection and isolation by dynamic
principal component analysis,” Chemom. Intell. Lab. Syst., vol. 30, no. 1, pp. 179–196,
1995.
[31] S. Narasimhan and S. L. Shah, “Model identification and error covariance matrix
estimation from noisy data using PCA,” Control Eng. Pract., vol. 16, no. 1, pp. 146–
155, 2008.
[32] J. C. Liao, R. Boscolo, Y.-L. Yang, L. M. Tran, C. Sabatti, and V. P. Roychowdhury,
“Network component analysis: reconstruction of regulatory signals in biological

systems,” Proc. Natl. Acad. Sci., vol. 100, no. 26, pp. 15522–15527, 2003.
[33] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, “Singing-voice
separation from monaural recordings using robust principal component analysis,” in
2012 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2012, pp. 57–60.
[34] N. Locantore et al., “Robust principal component analysis for functional data,” Test, vol. 8, no. 1, pp. 1–73, 1999.
[35] F. De la Torre and M. J. Black, “Robust principal component analysis for computer
vision,” in Proceedings Eighth IEEE International Conference on Computer Vision.
ICCV 2001, 2001, vol. 1, pp. 362–369.
[36] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?,” J.
ACM, vol. 58, no. 3, p. 11, 2011.
[37] R. Jenatton, G. Obozinski, and F. Bach, “Structured sparse principal component
analysis,” in Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics, 2010, pp. 366–373.
[38] X. Qi, R. Luo, and H. Zhao, “Sparse principal component analysis by choice of norm,”
J. Multivar. Anal., vol. 114, pp. 127–160, 2013.
[39] C. R. Rao, “The use and interpretation of principal component analysis in applied
research,” Sankhyā Indian J. Stat. Ser. A, pp. 329–358, 1964.
[40] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin, “A modified principal component
technique based on the LASSO,” J. Comput. Graph. Stat., vol. 12, no. 3, pp. 531–547,
2003.
[41] D. M. Witten, R. Tibshirani, and T. Hastie, “A penalized matrix decomposition, with
applications to sparse principal components and canonical correlation analysis,”
Biostatistics, vol. 10, no. 3, pp. 515–534, 2009.
[42] R. W. Serth and W. A. Heenan, “Gross error detection and data reconciliation in steam‐
metering systems,” AIChE J., vol. 32, no. 5, pp. 733–742, 1986.
[43] S. Sun, D. Huang, and Y. Gong, “Gross error detection and data reconciliation using
historical data,” Procedia Eng., vol. 15, pp. 55–59, 2011.
[44] J. J. Downs and E. F. Vogel, “A plant-wide industrial process control problem,” Comput.
Chem. Eng., vol. 17, no. 3, pp. 245–255, 1993.
[45] Y. Kawabata, K. Wada, M. Nakatani, S. Yamada, and S. Onoue, “Formulation design
for poorly water-soluble drugs based on biopharmaceutics classification system: Basic
approaches and practical applications,” Int. J. Pharm., vol. 420, no. 1, pp. 1–10, 2011.
[46] A. K. Nayak and P. P. Panigrahi, “Solubility enhancement of etoricoxib by cosolvency
approach,” ISRN Phys. Chem., vol. 2012, no. Article ID 820653, p. 5 pages, 2012.
[47] Z. Li and P. I. Lee, “Investigation on drug solubility enhancement using deep eutectic
solvents and their derivatives,” Int. J. Pharm., vol. 505, no. 1, pp. 283–288, 2016.
[48] T. Loftsson, “Drug solubilization by complexation,” Int. J. Pharm., vol. 531, no. 1, pp.
276–280, 2017.
[49] D. P. Elder, R. Holm, and H. L. de Diego, “Use of pharmaceutical salts and cocrystals
to address the issue of poor solubility,” Int. J. Pharm., vol. 453, no. 1, pp. 88–100, 2013.

[50] K. T. Savjani, A. K. Gajjar, and J. K. Savjani, “Drug Solubility: Importance and
Enhancement Techniques,” ISRN Pharm., vol. 2012, p. 195727, Jul. 2012.
[51] V. R. Vemula, V. Lagishetty, and S. Lingala, “Solubility enhancement techniques,” Int.
J. Pharm. Sci. Rev. Res., vol. 5, no. 1, pp. 41–51, 2010.
[52] L. Di, P. V Fish, and T. Mano, “Bridging solubility between drug discovery and
development,” Drug Discov. Today, vol. 17, no. 9, pp. 486–495, 2012.
[53] H. D. Williams et al., “Strategies to Address Low Drug Solubility in Discovery and
Development,” Pharmacol. Rev., vol. 65, no. 1, pp. 315 LP – 499, Jan. 2013.
[54] W. L. Jorgensen and E. M. Duffy, “Prediction of drug solubility from Monte Carlo
simulations,” Bioorg. Med. Chem. Lett., vol. 10, no. 11, pp. 1155–1158, 2000.
[55] W. L. Jorgensen and E. M. Duffy, “Prediction of drug solubility from structure,” Adv.
Drug Deliv. Rev., vol. 54, no. 3, pp. 355–366, 2002.
[56] Y. Ran and S. H. Yalkowsky, “Prediction of Drug Solubility by the General Solubility
Equation (GSE),” J. Chem. Inf. Comput. Sci., vol. 41, no. 2, pp. 354–357, Mar. 2001.
[57] J. S. Delaney, “Predicting aqueous solubility from structure,” Drug Discov. Today, vol.
10, no. 4, pp. 289–295, 2005.
[58] A. Lusci, G. Pollastri, and P. Baldi, “Deep Architectures and Deep Learning in
Chemoinformatics: The Prediction of Aqueous Solubility for Drug-Like Molecules,” J.
Chem. Inf. Model., vol. 53, no. 7, pp. 1563–1575, Jul. 2013.
[59] A. Jouyban, “Review of the cosolvency models for predicting solubility of drugs in
water-cosolvent mixtures,” J. Pharm. Pharm. Sci., vol. 11, no. 1, pp. 32–58, 2008.
[60] A. Maitra and S. Bagchi, “Study of solute–solvent and solvent–solvent interactions in
pure and mixed binary solvents,” J. Mol. Liq., vol. 137, no. 1, pp. 131–137, 2008.
[61] S. H. Yalkowsky and T. J. Roseman, Techniques of solubilization of drugs. M. Dekker,
1981.
[62] W. E. Acree Jr, “Mathematical representation of thermodynamic properties: Part 2.
Derivation of the combined nearly ideal binary solvent (NIBS)/Redlich-Kister
mathematical representation from a two-body and three-body interactional mixing
model,” Thermochim. Acta, vol. 198, no. 1, pp. 71–79, 1992.
[63] A. Jouyban-Gharamaleki and J. Hanaee, “A novel method for improvement of
predictability of the CNIBS/R-K equation,” Int. J. Pharm., vol. 154, no. 2, pp. 245–247,
1997.
[64] C.-C. Chen and Y. Song, “Solubility Modeling with a Nonrandom Two-Liquid Segment
Activity Coefficient Model,” Ind. Eng. Chem. Res., vol. 43, no. 26, pp. 8354–8362, Dec.
2004.
[65] E. Mullins, Y. A. Liu, A. Ghaderi, and S. D. Fast, “Sigma Profile Database for Predicting
Solid Solubility in Pure and Mixed Solvent Mixtures for Organic Pharmacological
Compounds with COSMO-Based Thermodynamic Methods,” Ind. Eng. Chem. Res., vol.
47, no. 5, pp. 1707–1725, Mar. 2008.
[66] E. Sheikholeslamzadeh and S. Rohani, “Solubility Prediction of Pharmaceutical and
Chemical Compounds in Pure and Mixed Solvents Using Predictive Models,” Ind. Eng.
Chem. Res., vol. 51, no. 1, pp. 464–473, Jan. 2012.
[67] P. B. Kokitkar, E. Plocharczyk, and C.-C. Chen, “Modeling Drug Molecule Solubility

to Identify Optimal Solvent Systems for Crystallization,” Org. Process Res. Dev., vol.
12, no. 2, pp. 249–256, Mar. 2008.
[68] C.-C. Shu and S.-T. Lin, “Prediction of Drug Solubility in Mixed Solvent Systems Using
the COSMO-SAC Activity Coefficient Model,” Ind. Eng. Chem. Res., vol. 50, no. 1, pp.
142–147, Jan. 2011.
[69] M. Valavi, M. Svärd, and Å. C. Rasmuson, “Prediction of the Solubility of Medium-
Sized Pharmaceutical Compounds Using a Temperature-Dependent NRTL-SAC
Model,” Ind. Eng. Chem. Res., vol. 55, no. 42, pp. 11150–11159, Oct. 2016.
[70] A. Jouyban, N. Y. K. Chew, H.-K. Chan, M. Sabour, and W. E. Acree Jr, “A unified
cosolvency model for calculating solute solubility in mixed solvents,” Chem. Pharm.
Bull., vol. 53, no. 6, pp. 634–637, 2005.
[71] A. Jouyban, S. Soltanpour, S. Soltani, E. Tamizi, M. A. A. Fakhree, and W. E. Acree,
“Prediction of drug solubility in mixed solvents using computed Abraham parameters,”
J. Mol. Liq., vol. 146, no. 3, pp. 82–88, 2009.
[72] A. Jouyban and M. A. A. Fakhree, “Experimental and Computational Methods
Pertaining to Drug Solubility,” Rijeka: InTech, 2012, p. Ch. 9.
[73] A. R. Katritzky et al., “Quantitative Correlation of Physical and Chemical Properties
with Chemical Structure: Utility for Prediction,” Chem. Rev., vol. 110, no. 10, pp. 5714–
5789, Oct. 2010.
[74] K. Roy, S. Kar, and R. N. Das, A primer on QSAR/QSPR modeling: Fundamental
concepts. Springer, 2015.
[75] M. Goodarzi, B. Dejaegher, and Y. Vander Heyden, “Feature selection methods in
QSAR studies,” J. AOAC Int., vol. 95, no. 3, pp. 636–651, 2012.
[76] K. Roy, S. Kar, and R. N. Das, “QSAR/QSPR Modeling: Introduction BT - A Primer
on QSAR/QSPR Modeling: Fundamental Concepts,” K. Roy, S. Kar, and R. N. Das,
Eds. Cham: Springer International Publishing, 2015, pp. 1–36.
[77] S. Yousefinejad and B. Hemmateenejad, “Chemometrics tools in QSAR/QSPR studies:
A historical perspective,” Chemom. Intell. Lab. Syst., vol. 149, pp. 177–204, 2015.
[78] V. Kuppuraj and R. Rengaswamy, “Evaluation of prediction error based fuzzy model
clustering approaches for multiple model learning,” Int. J. Adv. Eng. Sci. Appl. Math.,
vol. 4, no. 1–2, pp. 10–21, 2012.
[79] A. A. Adeniran and S. El Ferik, “Modeling and Identification of Nonlinear Systems: A
Review of the Multimodel Approach;Part 1,” IEEE Trans. Syst. Man, Cybern. Syst., vol.
47, no. 7, pp. 1149–1159, 2017.
[80] S. El Ferik and A. A. Adeniran, “Modeling and Identification of Nonlinear Systems: A
Review of the Multimodel Approach;Part 2,” IEEE Trans. Syst. Man, Cybern. Syst., vol.
47, no. 7, pp. 1160–1168, 2017.
[81] C. W. Yap, “PaDEL‐descriptor: An open source software to calculate molecular
descriptors and fingerprints,” J. Comput. Chem., vol. 32, no. 7, pp. 1466–1474, 2011.
[82] ChemAxon Ltd., “Instant JChem/MarvinSketch,” 2012.
[83] S. Chinta, A. Sivaram, and R. Rengaswamy, “Prediction error-based clustering approach
for multiple-model learning using statistical testing,” Eng. Appl. Artif. Intell., vol. 77,
pp. 125–135, 2019.

[84] A. Jouyban et al., “Solubility Prediction of Drugs in Mixed Solvents Using Partial
Solubility Parameters,” J. Pharm. Sci., vol. 100, no. 10, pp. 4368–4382, Oct. 2011.
[85] A. Jouyban, M.-R. Majidi, H. Jalilzadeh, and K. Asadpour-Zeynali, “Modeling drug
solubility in water–cosolvent mixtures using an artificial neural network,” Farm., vol.
59, no. 6, pp. 505–512, 2004.
[86] A. Jouyban, M. A. A. Fakhree, T. Ghafourian, A. A. Saei, and W. E. Acree, “Deviations
of drug solubility in water-cosolvent mixtures from the Jouyban-Acree model–effect of
solute structure,” Die Pharm. Int. J. Pharm. Sci., vol. 63, no. 2, pp. 113–121, 2008.
[87] R. Murray-Smith and T. A. Johansen, Eds., Multiple Model Approaches to Modelling and Control. London: Taylor and Francis, 1997.
[88] H. Frigui and R. Krishnapuram, “A robust competitive clustering algorithm with
applications in computer vision,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no.
5, pp. 450–465, 1999.
[89] G. Danuser and M. Stricker, “Parametric model fitting: From inlier characterization to
outlier detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 263–280,
1998.
[90] V. Cherkassky and Y. Ma, “Multiple Model Estimation: A New Formulation for
Predictive Learning,” under review, IEEE Trans. Neural Netw., 2002.
[91] A. N. Venkat and R. D. Gudi, “Fuzzy segregation-based identification and control of
nonlinear dynamic systems,” Ind. Eng. Chem. Res., vol. 41, no. 3, pp. 538–552, 2002.
[92] M. A. Henson and D. E. Seborg, “Nonlinear control strategies for continuous
fermenters,” Chem. Eng. Sci., vol. 47, no. 4, pp. 821–835, 1992.
[93] R. Pickhardt, “Adaptive control of a solar power plant using a multi-model,” IEE Proc.
- Control Theory Appl., vol. 147, no. 5, pp. 493–500, 2000.
[94] J. G. Balchen, D. Ljungquist, and S. Strand, “State—space predictive control,” Chem.
Eng. Sci., vol. 47, no. 4, pp. 787–807, 1992.
[95] W. S. DeSarbo, R. L. Oliver, and A. Rangaswamy, “A simulated annealing methodology
for clusterwise linear regression,” Psychometrika, vol. 54, no. 4, pp. 707–736, 1989.
[96] H. Spath, “Algorithm 39 Clusterwise linear regression,” Computing, vol. 22, no. 4, pp.
367–373, 1979.
[97] G. Ferrari-Trecate, M. Muselli, D. Liberati, and M. Morari, “A clustering technique for
the identification of piecewise affine systems,” Automatica, vol. 39, no. 2, pp. 205–217,
2003.
[98] H. Nakada, K. Takaba, and T. Katayama, “Identification of piecewise affine systems
based on statistical clustering technique,” Automatica, vol. 41, no. 5, pp. 905–913, 2005.
[99] W. S. DeSarbo and W. L. Cron, “A maximum likelihood methodology for clusterwise
linear regression,” J. Classif., vol. 5, no. 2, pp. 249–282, 1988.
[100] H. Spath, “A fast algorithm for clusterwise linear regression,” Computing, vol. 29, no.
2, pp. 175–181, 1982.
[101] C. Hennig, “Models and methods for clusterwise linear regression,” Classif. Inf. Age,
pp. 179–187, 1999.
[102] C. Hennig, “Identifiablity of models for clusterwise linear regression,” J. Classif., vol.

17, no. 2, pp. 273–296, 2000.
[103] M. Wedel and C. Kistemaker, “Consumer benefit segmentation using clusterwise linear
regression,” Int. J. Res. Mark., vol. 6, no. 1, pp. 45–59, 1989.
[104] C. Preda and G. Saporta, “Clusterwise PLS regression on a stochastic process,” Comput.
Stat. Data Anal., vol. 49, no. 1, pp. 99–108, 2005.
[105] V. Cherkassky and Y. Ma, “Multiple model regression estimation,” IEEE Trans. neural
networks, vol. 16, no. 4, pp. 785–798, 2005.
[106] J. C. Bezdek, C. Coray, R. Gunderson, and J. Watson, “Detection and characterization
of cluster substructure i. linear structure: Fuzzy c-lines,” SIAM J. Appl. Math., vol. 40,
no. 2, pp. 339–357, 1981.
[107] F. Dufrenois and D. Hamad, “Fuzzy weighted support vector regression for multiple
linear model estimation: application to object tracking in image sequences,” in Neural
Networks, 2007. IJCNN 2007. International Joint Conference on, 2007, pp. 1289–1294.
[108] N. Elfelly, J.-Y. Dieulot, M. Benrejeb, and P. Borne, “A new approach for multimodel
identification of complex systems based on both neural and fuzzy clustering
algorithms,” Eng. Appl. Artif. Intell., vol. 23, no. 7, pp. 1064–1071, 2010.
[109] B. Pourbabaee, N. Meskin, and K. Khorasani, “Multiple-model based sensor fault
diagnosis using hybrid kalman filter approach for nonlinear gas turbine engines,” in
2013 American Control Conference, 2013, pp. 4717–4723.
[110] J. Ragot, “Diagnosis and control using multiple models. Application to a biological
reactor,” 2011 Int. Symp. Adv. Control Ind. Process., pp. 22–29, 2011.
[111] S. Dasgupta, B. D. O. Anderson, and R. J. Kaye, “Identification of physical parameters
in structured systems,” Automatica, vol. 24, no. 2, pp. 217–225, 1988.
[112] S. Paoletti, A. L. Juloski, G. Ferrari-Trecate, and R. Vidal, “Identification of hybrid
systems a tutorial,” Eur. J. Control, vol. 13, no. 2–3, pp. 242–260, 2007.
[113] R. Vidal and B. D. O. Anderson, “Recursive identification of switched ARX hybrid
models: Exponential convergence and persistence of excitation,” in Decision and
Control, 2004. CDC. 43rd IEEE Conference on, 2004, vol. 1, pp. 32–37.
[114] R. Orjuela, B. Marx, J. Ragot, and D. Maquin, “Nonlinear system identification using heterogeneous multiple models,” Int. J. Appl. Math. Comput. Sci., vol. 23, no. 1, p. 103, 2013.
[115] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Cluster validity methods: part I,” ACM
Sigmod Rec., vol. 31, no. 2, pp. 40–45, 2002.
[116] E. Rendon et al., “A comparison of internal and external cluster validation indexes,” in
Proceedings of the 2011 American Conference, San Francisco, CA, USA, 2011, vol. 29.
[117] Y.-L. Wu, C.-Y. Tang, M.-K. Hor, and P.-F. Wu, “Feature selection using genetic
algorithm and cluster validation,” Expert Syst. Appl., vol. 38, no. 3, pp. 2727–2732,
2011.
[118] D. C. Montgomery and G. C. Runger, Applied statistics and probability for engineers.
John Wiley & Sons, 2010.
[119] M. E. Gegundez, J. Aroba, and J. M. Bravo, “Identification of piecewise affine systems
by means of fuzzy clustering and competitive learning,” Eng. Appl. Artif. Intell., vol. 21,
no. 8, pp. 1321–1329, 2008.

[120] A. Tsanas and A. Xifara, “Accurate quantitative estimation of energy performance of
residential buildings using statistical machine learning tools,” Energy Build., vol. 49, pp.
560–567, 2012.
[121] M. Nikravesh, A. E. Farell, and T. G. Stanford, “Control of nonisothermal CSTR with
time varying parameters via dynamic neural network control (DNNC),” Chem. Eng. J.,
vol. 76, no. 1, pp. 1–16, 2000.
[122] Z. Wang et al., “Clustering by Local Gravitation,” IEEE Trans. Cybern., vol. 48, no. 5,
pp. 1383–1396, 2018.
[123] T. Nguyen and R. F. Savinell, “Flow batteries,” Electrochem. Soc. Interface, vol. 19, no.
3, pp. 54–56, 2010.
[124] P. Leung, X. Li, C. P. De León, L. Berlouis, C. T. J. Low, and F. C. Walsh, “Progress
in redox flow batteries, remaining challenges and their applications in energy storage,”
Rsc Adv., vol. 2, no. 27, pp. 10125–10156, 2012.
[125] G. L. Soloveichik, “Flow Batteries: Current Status and Trends,” Chem. Rev., vol. 115,
no. 20, pp. 11533–11558, Oct. 2015.
[126] S. Er, C. Suh, M. P. Marshak, and A. Aspuru-Guzik, “Computational design of
molecules for an all-quinone redox flow battery,” Chem. Sci., vol. 6, no. 2, pp. 885–893,
2015.
[127] L. Constantinou and R. Gani, “New group contribution method for estimating properties
of pure compounds,” AIChE J., vol. 40, no. 10, pp. 1697–1710, 1994.
[128] C. Gao, R. Govind, and H. H. Tabak, “Application of the group contribution method for
predicting the toxicity of organic chemicals,” Environ. Toxicol. Chem., vol. 11, no. 5,
pp. 631–636, 1992.
[129] K. M. Klincewicz and R. C. Reid, “Estimation of critical properties with group
contribution methods,” AIChE J., vol. 30, no. 1, pp. 137–142, 1984.
[130] K. G. Joback and R. C. Reid, “Estimation of pure-component properties from group-
contributions,” Chem. Eng. Commun., vol. 57, no. 1–6, pp. 233–243, 1987.
[131] E. Conte, A. Martinho, H. A. Matos, and R. Gani, “Combined Group-Contribution and
Atom Connectivity Index-Based Methods for Estimation of Surface Tension and
Viscosity,” Ind. Eng. Chem. Res., vol. 47, no. 20, pp. 7940–7954, Oct. 2008.
[132] J. Marrero and R. Gani, “Group-contribution based estimation of pure component
properties,” Fluid Phase Equilib., vol. 183–184, pp. 183–208, 2001.
[133] J. Marrero and R. Gani, “Group-contribution-based estimation of octanol/water partition
coefficient and aqueous solubility,” Ind. Eng. Chem. Res., vol. 41, no. 25, pp. 6623–
6633, 2002.
[134] A. Correa, J. F. Comesaña, J. M. Correa, and A. M. Sereno, “Measurement and
prediction of water activity in electrolyte solutions by a modified ASOG group
contribution method,” Fluid Phase Equilib., vol. 129, no. 1, pp. 267–283, 1997.
[135] S. J. Patel, D. Ng, and M. S. Mannan, “QSPR Flash Point Prediction of Solvents Using
Topological Indices for Application in Computer Aided Molecular Design,” Ind. Eng.
Chem. Res., vol. 48, no. 15, pp. 7378–7387, Aug. 2009.
[136] A. R. Katritzky, Y. Wang, S. Sild, T. Tamm, and M. Karelson, “QSPR Studies on Vapor
Pressure, Aqueous Solubility, and the Prediction of Water−Air Partition Coefficients,”

J. Chem. Inf. Comput. Sci., vol. 38, no. 4, pp. 720–725, Jul. 1998.
[137] M. Muehlbacher, A. El Kerdawy, C. Kramer, B. Hudson, and T. Clark, “Conformation-
Dependent QSPR Models: logPOW,” J. Chem. Inf. Model., vol. 51, no. 9, pp. 2408–
2416, Sep. 2011.
[138] P. R. Duchowicz and E. A. Castro, “QSPR studies on aqueous solubilities of drug-like
compounds,” Int. J. Mol. Sci., vol. 10, no. 6, pp. 2558–2577, Jun. 2009.
[139] F. Luan, T. Wang, L. Tang, S. Zhang, and M. Cordeiro, “Estimation of the Toxicity of
Different Substituted Aromatic Compounds to the Aquatic Ciliate Tetrahymena
pyriformis by QSAR Approach.,” Molecules, vol. 23, no. 5, 2018.
[140] T. Miyao, H. Kaneko, and K. Funatsu, “Inverse QSPR/QSAR Analysis for Chemical
Structure Generation (from y to x),” J. Chem. Inf. Model., vol. 56, no. 2, pp. 286–299,
Feb. 2016.
[141] L. Xu and W.-J. Zhang, “Comparison of different methods for variable selection,” Anal.
Chim. Acta, vol. 446, no. 1, pp. 475–481, 2001.
[142] D. K. Agrafiotis and W. Cedeño, “Feature Selection for Structure−Activity Correlation
Using Binary Particle Swarms,” J. Med. Chem., vol. 45, no. 5, pp. 1098–1107, Feb.
2002.
[143] S. Yousefinejad, F. Honarasa, and H. Montaseri, “Linear solvent structure-polymer
solubility and solvation energy relationships to study conductive polymer/carbon
nanotube composite solutions,” RSC Adv., vol. 5, no. 53, pp. 42266–42275, 2015.
[144] B. Hemmateenejad, “Optimal QSAR analysis of the carcinogenic activity of drugs by
correlation ranking and genetic algorithm-based PCR,” J. Chemom., vol. 18, no. 11, pp.
475–485, Nov. 2004.
[145] D. J. Livingstone, D. T. Manallack, and I. V Tetko, “Data modelling with neural
networks: advantages and limitations,” J. Comput. Aided. Mol. Des., vol. 11, no. 2, pp.
135–142, 1997.
[146] S. Wang and M. Tanaka, “Nonlinear system identification with piecewise-linear
functions,” IFAC Proc. Vol., vol. 32, no. 2, pp. 3796–3801, 1999.
[147] S. Chinta and R. Rengaswamy, “Machine Learning Derived Quantitative Structure
Property Relationship (QSPR) to Predict Drug Solubility in Binary Solvent Systems,”
Ind. Eng. Chem. Res., vol. 58, no. 8, pp. 3082–3092, Feb. 2019.
[148] C. Liang and D. A. Gallagher, “QSPR Prediction of Vapor Pressure from Solely
Theoretically-Derived Descriptors,” J. Chem. Inf. Comput. Sci., vol. 38, no. 2, pp. 321–
324, Mar. 1998.
[149] G. R. Famini, C. A. Penski, and L. Y. Wilson, “Using theoretical descriptors in
quantitative structure activity relationships: Some physicochemical properties,” J. Phys.
Org. Chem., vol. 5, no. 7, pp. 395–408, Jul. 1992.
[150] M. Jalali-Heravi, M. Asadollahi-Baboli, and P. Shahbazikhah, “QSAR study of
heparanase inhibitors activity using artificial neural networks and Levenberg–Marquardt
algorithm,” Eur. J. Med. Chem., vol. 43, no. 3, pp. 548–556, 2008.
[151] M. Gevrey, I. Dimopoulos, and S. Lek, “Review and comparison of methods to study
the contribution of variables in artificial neural network models,” Ecol. Modell., vol.
160, no. 3, pp. 249–264, 2003.

[152] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp.
273–297, 1995.
[153] S. Amari and S. Wu, “Improving support vector machine classifiers by modifying kernel
functions,” Neural Networks, vol. 12, no. 6, pp. 783–789, 1999.
[154] D. Hunter, H. Yu, I. I. I. M. S. Pukish, J. Kolbusz, and B. M. Wilamowski, “Selection
of Proper Neural Network Sizes and Architectures—A Comparative Study,” IEEE
Trans. Ind. Informatics, vol. 8, no. 2, pp. 228–240, 2012.
[155] S. R. Safavian and D. Landgrebe, “A survey of decision tree classifier methodology,”
IEEE Trans. Syst. Man. Cybern., vol. 21, no. 3, pp. 660–674, 1991.
[156] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[157] J. Sklansky and L. Michelotti, “Locally trained piecewise linear classifiers,” IEEE
Trans. Pattern Anal. Mach. Intell., no. 2, pp. 101–111, 1980.
[158] D. R. Cox, “The regression analysis of binary sequences,” J. R. Stat. Soc. Ser. B, vol.
20, no. 2, pp. 215–232, 1958.
[159] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, “Supervised machine learning: A review
of classification techniques,” Emerg. Artif. Intell. Appl. Comput. Eng., vol. 160, pp. 3–
24, 2007.
[160] L. Nanni and A. Lumini, “An experimental comparison of ensemble of classifiers for
biometric data,” Neurocomputing, vol. 69, no. 13, pp. 1670–1673, 2006.
[161] L. Breiman, “Bagging predictors,” Mach. Learn., vol. 24, no. 2, pp. 123–140, 1996.
[162] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 8, pp. 832–844, 1998.
[163] L. Nanni and A. Lumini, “An experimental comparison of ensemble of classifiers for
bankruptcy prediction and credit scoring,” Expert Syst. Appl., vol. 36, no. 2, Part 2, pp.
3028–3033, 2009.
[164] A. Bharadwaj and S. Minz, “Hybrid Approach for Classification using Support Vector
Machine and Decision Tree,” in Int Conf Advances in Electronics, Electrical and
Computer Science Engineering (EEC 2012), 2012, pp. 337–341.
[165] M. Arun Kumar and M. Gopal, “A hybrid SVM based decision tree,” Pattern Recognit.,
vol. 43, no. 12, pp. 3977–3987, 2010.
[166] S. Chakrabartty, G. Singh, and G. Cauwenberghs, “Hybrid support vector
machine/hidden markov model approach for continuous speech recognition,” in
Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems (Cat. No.
CH37144), 2000, vol. 2, pp. 828–831.
[167] N. Zaini, M. A. Malek, M. Yusoff, N. H. Mardi, and S. Norhisham, “Daily River Flow
Forecasting with Hybrid Support Vector Machine – Particle Swarm Optimization,” IOP
Conf. Ser. Earth Environ. Sci., vol. 140, p. 12035, 2018.
[168] A. Ghodselahi, “A hybrid support vector machine ensemble model for credit scoring,”
Int. J. Comput. Appl., vol. 17, no. 5, pp. 1–5, 2011.
[169] A. Rahman and S. Tasnim, “Ensemble classifiers and their applications: a review,” arXiv
Prepr. arXiv1404.4088, 2014.
[170] A. Kostin, “A simple and fast multi-class piecewise linear pattern classifier,” Pattern

Recognit., vol. 39, no. 11, pp. 1949–1962, 2006.
[171] D. Webb, Efficient piecewise linear classifiers and applications. University of Ballarat,
2011.
[172] G. T. Herman and K. T. D. Yeung, “On piecewise-linear classification,” IEEE Trans.
Pattern Anal. Mach. Intell., no. 7, pp. 782–786, 1992.
[173] H. Tenmoto, M. Kudo, and M. Shimbo, “Piecewise linear classifiers with an appropriate
number of hyperplanes,” Pattern Recognit., vol. 31, no. 11, pp. 1627–1634, 1998.
[174] A. M. Bagirov, J. Ugon, and D. Webb, “An efficient algorithm for the incremental
construction of a piecewise linear classifier,” Inf. Syst., vol. 36, no. 4, pp. 782–790, 2011.
[175] A. Astorino and M. Gaudioso, “Polyhedral separability through successive LP,” J.
Optim. Theory Appl., vol. 112, no. 2, pp. 265–293, 2002.
[176] A. M. Bagirov, “Max–min separability,” Optim. Methods Softw., vol. 20, no. 2–3, pp.
277–296, 2005.
[177] X. Huang, S. Mehrkanoon, and J. A. K. Suykens, “Support vector machines with
piecewise linear feature mapping,” Neurocomputing, vol. 117, pp. 118–127, 2013.
[178] J. A. K. Suykens and J. Vandewalle, “Least Squares Support Vector Machine
Classifiers,” Neural Process. Lett., vol. 9, no. 3, pp. 293–300, 1999.
[179] G. Ou and Y. L. Murphey, “Multi-class pattern classification using neural networks,”
Pattern Recognit., vol. 40, no. 1, pp. 4–18, 2007.
[180] A. Asuncion and D. Newman, “UCI machine learning repository.” 2007.

LIST OF PAPERS BASED ON THESIS

1. Sivadurgaprasad Chinta and Raghunathan Rengaswamy, Machine Learning Derived


Quantitative Structure Property Relationship (QSPR) to Predict Drug Solubility in
Binary Solvent Systems. Ind. Eng. Chem. Res. 2019, 58 (8), 3082–3092.

2. Sivadurgaprasad Chinta, Abhishek Sivaram and Raghunathan Rengaswamy, Prediction


Error-Based Clustering Approach for Multiple-Model Learning Using Statistical
Testing. Eng. Appl. Artif. Intell. 2019, 77, 125–135.

3. Deepak Maurya, Sivadurgaprasad Chinta, Abhishek Sivaram and Raghunathan


Rengaswamy, Incorporating prior knowledge about structural constraints in model
identification. Ind. Eng. Chem. Res. Under review.

4. Sivadurgaprasad Chinta and Raghunathan Rengaswamy, Machine learning based


QSPR approaches to predict solvation free energy of Quinone molecules for flow
battery applications. Manuscript under preparation.

5. Sivadurgaprasad Chinta and Raghunathan Rengaswamy, An adaptive prediction error


based multiple model SVM classifier. Manuscript under preparation.

