
International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 2, Issue 8, August - 2015. ISSN 2348 4853

Predictive Model for Blood-Brain Barrier


1Dhanalakshmi N, 2Dr Asha Gowda Karegowda, 3Radha N.


1,2Siddagana Institute Of Technology, Tumkur,
3IRH Technologies Pvt. Ltd., Bangalore
1dhanalakshmin.35@gmail.com, 2ashagksit@gmail.com

ABSTRACT
In drug discovery and development, it is essential to determine whether a candidate molecule can
penetrate the Blood-Brain Barrier (BBB). The barrier does not allow all molecules into the brain;
only those molecules that have a high concentration with blood cells are allowed through. The
objective of our work is to find which molecules penetrate into the brain. Computational work is
carried out using the R tool on a dataset obtained from a forensic lab in Bengaluru. Among the five
machine learning techniques considered, namely Support Vector Machine (SVM), Neural Network, Random
Forest, Decision Tree and Multiple Linear Regression, the experimental results reveal that SVM gives
better results than the other techniques for regression data, and Decision Tree yields the lowest
error rate for classification data.
Index Terms: Blood-Brain Barrier (BBB), Support Vector Machine, Decision Tree, Random Forest, Multi
Linear Regression, Logistic Regression, Neural Network.

I. INTRODUCTION

The blood-brain barrier (BBB) [1-10] is a highly selective permeability barrier that separates the
circulating blood from the brain extracellular fluid (BECF) in the central nervous system (CNS). The
blood-brain barrier is formed by brain endothelial cells, which are connected by tight junctions with an
extremely high electrical resistivity of at least 0.1 Ω⋅m. The blood-brain barrier allows the passage of
water, some gases, and lipid-soluble molecules by passive diffusion, as well as the selective transport of
molecules such as glucose and amino acids that are crucial to neural function.
Data mining has been used intensively and extensively by many organizations. In healthcare, data
mining is becoming increasingly prevalent, if not increasingly necessary. Data mining applications can
greatly benefit all parties involved in the healthcare industry. For example, data mining can help
healthcare insurers detect fraud and abuse, healthcare organizations make customer relationship
management decisions, physicians identify effective treatments and best practices, and patients receive
improved and more reasonable healthcare services. The huge amounts of data generated by healthcare
transactions are too complex and voluminous to be processed and analyzed by traditional methods. Data
mining provides the methodology and technology to transform these banks of data into useful
information for decision making [11,12, 21].
Data mining on medical data [20] has great potential to improve the treatment quality of hospitals and
increase the survival rate of patients. Medical data mining is one of the crucial means of obtaining
valuable clinical knowledge from medical databases. Early prediction methods have become an apparent
need in many clinical areas. Clinical studies have found early detection and intervention to be vital for
preventing clinical deterioration in patients at general hospitals [13]. The paper is organized as follows.
Section 2 briefly reviews the work related to BBB. Section 3 describes the R tool and the methodologies
adopted for the current work. Results are discussed in Section 4, followed by conclusions and future work
in Section 5.

11 | 2015, IJAFRC All Rights Reserved

www.ijafrc.org

II. RELATED WORK
Scott Doniger et al. used 50 molecules, of which 25 are active and the other 25 inactive, divided
randomly into a training dataset and a test dataset. Two algorithms were implemented, namely Neural
Network and Support Vector Machine. Thirty validation sets were drawn from these 50 molecules. The
results show that the support vector machine outperforms the neural network. It was found that SVM can
predict up to 96% of the molecules correctly, averaging 81.5%, whereas the neural network averages
75.7% [4]. An Artificial Neural Network (ANN) model has been developed to predict the ratios of the
steady-state concentrations of drugs in the brain to those in the blood (logBB) from their molecular
structural parameters [9]. Claudia Suenderhauf et al. took a dataset of 153 compounds, compiled using
the more reliable in vivo BBB permeability-surface area (logPS) products, which are obtained by direct
internal carotid artery perfusion. The open source Chemical Development Kit (CDK) was used to calculate
physico-chemical properties and descriptors. The data was split into two classes, positively (CNSp+) and
negatively (CNSp-) classified molecules, referring to compounds with logPS values >= -2 and <= -3,
respectively. The decision tree induction (DTI) paradigm is an efficient and powerful method that can
solve even linearly inseparable problems. Two widely used paradigms were applied to induce decision
trees: a decision tree built with the chi-squared automatic interaction detector (CHAID) on CDK
descriptors, and a classification and regression tree (CART) based on CDK descriptors [3]. Misha Denil
et al. took a dataset of 179 randomly chosen molecules and computed predictions using the random forest
algorithm. Comparing the experimental values with the theoretical values, they found that the
experimental values gave better results than the theoretical values [18].
III. METHODOLOGY USED
A. R Programming
R is a free and open-source programming language and software environment for statistical computing
and graphics. The R language is widely used among statisticians and data miners for developing
statistical software and for data analysis. Users can access R through a command-line interpreter. The
capabilities of R are extended through user-created packages, which provide specialized statistical
techniques, graphical devices, import/export capabilities, reporting tools, etc. [13-14].
B. Machine Learning Techniques
The following machine learning techniques have been experimented with using the R tool.
Decision Tree - A Decision Tree (DT) represents a set of rules that follows a hierarchy of classes and
values and is used to classify instances. An instance is classified by first testing the attribute
specified by the root and then following the branch that corresponds to the value of that attribute in
the instance. This process is then repeated for the sub-tree rooted at the new node [17,18,21]. The
rpart package must be loaded to build a Decision Tree in the R tool [13-14].
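The root-to-leaf classification procedure described above can be sketched in a few lines of Python. This is only an illustration: the tree, its thresholds and the choice of attributes are hypothetical and are not the tree learned in this work (which was built with rpart in R).

```python
# Hypothetical toy tree: each internal node tests one attribute against a
# threshold, each leaf carries a class label. Thresholds are made up.
def classify(node, instance):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while "label" not in node:
        value = instance[node["attribute"]]
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["label"]

toy_tree = {
    "attribute": "TPSA(NO)",
    "threshold": 60.0,
    "left": {"label": "BBB+"},
    "right": {
        "attribute": "nOHp",
        "threshold": 0.5,
        "left": {"label": "BBB+"},
        "right": {"label": "BBB-"},
    },
}

# Two made-up instances described by the attributes the toy tree tests.
print(classify(toy_tree, {"TPSA(NO)": 35.2, "nOHp": 0}))
print(classify(toy_tree, {"TPSA(NO)": 95.0, "nOHp": 2}))
```

Each loop iteration corresponds to one "test the attribute, follow the branch" step, so the number of iterations equals the depth of the leaf that is reached.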
Decision Tree has the following advantages:
Can be applied to any type of data
The final structure of the classifier is quite simple and can be stored and handled in a graceful
manner
Handles conditional information very efficiently, subdividing the space into sub-spaces that are
handled individually.
Is normally robust and insensitive to misclassification in the training set.

Random Forest - The Random Forest (RF) algorithm is based on decision trees, but instead of having only
one tree it uses a group of decision trees. The algorithm grows many decision trees in order to improve
predictive accuracy. It classifies a case using each tree in the forest and selects a final predicted
outcome by combining the results of all trees using a majority vote [18]. The randomForest package must
be loaded to use Random Forest in the R tool [13-14].
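The majority-vote step can be illustrated with a short Python sketch. The list of per-tree predictions below is made up for the example; in a real forest each entry would come from one trained tree.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees.

    Ties are broken in favour of the class that appears first in the list.
    """
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical predictions of five trees for a single compound.
votes = ["BBB+", "BBB-", "BBB+", "BBB+", "BBB-"]
print(majority_vote(votes))
```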
Features of Random Forests include:
It is unexcelled in accuracy among current algorithms.
It runs efficiently on large databases.
It can handle thousands of input variables without variable deletion.
It gives estimates of which variables are important in the classification.
Neural Network - ANNs have been used extensively in the field of healthcare. A neural network is a
non-linear statistical data modeling technique used for classification tasks. It makes use of
interconnected artificial neurons to process information; the network changes through an iterative
process in which the weights between neurons are successively corrected. Neural networks are highly
sensitive to the data and generally have a reduced ability to extrapolate beyond the range of the input
variables [19-21]. The nnet package must be loaded to use Neural Network in the R tool [13-14].
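As a minimal sketch of what "successively correcting the weights" means, the following Python code trains a single sigmoid neuron by gradient descent. It is not the 13-3-1 network used in this work; the data, learning rate and number of epochs are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    """Output of one sigmoid neuron for a list of input values."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_neuron(samples, epochs=200, rate=0.5):
    """Fit one sigmoid neuron by repeatedly nudging its weights against the error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # Gradient of the cross-entropy loss with respect to the pre-activation.
            error = predict(weights, bias, inputs) - target
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias

# Made-up data: one scaled input, label 1 when the input is small.
toy = [([0.1], 1), ([0.2], 1), ([0.8], 0), ([0.9], 0)]
w, b = train_neuron(toy)
print(predict(w, b, [0.1]), predict(w, b, [0.9]))
```

A full network applies the same kind of correction to every layer via backpropagation; the single neuron is used here only to keep the update rule visible.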
Support Vector Machine - A support vector machine (SVM) searches for support vectors, which are
observations that lie at the edge of a region in space and define the boundary between the classes of
observations. SVM can also be used to classify data that are not linearly separable [21]. The e1071
package must be loaded to use Support Vector Machine in the R tool [13-14].
IV. RESULTS AND DISCUSSION
A. Dataset
The initial dataset contained 1665 molecular descriptors related to concentration with blood cells.
Working with all 1665 molecular descriptors is computationally very complicated. Hence the Weka (Waikato
Environment for Knowledge Analysis) tool is used to select significant features with the CfsSubsetEval
module, followed by F-stepping (leave-one-out). With the CfsSubsetEval module, 77 molecular descriptors
are selected based on the experimental logBB values. These 77 molecular descriptors are further reduced
to 13 descriptors. These 13 molecular descriptors are highly concentrated with blood cells, and molecules
that exhibit these descriptors enter the brain. 135 compounds such as benzene, cyclopropane, aminopyrine,
isoflurane, methane, propranolol, hydroxyzine, nitrous oxide, etc. are selected to form the dataset.
This dataset is read into the R tool. Table 1 describes the 13 molecular descriptors.
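For intuition about correlation-based feature subset selection (the idea behind CfsSubsetEval), the sketch below computes the merit heuristic usually associated with it: a subset scores well when its features correlate strongly with the target and weakly with each other. The function name and the correlation values are assumptions for illustration; this is not Weka's code and does not reproduce the selection performed in this work.

```python
import math

def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff).

    feature_class_corr: correlation of each feature with the target (k values).
    feature_feature_corr: correlations between the feature pairs in the subset.
    """
    k = len(feature_class_corr)
    mean_rcf = sum(abs(r) for r in feature_class_corr) / k
    if feature_feature_corr:
        mean_rff = sum(abs(r) for r in feature_feature_corr) / len(feature_feature_corr)
    else:
        mean_rff = 0.0
    return k * mean_rcf / math.sqrt(k + k * (k - 1) * mean_rff)

# Made-up correlations: a second feature helps only if it is not redundant.
print(cfs_merit([0.6], []))            # single feature
print(cfs_merit([0.6, 0.6], [0.0]))    # two uncorrelated features
print(cfs_merit([0.6, 0.6], [1.0]))    # two fully redundant features
```

With these toy numbers the redundant pair scores no better than one feature alone, which is the reason such a filter can shrink a very large descriptor set.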
The logBB value, i.e. the blood-brain distribution concentration, is computed. Compounds whose
experimental logBB >= 0 are labeled BBB+ and those with logBB < 0 are labeled BBB-. Molecules that
are BBB+ enter the brain, whereas BBB- molecules fail to cross the barrier.
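The labeling rule above can be written as a one-line function. The following Python sketch uses made-up logBB values purely to show the threshold at zero; they are not taken from the dataset.

```python
def bbb_label(log_bb):
    """Label a compound BBB+ when logBB >= 0 and BBB- otherwise."""
    return "BBB+" if log_bb >= 0 else "BBB-"

# Illustrative values only.
for value in (0.37, 0.0, -1.2):
    print(value, bbb_label(value))
```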

The dataset is used for both a regression model and a classification model: regression analysis deals
mainly with continuous data, and classification analysis deals mainly with discrete data.
The performance measure used for the regression model [9-10] is R-squared, for the techniques Decision
Tree, Random Forest, Neural Network, Support Vector Machine and Multi Linear Regression.
R-squared as the square of the correlation - The term "R-squared" is derived from this definition.
R-squared is the square of the correlation between the model's predicted values and the actual values.
This correlation can range from -1 to 1, and so the square of the correlation ranges from 0 to 1. The
greater the magnitude of the correlation between the predicted values and the actual values, the greater
the R-squared, regardless of whether the correlation is positive or negative.
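The definition above can be checked numerically. The sketch below computes the Pearson correlation between predicted and observed values and squares it; the function names and the sample numbers are illustrative assumptions, not values from this study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def r_squared(predicted, observed):
    """R-squared defined as the square of the correlation."""
    return pearson_r(predicted, observed) ** 2

# Made-up values: a perfectly negative correlation still yields R-squared of 1.
print(r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(r_squared([1.0, 2.0, 3.0], [6.0, 4.0, 2.0]))
```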
Figures 1 to 5 show the R-squared plots of the regression model for Decision Tree, Random Forest, Neural
Network, Support Vector Machine and Multi Linear Regression respectively. In Figures 1-5 the blue line
indicates the best fit to the true points and the black line along the diagonal represents perfect
correlation. The R-squared values for the regression data using Decision Tree, Random Forest, Neural
Network, Support Vector Machine and Multi Linear Regression are 0.4591, 0.7388, 0.7723, 0.8845 and 0.7676
respectively, as shown in Figures 1-5. Among the five techniques, SVM gave the best R-squared value for
the regression model.

The performance measure used for the classification model is the overall error rate, for the techniques
Decision Tree, Random Forest, Neural Network and Support Vector Machine. Figures 6-9 give the overall
error rate measure of the classification model for Decision Tree, Random Forest, Neural Network and
Support Vector Machine respectively. The Decision Tree constructed for the classification data is shown
in Figure 6. It shows that TPSA(NO), R1e+ and Mor04m are significant molecular descriptors. The left
sub-plot of Figure 7 shows the conditional variable importance, calculated by randomly shuffling the
values of a given predictor. The difference in model accuracy before and after the random permutation,
averaged over all trees in the forest, then tells us how important that predictor is for determining the
outcome. For the right sub-plot of Figure 7, experiments were conducted using 100 trees, with 2 variables
tried at each split. The final measure of importance is the total decrease in a decision tree node's
impurity (the splitting criterion) when splitting on a variable. The splitting criterion used is the Gini
index. This is measured for a variable over all trees, giving the mean decrease in the Gini index of
diversity relating to the variable. Based on this experiment, the left sub-plot of Figure 7 indicates
that TPSA(NO), R1e+ and nOHp are the top 3 significant molecular descriptors, whereas the right sub-plot
of Figure 7 indicates that TPSA(NO), Mor04m and MATS5m are the top 3 significant molecular descriptors.
A neural network of structure (13-3-1) is shown in Figure 8. Figure 9 shows the outcome of the Support
Vector Machine for the classification data, where circles represent BBB+ training compounds, dark circles
represent BBB+ test compounds, triangles represent BBB- training compounds and dark triangles represent
BBB- test compounds; the BBB+ molecules penetrate into the brain.
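The Gini-based importance described above rests on the decrease in node impurity produced by a split. The following Python sketch computes the Gini index of a node and the decrease obtained by one split; the label lists are made up, and a random forest would average such decreases for each variable over all trees.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical node with four BBB+ and four BBB- compounds, split perfectly.
parent = ["BBB+"] * 4 + ["BBB-"] * 4
print(gini(parent))
print(gini_decrease(parent, ["BBB+"] * 4, ["BBB-"] * 4))
```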
Figure 10 compares the techniques Decision Tree, Random Forest, Neural Network, Support Vector Machine
and Multi Linear Regression on the regression model. SVM gives the best fit for the regression data.
Figure 11 compares the classifiers Decision Tree, Random Forest, Neural Network and Support Vector
Machine on the classification model. Decision Tree yields the lowest error rate for the classification
data.
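The overall error rate used to compare the classification models is the fraction of compounds whose predicted class differs from the actual class. A minimal Python sketch, with made-up labels rather than results from this study:

```python
def overall_error_rate(predicted, actual):
    """Fraction of instances whose predicted class differs from the actual class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

# Hypothetical predictions for four compounds; one is misclassified.
predicted = ["BBB+", "BBB-", "BBB+", "BBB-"]
actual = ["BBB+", "BBB-", "BBB-", "BBB-"]
print(overall_error_rate(predicted, actual))
```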


Fig 1. Predicted Vs Observed logBB value using Decision Tree model for regression data

Fig 2. Predicted Vs Observed logBB value using Random Forest model for regression data


Fig 3. Predicted Vs Observed logBB value using Neural Network model for regression data

Fig 4. Predicted Vs Observed logBB value using Support Vector Machine model for regression data


Fig 5. Predicted Vs Observed logBB value using Multi Linear Regression model for regression data

Fig 6. Decision Tree for classification data


Fig 7. Random Forest model for classification data

Fig 8. Neural Network model for classification data


Fig 9. Outcome of Support Vector Machine model for classification data

Fig 10. Comparison of classifiers for regression data


Fig 11. Comparison of classifiers for classification data


Table 1. Description of 13 molecular descriptors

Sl no   Name        Description
1       No          Number of Oxygen atoms
2       BIC1        Bond Information Content index (neighborhood symmetry of 1-order)
3       MATS5m      Moran autocorrelation of lag 5 weighted by mass
4       MATS5v      Moran autocorrelation of lag 5 weighted by van der Waals volume
5       Mor04m      signal 04 / weighted by mass
6       R1e+        R maximal autocorrelation of lag 1 / weighted by Sanderson electronegativity
7       nArNR2      number of tertiary amines (aromatic)
8       nOHp        number of primary alcohols
9       C-028       R--CRX
10      C-034       R--CR..X
11      H-051       H attached to alpha-C
12      O-057       phenol / enol / carboxyl OH
13      TPSA(NO)    Topological polar surface area using N,O polar contributions

V. CONCLUSIONS AND FUTURE WORK


The blood-brain barrier does not allow all molecules into the brain; only those molecules that have a
high concentration with blood cells are allowed through. Earlier studies used manually selected
descriptors for prediction. The objective of our work is to find which molecules penetrate into the
brain. The Weka tool has been used to find the 13 significant descriptors out of 1665 descriptors. These
descriptors are highly correlated with the logBB property. Experiments have been conducted using
Decision Tree, AdaBoost, Random Forest, SVM, and Neural Networks for regression data and
classification data. This work is of utmost importance for pharmacy departments in finding which
compounds penetrate into the brain based on the 13 significant molecular descriptors. Experiments have
been conducted with 137 compounds; in future we would like to work with 150 more compounds, other than
the 137 compounds used for the current work. In addition, as part of further work, the authors would like
to explore computational work using more data mining techniques such as KNN, Naive Bayes and Bayesian
classifiers, and ensemble learning methods such as stacking, voting, grading and bagging. Furthermore,
the authors would like to adopt various bio-inspired optimization techniques for significant feature
selection, which would not only improve the performance of the classifiers but also reduce the
computation time.
VI. REFERENCES
[1] Xingrong Liu, Meihua Tu, Rebecca S. Kelly, Cuiping Chen, Bill J. Smith, Development of a
Computational Approach to Predict Blood-Brain Barrier Permeability, ASPET Journals (The
American Society for Pharmacology and Experimental Therapeutics), Vol. 32(1), pp. 132-139, 2014.

[2] William M Pardridge, Blood-brain barrier biology and methodology, Journal of NeuroVirology,
Vol. 5, pp. 556-569, 1999.

[3] Claudia Suenderhauf, Felix Hammann and Jorg Huwyler, Computational Prediction of Blood-Brain
Barrier Permeability Using Decision Tree Induction, Molecules, Vol. 17, pp. 10429-10445, 2015.

[4] Scott Doniger, Thomas Hofmann and Joanne Yeh, Predicting CNS Permeability of Drug Molecules:
Comparison of Neural Network and Support Vector Machine Algorithms, Journal of Computational
Biology, Vol. 9(6), pp. 849-864, 2002.

[5] Pardridge, W., CNS drug design based on principles of blood-brain barrier transport, J.
Neurochemistry, Vol. 70(5), pp. 1781-1792, 1998.

[6] Norinder U and Haeberlein M, Computational approaches to the prediction of the blood-brain
distribution, Adv Drug Deliv Rev, Vol. 54, pp. 291-313, 2002.

[7] Platts JA, Abraham MH, Zhao YH, Hersey A, Ijaz L, and Butina D, Correlation and prediction of a
large blood-brain distribution data set: an LFER study, Eur J Med Chem, Vol. 36, pp. 719-730, 2001.

[8] Thomas Hofmann, Joanne Yeh, Predicting CNS using support vector machine algorithm, J. Comput.
Biol, Vol. 10, pp. 549-558, 2002.

[9] Prabha Garg and Jitender Verma, In Silico Prediction of Blood Brain Barrier Permeability: An
Artificial Neural Network Model, J. Chem. Inf. Model, Vol. 46, pp. 289-297, 2006.

[10] Keseru, G.M, A neural network based virtual high throughput screening test for the prediction of
CNS activity, Comb. Chem. High Throughput Screen, Vol. 3, pp. 535-540, 2000.

[11] Milley, A., Healthcare and data mining, Health Management Technology, Vol. 21(8), pp. 44-47,
2000.

[12] Yi Mao, Yixin Chen, Gregory Hackmann, Minmin Chen, Chenyang Lu, Marin Kollef, Thomas C. Bailey,
Early Deterioration Warning for Hospitalized Patients by Mining Clinical Data, International
Journal of Knowledge Discovery in Bioinformatics, Vol. 2(3), pp. 1-20, 2011.

[13] Yanchang Zhao, Yonghua Cen, Data Mining Applications with R, Academic Press, 2013.

[14] Garrett Grolemund, Hands-On Programming with R: Write Your Own Functions and Simulations,
Shroff/O'Reilly publications, 2014.

[15] Freese, Jeremy and J. Scott Long, Regression Models for Categorical Dependent Variables Using
Stata, Stata Press, 2006.

[16] Long, J. Scott, Regression Models for Categorical and Limited Dependent Variables, Sage
Publications, 1997.

[17] Quinlan, J. R, Induction of decision trees, Machine Learning, Vol. 1(1), pp. 81-106, 1986.

[18] Breiman, L, Misha, Random forests, Machine Learning, Vol. 45(1), pp. 34-39, 2001.

[19] S. Haykin, Neural Networks - A comprehensive foundation, Macmillan Press, New York, 1994.

[20] Siri Krishan Wasan, Vasudha Bhatnagar and Harleen Kaur, The Impact of Data Mining
Techniques on Medical Diagnostics, Data Science Journal, Vol. 5, 2006.

[21] J. Han and M. Kamber, Data Mining: Concepts and Techniques, San Francisco, Morgan Kaufmann
Publishers, 2001.