
molecules

Article
High-Performance Prediction of Human Estrogen
Receptor Agonists Based on Chemical Structures
Yuki Asako and Yoshihiro Uesawa *
Department of Clinical Pharmaceutics, Meiji Pharmaceutical University, 2-522-1 Noshio, Kiyose,
Tokyo 204-8588, Japan; y121007@std.my-pharm.ac.jp
* Correspondence: uesawa@my-pharm.ac.jp; Tel.: +81-42-495-8892

Academic Editor: Quan Zou


Received: 16 March 2017; Accepted: 19 April 2017; Published: 23 April 2017

Abstract: Many agonists for the estrogen receptor are known to disrupt endocrine functioning.
We have developed a computational model that predicts agonists for the estrogen receptor
ligand-binding domain in an assay system. Our model was entered into the Tox21 Data Challenge
2014, a computational toxicology competition organized by the National Center for Advancing
Translational Sciences. This competition aims to find high-performance predictive models for various
adverse-outcome pathways, including the estrogen receptor. Our predictive model, which is based
on the random forest method, delivered the best performance in its competition category. In the
current study, the predictive performance of the random forest models was improved by strictly
adjusting the hyperparameters to avoid overfitting. The random forest models were optimized from
4000 descriptors simultaneously applied to 10,000 activity assay results for the estrogen receptor
ligand-binding domain, which have been measured and compiled by Tox21. Owing to the correlation
between our model’s and the challenge’s results, we consider that our model currently possesses
the highest predictive power on agonist activity of the estrogen receptor ligand-binding domain.
Furthermore, analysis of the optimized model revealed some important features of the agonists,
such as the number of hydroxyl groups in the molecules.

Keywords: machine learning; random forest; estrogen receptor; Tox21 data challenge 2014; QSAR
prediction model

1. Introduction
Estrogen receptors (ER) belong to the steroid receptor superfamily of ligand-dependent
transcription factors [1]. Compounds related to ER activation, such as isoflavones and polycyclic
aromatic hydrocarbons, disrupt the endocrine processes in humans and other species, severely affecting
reproduction and growth [2,3]. Therefore, screening for ER agonists can counteract environmental
contaminants and improve public health. Although ultra-high-throughput screening systems have
been developed for several adverse-outcome pathways [4], experimental in vitro detection is limited
by the vast number of screening targets. Consequently, a comprehensive assay is precluded by both
economics and time. In contrast, predictive methods based on chemical structures can greatly accelerate
the estimation and are expected to replace wet experimental systems.
The National Institutes of Health (NIH), Environmental Protection Agency, and Food and Drug
Administration have collaborated to launch the Tox21 challenge, a large project targeting a variety
of toxicity problems, including the environmental effects of ER agonists [5,6]. The Tox21 project has
assayed the activities of the compounds in the Tox21 10 K library [4], which contains 10,000 chemicals
for toxicity estimation. The adverse-outcome pathways selected for toxicity evaluations are the
androgen receptor (AR), aryl hydrocarbon receptor (AhR), estrogen receptor (ER), aromatase,
peroxisome proliferator-activated receptor (PPAR), nuclear factor (erythroid-derived-2)-like

Molecules 2017, 22, 675; doi:10.3390/molecules22040675 www.mdpi.com/journal/molecules



2/antioxidant responsive element (ARE), ATP-ase family AAA domain containing 5 (ATAD5), heat
shock factor response element (HSE), mitochondrial membrane potential (MMP), and p53 [7].
One aim of the Tox21 project is to construct a predictive system based on computational toxicology.
The Tox21 Data Challenge 2014 was organized by NIH’s National Center for Advancing Translational
Sciences as a “crowd-sourcing” search for high-performance predictive models of adverse-outcome
pathways. Using computational toxicology technologies, participants competed on the accuracy of
models predicting the toxic biological responses of compounds in the Tox21 10K compound library
for the abovementioned adverse-outcome pathways, with compounds of known activity serving as the training set [8].
Dr. Yoshihiro Uesawa, a winner of the 2014 challenge and a coauthor of the present study, constructed
a predictive model for the ER-ligand binding domain (ER-LBD), which showed the best performance
among the models submitted by the registered teams [9]. Separately, a large-scale
modeling project, the Collaborative Estrogen Receptor Activity Prediction Project (CERAPP), brought
many groups together to build and evaluate ER QSAR models [10]. Multiple models were developed through
collaboration between 17 international research groups with a common training set of 1677 chemicals,
resulting in 40 prediction models for binding, agonist, and antagonist ER activity. External validation
was performed with 7522 chemicals from the literature. This project demonstrated that using a
consensus of different models with a large-scale data set improves prediction
ability. In particular, the consensus model reached a balanced accuracy of >0.9 (using high-quality data).
According to the comprehensive review of Chou’s five-step rule [11], which has been implemented
in various publications [12–16], the following rules are useful for developing a statistical predictor for a
biological system: (i) construct/select a valid dataset for training and testing the predictor; (ii) translate
biological sequences into numerical descriptors that truly reflect the effectiveness of target classes;
(iii) select/develop an intelligent operational algorithm; (iv) correctly perform cross-validation tests
that objectively evaluate the expected outcomes of the predictor; and (v) construct a web predictor for
the model that is accessible to the public.
Uesawa’s ER model was based on a machine learning method called random forest (RF). RF is
an ensemble learning method that decides the best predictive result by majority rule of the various
predictive results gained from many decision trees [17–19]. Each tree is constructed from bootstrapped
data of the training set. This method achieves high predictive performance at low computational cost,
even when the dataset is large or biased. It also assesses the importance of the variables used
in the construction. In the previous study, we confirmed that rigorous selection from many kinds of
RF models significantly improved the performance of RF. However, we also observed an overfitting
tendency [9]. In this study, we attempt to improve the predictive performance of the previous RF
construction by a novel RF optimization technique.
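
The majority-rule idea described above can be illustrated with a minimal sketch. It is not the JMP Pro implementation used in this study: each "tree" here is a one-split stump on a randomly chosen feature, and all function names are hypothetical. The sketch only shows how bootstrapped learners are combined by voting.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap replicate)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows, n_features, rng):
    """Fit a one-split 'tree' on a random feature (toy stand-in for a full decision tree)."""
    feature = rng.randrange(n_features)
    values = sorted(x[feature] for x, _ in rows)
    threshold = values[len(values) // 2]  # split at the median of the sampled values
    left = [y for x, y in rows if x[feature] <= threshold]
    right = [y for x, y in rows if x[feature] > threshold]
    # Each side predicts its majority class; an empty side falls back to the overall majority.
    overall = Counter(y for _, y in rows).most_common(1)[0][0]
    left_label = Counter(left).most_common(1)[0][0] if left else overall
    right_label = Counter(right).most_common(1)[0][0] if right else overall
    return feature, threshold, left_label, right_label


def predict_stump(stump, x):
    """Return the class label predicted by a single stump."""
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def fit_forest(rows, n_trees=25, seed=0):
    """Fit n_trees stumps, each on its own bootstrap replicate of the training rows."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    return [fit_stump(bootstrap_sample(rows, rng), n_features, rng) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]
```

Here `rows` is a list of `(feature_tuple, label)` pairs. A production forest would grow deeper trees and choose splits by an impurity criterion; the voting step is the same.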

2. Methods

2.1. Conformations and Descriptors


Various descriptors were calculated after optimizing the three-dimensional structures of the
compounds, as implemented in the Tox21 Data Challenge 2014 [9]. Briefly, the SD file
containing the chemical structure and ER-LBD assay result (active or inactive) of each
compound was downloaded from a homepage dedicated to the competition [8]. The three-dimensional
conformations of the chemical structures in two configurations (a charged form at neutral pH and an
uncharged form) were calculated in the Molecular Operating Environment (MOE) 2013.08 (Chemical
Computing Group Inc., Montréal, QC, Canada) [20]. Finally, 4071 different molecular descriptors
were generated by MOE, MarvinView 6.0.0 (ChemAxon Kft., Budapest, Hungary) [21], and Dragon 6
(Talete srl., Milano, Italy).

2.2. Construction of Predictive Models


In the previous study [9], the original training set of 8733 chemicals and final evaluation set of
599 chemicals were downloaded from the homepage of the Tox21 Data Challenge website [8]. The same
data were used in the present study. The original training set was randomly divided into a training set
(50%) for constructing the predictive model and a test set (50%) for model validation. The selection
processes in the predictive models were constructed from the training set (50%). In our basic method,
we applied the bootstrap-forest function in JMP Pro 12.0.1 (SAS Institute Inc., Cary, NC, USA) as the
RF algorithm [22]. Finally, the compounds in the final evaluation set were predicted by the model.
The evaluation indices were the predictive ability of each model in the evaluation step, and the area
under the receiver operating characteristic curve (ROC_AUC Evaluation) (see Figure 1).
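
As a rough illustration of the data handling described here, the sketch below performs a random 50/50 division and computes a ROC AUC from predicted scores. The function names are hypothetical, and the AUC is computed by direct pairwise comparison (the probability that a randomly chosen active compound is scored above a randomly chosen inactive one), which may differ in implementation from the JMP Pro routine used in the study.

```python
import random


def split_half(items, seed=0):
    """Randomly divide items into two halves (training 50% / test 50%)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of actives (1) and inactives (0); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With 8733 compounds, `split_half` yields subsets of 4366 and 4367 items. The pairwise AUC is quadratic in the number of compounds, which is acceptable at this scale; a rank-based formula would be preferable for much larger sets.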

Figure 1. Scheme of the model construction.


Different methods in statistical prediction, such as the n-fold cross-validation test, sub-sampling
test, independent dataset test, and jackknife cross-validation test, have been adopted for evaluating
the performance of a prediction model [23–26]. As the jackknife test can lead to unique results [27,28],
it has been widely used in bioinformatics [25,29–34]. In this study, for saving computational time,
the independent data test was used to investigate the performance of the prediction model.

2.3. Effects of Descriptors

MOE computed 369 charged and 369 uncharged forms of each molecular descriptor (738 descriptors
in total). For each descriptor group (charged, uncharged, and total), we constructed 100 RF models and
compared their average ROC-AUC evaluation values. From the results, we estimated the contributions
of the charged and uncharged forms in the predictive performance (Figure 2).
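
The comparison of average ROC-AUC values across repeated models can be sketched as follows. This is an illustrative helper, not the JMP Pro calculation: it assumes a normal approximation for the 95% confidence interval of the mean, which is reasonable for groups of about 100 models, and the function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def mean_ci95(values):
    """Return (mean, lower, upper) for a 95% confidence interval of the mean.

    Assumes a normal approximation; a t-based interval would be slightly wider
    for small samples.
    """
    m = mean(values)
    half_width = NormalDist().inv_cdf(0.975) * stdev(values) / len(values) ** 0.5
    return m, m - half_width, m + half_width
```

Applying `mean_ci95` to the ROC_AUC values of each descriptor group gives the kind of average-with-interval summary drawn in the group comparison plots.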

Figure 2. Charged and uncharged forms. 100 random forest (RF) models were constructed for the
charged, uncharged, and both forms of each descriptor. All models were involved in predicting the
activities of the estrogen receptor ligand-binding domain for the compounds in the final evaluation
set. 100 ROC_AUC values were plotted for each group. Green lines denote the averages and their 95%
confidence intervals.

Number of Descriptors

To ascertain whether the performance of our model would be improved by adding more descriptors,
we constructed 100 RF models of the 4071 descriptors calculated by MOE, MarvinView, and Dragon,
and compared their predictive abilities with those of RF models constructed from MOE descriptors
(738 kinds) alone (Figure 3).

Figure 3. Number of descriptors. 100 RF models were constructed for both numbers of descriptors.
All models were involved in predicting the activities of estrogen receptor ligand-binding domain for
compounds in the final evaluation set. 100 ROC_AUC values were plotted for each group. Green lines
denote the averages and their 95% confidence intervals.

2.4. Effects of Hyperparameters

To find the best combination of hyperparameters in the RF construction, we scanned two
hyperparameters under the following conditions (10 iterations per condition): Number of Terms (the
number of columns considered as splitting candidates at each split (range 1–1000)) and Maximum
Splits per Tree (the maximum number of splits for each tree (range 2–400)). For each Number of
Terms we constructed 190 models. The ROC_AUC values of all constructed models were then
estimated on the compounds in the test set (50%) and evaluation set (Figure 4).
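
The structure of such a scan can be sketched as a simple enumeration. The text gives only the ranges, the 10 iterations per condition, and the 190 models per Number of Terms, which is consistent with 19 settings of Maximum Splits per Tree. The specific grid values below are hypothetical placeholders chosen to match those counts, not the values used in the study.

```python
from itertools import product

# Hypothetical grid values: only the ranges and model counts come from the text.
NUMBER_OF_TERMS = [1, 10, 100, 500, 1000]  # assumed values within 1-1000
MAX_SPLITS_PER_TREE = [2, 3, 4, 5, 6, 8, 10, 15, 20, 30,
                       40, 60, 80, 100, 150, 200, 250, 300, 400]  # 19 assumed values within 2-400
ITERATIONS = 10  # iterations per condition, as stated in the text


def scan_conditions(terms, splits, iterations):
    """Enumerate every (Number of Terms, Maximum Splits per Tree, iteration) run."""
    return [
        {"number_of_terms": t, "max_splits_per_tree": s, "iteration": i}
        for t, s in product(terms, splits)
        for i in range(iterations)
    ]


runs = scan_conditions(NUMBER_OF_TERMS, MAX_SPLITS_PER_TREE, ITERATIONS)
per_terms = sum(1 for run in runs if run["number_of_terms"] == 1000)
print(per_terms, len(runs))
```

Each entry in `runs` would correspond to one RF model to be fitted on the training set (50%) and scored on the test set (50%) and the evaluation set.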

Figure 4. Relationship between ROC_AUC values in models constructed from the test set (50%) and
the final evaluation set. Each point denotes the performance of the model. This figure is referred
from [9].
In Figure 4, a total of 950 RF models are shown. These models are part of the modeling results.
We generated 100 more models with a better combination of the hyperparameters, and a final model
was selected from all the models based on the ROC_AUC values. This strategy for RF model
construction was reported as “Rigorous Selection,” which indicated an excellent performance in the
Tox21 Data Challenge 2014 [9].

Furthermore, the 190 ROC_AUC Evaluation values obtained for each Number of Terms were
compared (Figure 5). Finally, for the models with Number of Terms = 1000, we generated scatter plots
between Maximum Splits per Tree and the ROC_AUCs in the training set (50%) and evaluation set
(Figure 6).

Figure 5. Effects of the hyperparameter Number of Terms on the RF modeling. 190 RF models were
constructed in each group, and all models were then involved in predicting the activities of the
estrogen receptor ligand-binding domain for compounds in the final evaluation set. Plotted are the
ROC_AUC values for the final evaluation set in each group. Green lines denote the averages and their
95% confidence intervals.

Figure 6. Effects of the hyperparameter Maximum Splits per Tree on the RF modeling. ROC_AUC
values of the training set (50%) and final evaluation set are plotted in closed and open circles,
respectively. Large Maximum Splits per Tree introduced model overfitting. The predictive ability was
optimized for Maximum Splits per Tree = 6.

2.5. Statistical Treatment


Significant differences in the means were tested by an unpaired Student’s t-test and one-way
ANOVA, followed by least-significant difference analysis. All analyses, including the ROC_AUC
calculations, were performed in JMP Pro. The significance level was set to p < 0.05.
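
For readers who want to reproduce the two-group comparison outside JMP Pro, the unpaired Student's t statistic can be sketched as below. This is a minimal illustration with a hypothetical function name. It uses the pooled-variance form and returns only the statistic and degrees of freedom, since the p-value lookup requires a t distribution that the Python standard library does not provide.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Unpaired Student's t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2
```

The resulting statistic would then be compared against the critical value for the chosen significance level (p < 0.05 here) at the returned degrees of freedom.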

3. Results and Discussion

3.1. Effects of Descriptors


Integrated descriptors from both charged and uncharged forms improved the predictive ability of
the RF models, relative to descriptors from either form alone (Figure 2). Varying the charge conditions
increased the diversity of the information in the RF models. Increasing the number of descriptors
from 738 (calculated by MOE alone) to 4071 (calculated by three software programs, MOE, Dragon
and MarvinView) also improved the predictive ability of the RF models (Figure 3). On the other hand,
feature selections based on the known importance of each descriptor during the RF-modeling failed to
improve the model performance (data not shown). These observations suggest that a large number of
descriptors are advantageous for our models’ performance.

3.2. Effects of Hyperparameters


The AUC values in the test set (50%) and the final evaluation set were not simply correlated;
rather, there was an optimal point at which the AUC of the test set (50%) corresponded to the highest
AUC of the evaluation model (Figure 4). This point was emphasized in our previous paper [9]. In the
competition, model selection was optimized using the AUC values in the test set (50%) because the
AUCs between the test set (50%) and the training set (50%) showed good linear correlation [9]. Next,
we analyzed the true performance of our models on the final evaluation set [9] (see Figure 4).
All further investigations in the current study were performed on the final evaluation set. In a
scanning search for the Number of Terms hyperparameter, the best predictive model was obtained at
Number of Terms = 1000 (Figure 5). A scanning search for the predictive capabilities of Maximum
Splits per Tree under restricted conditions, with Number of Terms = 1000, was also performed
(Figure 6). Figure 4 shows the scatter plot of ROC_AUC values in models with different combinations
of hyperparameters such as Number of Terms and Maximum Splits per Tree. The best performance
was among models with Number of Terms = 1000 (red points). However, the red points include different
values of Maximum Splits per Tree between 2 and 400. Therefore, the data of the red points were
selected and re-plotted in Figure 6 with Maximum Splits per Tree on the horizontal axis to display the
relation between the prediction performance and the hyperparameter.
RF models with high Maximum Splits per Tree were found to overfit the training set. The optimal
predictive ability of the models was obtained for Maximum Splits per Tree = 6 (Figure 6). We inferred
that overfitting can be regulated by lowering Maximum Splits per Tree toward its optimal point;
that is, by restricting the tree growth.

3.3. Discrimination Potential of Improved Models


Figure 7 shows the ROC curves that evaluate the discrimination capacities of the new predictive
model and the best model in the Tox21 Data Challenge 2014. The ROC-AUC values for the compounds
with ER-LBD activities in the final evaluation test set were 86.6% and 82.7% in the present and previous
models, respectively. In addition, the ROC between sensitivity and 1-specificity was better balanced
in the present model than in the previous model. Overall, the predictive performance of the present
model surpasses that of the previous model.

Figure 7. ROC curves for predicting ER-LBD-activating compounds with the newly proposed model
(left) and the best model of the Tox21 Data Challenge 2014. ROC-AUCs and hyperparameter values in
the models are also described.

3.4. Most Important Descriptors

The importance of the descriptors used in constructing the present model was calculated from
their split rankings (count of each descriptor usage when partitioning the decision trees in the RF
modeling) and G2 values (chi-squared values of the likelihood-ratios) [35]. The descriptor rankings in
100 RF models under the optimized hyperparameter conditions are listed in Table 1. The top-ranking
ER-LBD activator was SpMin 1-Bh(m). The topological shape, nArOH and number of OH groups
bound to the aromatic rings might contribute to the activity of this compound in the ER-LBD
assay [36].
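
The split-count part of this ranking can be illustrated with a short sketch. It assumes that each fitted tree can be reduced to the list of descriptor names used at its splits; that representation and the function name are hypothetical, and the G2 ranking is not reproduced here.

```python
from collections import Counter


def split_ranking(forest_splits):
    """Rank descriptors by how often they are used to split, summed over all trees.

    forest_splits: one list of descriptor names per tree (assumed representation).
    Returns (rank, descriptor, count) tuples; ties are broken alphabetically.
    """
    counts = Counter(name for tree in forest_splits for name in tree)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, name, n) for rank, (name, n) in enumerate(ordered, start=1)]
```

For example, three toy trees that split on `SpMin1_Bh(m)` three times, `nArOH` twice, and `O-057` once would rank those descriptors first, second, and third, respectively.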

Table 1. Most important descriptors. Listed are the top 10 ranked descriptors in the RF modeling,
determined from the split and G2 rankings.

| Descriptor | Meaning | Software | State | Split | G2 | Split Ranking | G2 Ranking |
|---|---|---|---|---|---|---|---|
| SpMin1_Bh(m) | smallest eigenvalue n. 1 of Burden matrix weighted by mass | Dragon | Uncharged | 28.1 | 38.5 | 1 | 1 |
| SpMin1_Bh(m) | smallest eigenvalue n. 1 of Burden matrix weighted by mass | Dragon | Charged | 17.2 | 21.1 | 2 | 2 |
| SpMin1_Bh(s) | smallest eigenvalue n. 1 of Burden matrix weighted by I-state | Dragon | Uncharged | 9.3 | 11.4 | 6 | 3 |
| SpMin1_Bh(i) | smallest eigenvalue n. 1 of Burden matrix weighted by ionization potential | Dragon | Uncharged | 5.0 | 5.7 | 13 | 9 |
| nArOH | number of aromatic hydroxyls | Dragon | Charged | 17.0 | 10.5 | 3 | 4 |
| nArOH | number of aromatic hydroxyls | Dragon | Uncharged | 13.3 | 7.7 | 4 | 6 |
| O-057 | phenol / enol / carboxyl OH | Dragon | Charged | 13.1 | 7.8 | 5 | 5 |
| Chi_Dt | Randic-like index from detour matrix | Dragon | Charged | 5.8 | 6.1 | 8 | 7 |
| CATS2D_03_LL | CATS2D Lipophilic-Lipophilic at lag 03 | Dragon | Charged | 4.8 | 6.0 | 14 | 8 |
| CATS2D_05_LL | CATS2D Lipophilic-Lipophilic at lag 05 | Dragon | Charged | 5.6 | 2.7 | 9 | 19 |
| logd(pH = 5.5) | Lipophilicity under pH = 5.5 condition | Marvin | - | 5.5 | 5.7 | 10 | 10 |
| vsurf_HB7 | H-bond donor capacity 7 | MOE | Charged | 6.5 | 3.2 | 7 | 17 |

4. Conclusions
We constructed a new predictive model with higher discrimination ability for estrogenic compounds
than our previous best model, which was submitted to the Tox21 Data Challenge 2014. This means that
we succeeded in delivering excellent predictive power for estrogenic compounds. Correlating the
results of the ER-LBD sub-challenge in Tox21 data challenge 2014 with our results, we believe that the
current model, which uses a simple model based on ROC_AUC as an evaluation criterion, possesses
the best prediction ability for ER-LBD agonist activity. The best conditions for the RF modeling
regulated the overfitting of the test set (50%). This regulation was found to be important in the RF
model constructions. Furthermore, the model analyses revealed that in compounds such as SpMin
1-Bh(m), interactions with ER were facilitated by the structural and physicochemical properties of
the compounds and the number of phenolic OH groups. This modeling approach will be useful for
predicting the toxicity of compounds and producing new drugs and chemicals. As user-friendly and
publicly accessible web servers represent the future direction for developing practical and more useful models,
we shall work in the near future on providing a web server for the method presented in this paper.

Acknowledgments: This work was partially supported by KAKENHI from the Japan Society for the Promotion
of Science (JSPS) (15K08111) and the Long-Range Research Initiative (LRI) research program from the Japan
Chemical Industry Association (JCIA).
Author Contributions: Y.A. carried out this research project under the supervision of Y.U. All authors read and
approved the final manuscript.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Katzenellenbogen, B.S.; Montano, M.M.; Ediger, T.R.; Sun, J.; Ekena, K.; Lazennec, G.; Martini, P.G.;
McInerney, E.M.; Delage-Mourroux, R.; Weis, K.; et al. Estrogen receptors: selective ligands, partners,
and distinctive pharmacology. Recent Prog. Horm. Res. 2000, 55, 163–193. [PubMed]
2. Setchell, K.D. Soy isoflavones—Benefits and risks from nature's selective estrogen receptor modulators
(SERMs). J. Am. Coll. Nutr. 2001, 20, 354S–362S. [CrossRef] [PubMed]
3. Zhang, Y.; Dong, S.; Wang, H.; Tao, S.; Kiyama, R. Biological Impact of Environmental Polycyclic Aromatic
Hydrocarbons (ePAHs) as Endocrine Disruptors. Environ. Pollut. 2016, 213, 809–824. [CrossRef] [PubMed]
4. Hsieh, J.H.; Sedykh, A.; Huang, R.; Xia, M.; Tice, R.R. A Data Analysis Pipeline Accounting for Artifactsin
Tox21 Quantitative High-Throughput Screening Assays. J. Biomol. Screen 2015, 20, 887–897. [CrossRef]
[PubMed]
5. United Environmental Protection Agency. Toxicology Testing in the 21st Century (Tox21). Available online:
http://www.epa.gov/chemical-research/toxicology-testing-21st-century-tox21 (accessed on 16 April 2017).
6. Attene-Ramos, M.S.; Miller, N.; Huang, R.; Michael, S.; Itkin, M.; Kavlock, R.J.; Austin, C.P.; Shinn, P.;
Simeonov, A.; Tice, R.R.; et al. The Tox21 Robotic Platform for the Assessment of Environmental
Chemicals-From Vision to Reality. Drug Discov. Today 2013, 18, 716–723. [CrossRef] [PubMed]
7. Gohlke, J.M.; Thomas, R.; Zhang, Y.; Rosenstein, M.C.; Davis, A.P.; Murphy, C.; Becker, K.G.; Mattingly, C.J.;
Portier, C.J. Genetic and environmental pathways to complex diseases. BMC Syst. Biol. 2009, 3, 46. [CrossRef]
[PubMed]
8. National Center for Advancing Translational Sciences. Tox21 Data Challenge 2014. Available online:
https://tripod.nih.gov/tox21/challenge/index.jsp (accessed on 16 April 2017).
9. Uesawa, Y. Rigorous Selection of Random Forest Models for Identifying Compounds that Activate
Toxicity-Related Pathways. Front. Environ. Sci. 2016, 4. [CrossRef]
10. Mansouri, K.; Abdelaziz, A.; Rybacka, A.; Roncaglioni, A.; Tropsha, A.; Varnek, A.; Zakharov, A.; Worth, A.;
Richard, A.M.; Grulke, C.M.; et al. CERAPP: Collaborative Estrogen Receptor Activity Prediction Project.
Environ. Health Perspect. 2016, 124, 1023–1033. [CrossRef] [PubMed]
11. Chou, K.C. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol.
2011, 273, 236–247. [CrossRef] [PubMed]
Molecules 2017, 22, 675 9 of 10

12. Zhu, P.P.; Li, W.C.; Zhong, Z.J.; Deng, E.Z.; Ding, H.; Chen, W.; Lin, H. Predicting the subcellular localization
of mycobacterial proteins by incorporating the optimal tripeptides into the general form of pseudo amino
acid composition. Mol. Biosyst. 2015, 11, 558–563. [CrossRef] [PubMed]
13. Ding, C.; Yuan, L.F.; Guo, S.H.; Lin, H.; Chen, W. Identification of mycobacterial membrane proteins and their
types using over-represented tripeptide compositions. J. Proteom. 2012, 77, 321–328. [CrossRef] [PubMed]
14. Lin, H.; Chen, W.; Yuan, L.F.; Li, Z.Q.; Ding, H. Using over-represented tetrapeptides to predict protein
submitochondria locations. Acta Biotheor. 2013, 61, 259–268. [CrossRef] [PubMed]
15. Tang, H.; Su, Z.D.; Wei, H.H.; Chen, W.; Lin, H. Prediction of cell-penetrating peptides with feature selection
techniques. Biochem. Biophys. Res. Commun. 2016, 477, 150–154. [CrossRef] [PubMed]
16. Lin, H.; Li, Q.Z. Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci. 2011,
130, 91–100. [CrossRef] [PubMed]
17. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
18. Zhao, X.; Zou, Q.; Liu, B.; Liu, X. Exploratory predicting protein folding model with random forest and
hybrid features. Curr. Proteom. 2014, 11, 289–299. [CrossRef]
19. Liao, Z.; Ju, Y.; Zou, Q. Prediction of G-protein-coupled receptors with SVM-Prot features and random forest.
Scientifica 2016. [CrossRef] [PubMed]
20. Chemical Computing Group. MOE: Molecular Operating Environment. Available online: http://www.
chemcomp.com/ (accessed on 16 April 2017).
21. ChemAxon Kft. Budapest, Hungary. Available online: http://www.chemaxon.com (accessed on 16
April 2017).
22. SAS. JMP. Available online: http://www.jmp.com/ja_jp/home.html (accessed on 16 April 2017).
23. Yang, H.; Tang, H.; Chen, X.X.; Zhang, C.J.; Zhu, P.P.; Ding, H.; Chen, W.; Lin, H. Identification of Secretory
Proteins in Mycobacterium tuberculosis Using Pseudo Amino Acid Composition. Biomed. Res. Int. 2016,
2016, 5413903. [CrossRef] [PubMed]
24. Zhang, C.J.; Tang, H.; Li, W.C.; Lin, H.; Chen, W.; Chou, K.C. iOri-Human: Identify human origin of
replication by incorporating dinucleotide physicochemical properties into pseudo nucleotide composition.
Oncotarget 2016, 7, 69783–69793. [CrossRef] [PubMed]
25. Ding, H.; Li, D. Identification of mitochondrial proteins of malaria parasite using analysis of variance.
Amino Acids 2015, 47, 329–333. [CrossRef] [PubMed]
26. Lin, H.; Ding, H.; Guo, F.B.; Zhang, A.Y.; Huang, J. Predicting subcellular localization of mycobacterial
proteins by using Chou’s pseudo amino acid composition. Protein. Pept. Lett. 2008, 15, 739–744. [CrossRef]
[PubMed]
27. Lin, H.; Ding, C.; Song, Q.; Yang, P.; Ding, H.; Deng, K.J.; Chen, W. The prediction of protein structural class
using averaged chemical shifts. J. Biomol. Struct. Dyn. 2012, 29, 643–649. [CrossRef] [PubMed]
28. Chou, K.C.; Zhang, C.T. Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 1995, 30,
275–349. [CrossRef] [PubMed]
29. Lin, H.; Ding, H.; Guo, F.B.; Huang, J. Prediction of subcellular location of mycobacterial protein using
feature selection techniques. Mol. Divers. 2010, 14, 667–671. [CrossRef] [PubMed]
30. Lin, H.; Chen, W. Prediction of thermophilic proteins using feature selection technique. J. Microbiol. Methods
2011, 84, 67–70. [CrossRef] [PubMed]
31. Yuan, L.F.; Ding, C.; Guo, S.H.; Ding, H.; Chen, W.; Lin, H. Prediction of the types of ion channel-targeted
conotoxins based on radial basis function network. Toxicol. In Vitro 2013, 27, 852–856. [CrossRef] [PubMed]
32. Ding, H.; Feng, P.M.; Chen, W.; Lin, H. Identification of bacteriophage virion proteins by the ANOVA feature
selection and analysis. Mol. Biosyst. 2014, 10, 2229–2235. [CrossRef] [PubMed]
33. Chen, X.X.; Tang, H.; Li, W.C.; Wu, H.; Chen, W.; Ding, H.; Lin, H. Identification of Bacterial Cell Wall Lyases
via Pseudo Amino Acid Composition. Biomed. Res. Int. 2016, 2016, 1654623. [CrossRef] [PubMed]
34. Ding, H.; Luo, L.; Lin, H. Prediction of cell wall lytic enzymes using Chou’s amphiphilic pseudo amino acid
composition. Protein Pept. Lett. 2009, 16, 351–355. [CrossRef] [PubMed]
35. Rao, J.N.; Scott, A.J. On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated
from Survey Data. Ann. Stat. 1984, 12, 46–60. [CrossRef]
Molecules 2017, 22, 675 10 of 10

36. List of Molecular Descriptors Calculated by Dragon. Available online: http://www.talete.mi.it/products/


dragon_molecular_descriptor_list.pdf (accessed on 16 April 2017).

Sample Availability: Not Available.

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
