https://doi.org/10.1093/bib/bbab277
Problem Solving Protocol
Abstract
Since 2015, a fast-growing number of deep learning–based methods have been proposed for protein–ligand binding site
prediction and many have achieved promising performance. These methods, however, neglect the imbalanced nature of
binding site prediction problems. Traditional data-based approaches for handling data imbalance employ linear
interpolation of minority class samples. Such approaches may not be fully exploited by deep neural networks on
downstream tasks. We present a novel technique for balancing input classes by developing a deep neural network–based
variational autoencoder (VAE) that aims to learn important attributes of the minority classes concerning nonlinear
combinations. After learning, the trained VAE was used to generate new minority class samples that were later added to the
original data to create a balanced dataset. Finally, a convolutional neural network was used for classification, for which we
assumed that the nonlinearity could be fully integrated. As a case study, we applied our method to the identification of FAD-
and FMN-binding sites of electron transport proteins. Compared with the best classifiers that use traditional machine
learning algorithms, our models obtained a great improvement on sensitivity while maintaining similar or higher levels of
accuracy and specificity. We also demonstrate that our method is better than other data imbalance handling techniques,
such as SMOTE, ADASYN, and class weight adjustment. Additionally, our models also outperform existing predictors in
predicting the same binding types. Our method is general and can be applied to other data types for prediction problems
with moderate-to-heavy data imbalances.
Key words: variational autoencoder; convolutional neural network; protein–ligand binding site prediction; data imbalance handling; electron transport proteins
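The balancing scheme summarized in the abstract — train a VAE on minority-class samples, draw new latent vectors from the learned Gaussian, decode them, and top the minority class up to the majority count — can be sketched as follows. This is an illustration only: `decode` and `W` are hypothetical stand-ins for a trained decoder, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a trained VAE (hypothetical, for illustration only):
# a real model would learn the decoder weights and the latent statistics.
latent_dim, feature_dim = 4, 8
W = rng.normal(size=(latent_dim, feature_dim))

def decode(z):
    # Stub decoder: a fixed linear map instead of a trained network.
    return z @ W

# Imbalanced toy data: 100 majority samples vs 10 minority samples.
X_major = rng.normal(size=(100, feature_dim))
X_minor = rng.normal(loc=2.0, size=(10, feature_dim))

# Latent Gaussian statistics the encoder is assumed to have learned
# for the minority class.
mu = np.zeros(latent_dim)
sigma = np.ones(latent_dim)

# Generate exactly enough synthetic minority samples to balance the classes.
n_needed = len(X_major) - len(X_minor)
z = mu + sigma * rng.normal(size=(n_needed, latent_dim))
X_synth = decode(z)

X_minor_balanced = np.vstack([X_minor, X_synth])
print(X_minor_balanced.shape)  # (100, 8): minority class now matches majority
```

In the paper's setting a trained convolutional decoder replaces `decode`, and, as the methods note, only the training data is balanced — the independent test sets stay imbalanced.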
Trinh-Trung-Duong Nguyen is now a postdoctoral researcher at the Computer Science Department of Yuan Ze University, Taiwan. She is interested in applications of machine (deep) learning in protein bioinformatics.
Duc-Khanh Nguyen received an M.S. degree from the Department of Information Management, Yuan Ze University, Taiwan. He is now working toward the Ph.D. degree in the Department of Information Management, Yuan Ze University. His current research interests include applying artificial intelligence, machine learning and deep learning in smart manufacturing and healthcare.
Yu-Yen Ou is an Associate Professor in the Department of Computer Science and Engineering, Graduate Program in Biomedical Informatics, Yuan Ze University, Taiwan. He received the B.S. degree from the Department of Math and Computer Science Education, Taipei Municipal Teachers College, and the Ph.D. degree from the Department of Computer Science and Information Engineering, National Taiwan University, Taiwan. His fields of professional interest are bioinformatics, machine learning and data mining.
Submitted: 31 March 2021; Received (in revised form): 29 June 2021
© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
2 Nguyen et al.
pathogenesis of diseases, providing insights into drug discovery and design. However, in the majority of cases, experimental details related to protein–ligand interactions are lacking. The latest version (5 February 2021) of the BioLiP database [1] shows

of a diverse class of generative deep learning architectures, VAE-based generative models have achieved good results generating images [19], speech [20], sentences [21] and de novo molecular designs [22], to name a few. With the VAE, we hoped to generate
negative training samples, are input into a convolutional neural network for training the prediction models (part C). Finally, the trained models from Step 3 are used for predicting with unseen samples as an independent test (part D).

Benchmark dataset

We collected raw sequences of electron transport proteins from the Universal Protein Resource (UniProt) [31] database (release 2021-01). To create high-quality, unbiased datasets, we adopted rigorous criteria for selecting data, as follows: (i) The query was structured so that the resulting sequences have been reviewed and were not fragmented. After this exclusion, 475 FAD- and 466 FMN-binding electron transport protein sequences were retrieved. (ii) Proteins lacking FAD- and FMN-binding sites were also excluded, which reduced the number of FAD- and FMN-binding sequences to 228 and 105, respectively. (iii) PSI-BLAST [32] was applied to discard sequences with a similarity level greater than 20%. After this step, 32 FAD- and 11 FMN-binding sequences remained and were used in this study. (iv) Each dataset was divided into two parts: a cross-validation portion (26 FAD- and 8 FMN-binding sequences) for model construction and an independent test portion (6 FAD- and 3 FMN-binding sequences) for model evaluation. (v) The description files with binding positions were used to determine bound and unbound residues, with only the annotations determined by experts' experiments. (Evidence code ECO 0000250 was not applied.) Supplementary Table S2, available online at http://bib.oxfordjournals.org/, of the Supplementary data displays the number of binding sites in each part.

Representations of FAD- and FMN-binding electron transport proteins

PSSM is a matrix containing positive and negative integers that represent motifs in biological sequences. Large positive scores often indicate functional residues such as active site residues or internal interaction sites. Therefore, in binding site prediction problems, PSSM has been an indispensable part of feature extraction processes. Chen et al. [33, 34] discovered that excluding PSSM profiles resulted in a greater decrease in prediction performance than excluding other feature types.

We used a nonredundant protein database and PSI-BLAST [32] software with three iterations to convert raw sequences into PSSMs. Then, for each PSSM profile, we slid a window of size S along the matrix to generate N smaller matrices of size n × 20, with N being the length of the PSSM. If the center position is a binding residue, the corresponding matrix is regarded as a positive sample, and as a negative sample otherwise. In this way, the prediction for the central amino acid is supported by the information of the surrounding amino acids. Many groups often perform a preliminary experiment to find the best value for S among common choices of odd numbers, i.e. from 15 to 23. We adopted this approach using a quick and simple Random Forest classification algorithm with computationally efficient window sizes from 11 to 19. We found that the area under the receiver operating characteristic curve performance on both datasets was highest with a window size of 15. (The performance details are presented in Supplementary Figure S1, available online at http://bib.oxfordjournals.org/, of the Supplementary data.) We accordingly chose a window size of 15 for further experiments.

Imbalanced data handling with a variational autoencoder

An autoencoder is a self-supervised neural network that learns to encode the input x into a low-dimensional vector z, then decodes and reconstructs the data so that the output d(z) is as close to the input as possible. Dimension reduction for visualization and noise reduction are common applications of autoencoders. In an autoencoder, the cost of the task is the reconstruction loss, and the low-dimensional encoded values are referred to as the latent representation or bottleneck.

Extending the idea of the autoencoder, a VAE regularizes the encoding distribution during training so that the latent
space has good properties for enabling new data generation. By sampling different points from the latent space and decoding them, many new data points bearing similar characteristics to the input can be created for use in downstream tasks. In order to generate new high-quality data points, a constraint is imposed on learning the latent space so that it stores the latent attribute as a probability distribution. Therefore, the loss of this task consists of two parts: the reconstruction loss, which is similar to that of the autoencoder, and the latent loss. Figure 3 describes the information flow of the two architectures for comparison.

We constructed a VAE with encoder and decoder convolutional neural networks to generate new samples for the minority classes (FAD- and FMN-binding sites). Part B of Figure 2 illustrates this process. The representation based on a position-specific scoring matrix input was flattened to a vector and then inputted into the encoder. We assumed that the latent distribution is a Gaussian distribution, and we trained the VAE to learn the mean and covariance of the distribution jointly with the goal of reconstructing the output with minimal reconstruction errors. After training, our encoder outputted the means and covariances from which we sampled new latent vectors for generating new samples by passing them through the decoder. The number of generated samples for each binding site prediction problem was calculated so that we obtained a balanced input dataset for training. We kept the independent datasets intact, which means that they remained imbalanced.

Variational autoencoder structure

The structure of the convolutional autoencoder contains the basic building blocks of convolutional neural networks, such as
Addressing data imbalance using deep learning
a convolutional layer, a max pooling layer (or upsampling layer), a batch normalization layer and an activation layer. The input layer is a vector of size 15 × 20 that connects to the first 1D convolutional layer of the encoder and is followed by a batch normalization layer and a max pooling layer. We used the Leaky Rectified Linear Unit (LeakyReLU) activation function in all layers except the final layer. We also carried out the experiment with different numbers of intermediate convolutional layers having 1024, 512 and 256 nodes interspersed with batch normalization layers and max pooling layers. The output of the encoder is an m-dimensional latent layer, with m being a hyperparameter. The decoder was designed to be symmetric with the encoder. In training our VAE, the loss function was composed of a reconstruction term, which makes the encoding–decoding scheme efficient, and a regularization term, which makes the latent space regular. The first term is specified as the mean squared error and the second term is the Kullback–Leibler divergence [35], which measures the difference between two distributions. The loss function is then formulated as

VAE_loss = ||x − d(z)||² + KL[N(μx, σx), N(0, 1)]    (1)

where N(μx, σx) and N(0, 1) are the learned latent distribution and the standard normal distribution, respectively.

Convolutional neural network structure

Our 1D convolutional neural networks (1DCNN) serve as the classifiers, with the input being the balanced datasets during the training processes. In general, the 1DCNN scheme is Input-(1DConvolution-Dropout-MaxPooling-) × k-Flatten-FullyConnected-Dropout-FullyConnected-Softmax, with k being learnable and less than 4. We also used the LeakyReLU activation function in all layers except the final layer. The Softmax layer contains two nodes for the binding and nonbinding classes. In training the CNN classifiers, we used the binary cross-entropy loss.

Evaluation metrics

Prediction performance was evaluated by common standard metrics such as accuracy, specificity, sensitivity, Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic curve. In addition, we used the geometric mean (G-mean) of the accuracy rates calculated separately for the majority and minority classes as a single metric for comparison and for tuning the models. The G-mean has been used to evaluate the performance of learning algorithms on imbalanced datasets in many studies [36–43]. Formulas for all the metrics can be seen in Supplementary Figure S2, available online at http://bib.oxfordjournals.org/, of the Supplementary data.

Results and discussion

Visual representation of positive samples generated from the VAEs

To understand the distribution of generated samples from the VAEs in comparison with the distribution of the original data, we used t-SNE, a popular algorithm for dimension reduction, to transform samples of 300 dimensions into samples of 2 dimensions and plotted them in a two-dimensional map. Supplementary Figures S3 and S4, available online at http://bib.oxfordjournals.org/, in the Supplementary data present the projection of original positive samples and generated positive samples from the VAE for the FAD- and FMN-binding datasets, respectively. We can observe from the figures that the distribution of generated positive samples is similar to that of the real positive samples. This means that the generated samples bear the characteristics of the original data and can be used efficiently to balance the datasets.

Hyperparameter tuning processes

As our process involves two different deep neural networks, a VAE for generating new samples and a CNN for classification, much effort was put into choosing and tuning the hyperparameters
Figure 4. (A) G-mean scores on an FAD-binding independent test over different latent vector dimensions. (B) G-mean scores on an FMN-binding independent test over
different latent vector dimensions.
of the two networks. However, both networks employed convolutional neural networks as their components. Basically, we used the training data for training and tuning the hyperparameters and the independent test data for evaluating and comparing the optimal learned models. We tried different architectures (described above), and a grid search over the hyperparameters was performed for each architecture. For the VAEs, we considered three hyperparameters, namely, the number of epochs, the batch size and the learning rate. For the CNNs, apart from the hyperparameters used in the VAEs, we also considered dropout values and weight decays. The optimization algorithm for both neural networks was fixed with the Adam optimizer [44]. Moreover, we present in the Supplementary data the search ranges for all hyperparameters, the best hyperparameter sets (Supplementary Tables S3 and S4, available online at http://bib.oxfordjournals.org/), and the best architectures for the variational
autoencoder and the 1DCNN-based classifier (Supplementary Tables S5 and S6, available online at http://bib.oxfordjournals.org/).

Effectiveness of varying sample sizes in the latent space

After obtaining the optimal models, we were interested in analyzing the effect of varying the vector dimensions of the samples in Z on the prediction of unseen samples using the independent datasets. We constructed the VAE with various dimensions, denoted as dim(z), from 2 to 64. For each dimension, we repeated the prediction experiments 10 times and calculated the average, maximum, minimum and standard deviation of the G-mean performance. (Please see Figure 4A and B for FAD- and FMN-binding site predictions on the independent datasets, respectively.) From these figures, we observe that the FAD-binding site prediction model achieves the highest maximum and highest average G-mean when dim(z) is equal to 16. Regarding FMN-binding site prediction, the models with dim(z) = 16 also obtained the highest average score and the lowest standard deviation. Considering the robustness of the models, we chose the models with the highest average G-means (dim(z) = 16 in both cases) for the next analysis.

Effectiveness of various contributions of reconstruction loss and latent loss

In this analysis, we varied the contribution of the reconstruction loss and the latent loss to the total loss of the VAE models. We specified the total loss of the VAE as

Total loss = α · reconstruction_loss + (1 − α) · latent_loss    (2)

with α chosen from a list of 0.1, 0.2, 0.3, 0.4 and 0.5. The best models from the previous analysis were chosen for this analysis on the independent datasets. We report the best G-mean scores over five experiments for each value of α in Figure 5, where it can be seen that the distributions of reconstruction loss and latent loss have an impact on the prediction performance. The G-mean scores range from 0.72 to 0.85 for FAD-binding site prediction and from 0.67 to 0.71 for FMN-binding site prediction. However, for both datasets, the trends are not clear.

Performance comparison with other data imbalance handling techniques

In this section, we report the results of our analysis on the effectiveness of using a variational autoencoder for handling data imbalances. We employed a variety of data imbalance handling algorithms, such as SMOTE, BorderlineSMOTE, ADASYN and class weight adjustment, to generate balanced datasets and classified them with the convolutional neural networks. We also carried out experiments where no data imbalance handling technique was used. We also searched for the optimal hyperparameters for the convolutional neural networks. To make our analysis results more robust and reliable, we repeated each experiment 10 times and averaged the results. Table 1 displays the performance comparison for the independent test, with the best sensitivity and MCC values highlighted in bold. All standard deviation (STD) values are also presented.

From Table 1, we observe that the combination of VAE and CNN yields the best G-mean, sensitivity and MCC performance on both the FAD and FMN independent tests. For the FAD dataset, our model obtained the best G-mean (0.72), which is far better than the second-best technique (CNN + Classweight, with G-mean = 0.64). For the FMN dataset, our method is also better than the best alternative approach (CNN + BorderlineSMOTE, with G-mean = 0.68). The reason for the smaller improvement may be the smaller number of samples in the FMN training dataset compared to the FAD training dataset.

Performance comparison with traditional machine learning algorithms

We further compared the effectiveness of our method with several traditional machine learning algorithms, namely, Random Forest, Support Vector Machine, XGBoost and AdaBoost (previously called AdaBoost.M1). For each of these algorithms, we searched for the best hyperparameters as well as the best data imbalance handling algorithms using a grid search and recorded the best G-mean performance. Similar to the experiments reported in Table 1, we repeated each experiment 10 times and averaged the best results. Table 2 shows the independent test performance on five standard metrics for the FAD and FMN datasets, with the highest scores highlighted in bold. Our approach, with G-mean values of 0.76 and 0.73 and MCC values of 0.48 and 0.51 on the FAD and FMN independent test data, respectively, outperforms
Table 1. Performance comparison among different data imbalance handling techniques on independent tests. (Each experiment was repeated 10 times and average results are reported in the format m ± d, where m is the average and d is the standard deviation across the 10 experiments)

FAD
Method | Accuracy | Specificity | Sensitivity | MCC | G-mean
CNN + ADASYN | 0.98 ± 0.0 | 1.00 ± 0.0 | 0.40 ± 0.09 | 0.52 ± 0.04 | 0.62 ± 0.07
CNN + SMOTE | 0.98 ± 0.0 | 1.00 ± 0.0 | 0.36 ± 0.07 | 0.49 ± 0.03 | 0.60 ± 0.06
CNN + BorderlineSMOTE | 0.98 ± 0.0 | 1.00 ± 0.0 | 0.40 ± 0.06 | 0.53 ± 0.04 | 0.63 ± 0.05
CNN + Classweight | 0.95 ± 0.01 | 0.96 ± 0.01 | 0.44 ± 0.11 | 0.26 ± 0.07 | 0.64 ± 0.08
CNN + NoBalance | 0.98 ± 0.0 | 1.00 ± 0.0 | 0.37 ± 0.08 | 0.53 ± 0.06 | 0.61 ± 0.06
CNN + VAE (proposed model) | 0.98 ± 0.01 | 0.99 ± 0.01 | 0.52 ± 0.01 | 0.48 ± 0.05 | 0.72 ± 0.07

FMN
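As a quick consistency check on Table 1, the G-mean is the geometric mean of the per-class accuracy rates, which for a binary problem reduces to sqrt(sensitivity × specificity). Assuming the columns are accuracy, specificity, sensitivity, MCC and G-mean (the five metrics named in the text), the reported FAD averages agree with this formula to within rounding:

```python
import math

# (specificity, sensitivity, reported G-mean) read from the FAD rows of
# Table 1; entries are averages over 10 runs, so small rounding drift
# between computed and reported values is expected.
rows = {
    "CNN + ADASYN":      (1.00, 0.40, 0.62),
    "CNN + Classweight": (0.96, 0.44, 0.64),
    "CNN + VAE":         (0.99, 0.52, 0.72),
}

for name, (spec, sens, reported) in rows.items():
    gmean = math.sqrt(sens * spec)  # binary-case G-mean
    print(f"{name}: computed {gmean:.2f}, reported {reported:.2f}")
```

The same check applies to the FMN rows of Table 2 (e.g. the SVM row, where sensitivity = specificity = 0.69 gives G-mean 0.69 exactly).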
Table 2. Comparison between the proposed method and traditional machine learning algorithms for FAD- and FMN-binding site prediction. (Each experiment was repeated 10 times and average results are reported in the format m ± d, where m is the average and d is the standard deviation across the 10 experiments)

FAD
Method | Accuracy | Specificity | Sensitivity | MCC | G-mean
Random Forest | 0.98 ± 0.0 | 0.99 ± 0.0 | 0.44 ± 0.0 | 0.43 ± 0.0 | 0.66 ± 0.0
XGBoost | 0.98 ± 0.0 | 0.99 ± 0.0 | 0.42 ± 0.0 | 0.43 ± 0.0 | 0.65 ± 0.0
SVM | 0.98 ± 0.0 | 0.99 ± 0.0 | 0.53 ± 0.0 | 0.46 ± 0.0 | 0.72 ± 0.0
AdaBoost | 0.89 ± 0.0 | 0.90 ± 0.0 | 0.56 ± 0.0 | 0.20 ± 0.0 | 0.71 ± 0.0
Proposed model | 0.97 ± 0.01 | 0.98 ± 0.01 | 0.59 ± 0.1 | 0.48 ± 0.06 | 0.76 ± 0.06

FMN
Method | Accuracy | Specificity | Sensitivity | MCC | G-mean
Random Forest | 0.84 ± 0.0 | 0.92 ± 0.0 | 0.36 ± 0.0 | 0.29 ± 0.0 | 0.57 ± 0.0
XGBoost | 0.85 ± 0.0 | 0.93 ± 0.0 | 0.40 ± 0.0 | 0.36 ± 0.0 | 0.61 ± 0.0
SVM | 0.69 ± 0.0 | 0.69 ± 0.0 | 0.69 ± 0.0 | 0.28 ± 0.0 | 0.69 ± 0.0
AdaBoost | 0.66 ± 0.0 | 0.67 ± 0.0 | 0.64 ± 0.0 | 0.22 ± 0.0 | 0.65 ± 0.0
Proposed model | 0.88 ± 0.01 | 0.94 ± 0.01 | 0.57 ± 0.03 | 0.51 ± 0.02 | 0.73 ± 0.02

(Optimal hyperparameters — FAD: Random Forest: num features = 50, num trees = 500, imbalance data handling algorithm (IDHA) = BorderlineSMOTE; XGBoost: learning rate = 0.1, n_estimators = 100, max_depth = 10, early_stopping_rounds = 10, IDHA = ADASYN; SVM: c = 0.01, g = 0.01, IDHA = BorderlineSMOTE; AdaBoost: num features = 500, IDHA = BorderlineSMOTE. FMN: Random Forest: num features = 100, num trees = 200, IDHA = BorderlineSMOTE or SMOTE; XGBoost: learning rate = 0.1, n_estimators = 1000, max_depth = 5, early_stopping_rounds = 10, IDHA = ADASYN; SVM: c = 0.001, g = 0.01, IDHA = ADASYN; AdaBoost: num features = 100, IDHA = ADASYN.)
The highest scores for each metric are highlighted in bold.
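Tables 1 and 2 benchmark the proposed VAE against interpolation-based oversamplers (SMOTE, BorderlineSMOTE, ADASYN). As the abstract notes, these traditional approaches create minority samples by linear interpolation between existing ones; a minimal SMOTE-style sketch of that idea follows (a simplification for illustration, not the reference implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def smote_like(X_minor, n_new, k=5):
    """Generate n_new synthetic samples by linear interpolation between
    a randomly chosen minority sample and one of its k nearest minority
    neighbors (the core idea behind SMOTE-family oversamplers)."""
    n = len(X_minor)
    out = []
    for _ in range(n_new):
        i = rng.integers(n)
        # Euclidean distances from sample i to all minority samples.
        d = np.linalg.norm(X_minor - X_minor[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbors)
        lam = rng.random()                  # interpolation factor in [0, 1)
        out.append(X_minor[i] + lam * (X_minor[j] - X_minor[i]))
    return np.array(out)

X_minor = rng.normal(size=(10, 8))
X_synth = smote_like(X_minor, n_new=90)
print(X_synth.shape)  # (90, 8)
```

Because every synthetic point lies on a segment between two real minority samples, the generated data is confined to linear combinations — the limitation the paper's nonlinear VAE-based generator is designed to overcome.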
all surveyed algorithms in predicting FAD- and FMN-binding sites. Furthermore, we obtained sensitivity increases of 6–15% on FAD and 17–21% on FMN while maintaining similar levels of accuracy and specificity. However, we found that the results of the traditional machine learning algorithms are more stable, with standard deviations equal to 0. We suspect that the reason for this is the greater number of parameters in the VAEs and CNNs compared to those of the traditional machine learning algorithms.

Comparison with existing FAD- and FMN-binding site predictors

Supplementary Table S7, available online at http://bib.oxfordjournals.org/, in the Supplementary data lists previous studies that predicted FAD- and FMN-binding sites in electron transport proteins and in general proteins. To make a fair comparison with respect to prediction performance, we need to use either the same datasets or the same methods. However, the same table shows that some of the datasets and working prediction servers were unavailable when we performed the comparison. Therefore, we inputted our independent datasets into FADPred by Mishra et al. [45] and LPIcom by Singh et al. [46] and reported the performance comparison in Supplementary Table S8, available online at http://bib.oxfordjournals.org/. We found that our approach outperformed the previous methods in almost all metrics.

Reproducing the results

To help reproduce the results, we have freely provided the VAE_CNN_ETCBinder source codes and data at https://github.com/khucnam/VAE_CNN_ETCBinder. Using our source codes, biologists and other researchers can rebuild the model on their own local machines for prediction. In the near future, we hope to provide users lacking programming skills with a web-based prediction server.

Conclusion

In this study, we integrated a novel data imbalance handling technique based on a VAE with a CNN to improve the identification of binding sites in electron transport proteins. Using
in cohorts of head and neck cancer patients. Eur J Nucl Med Mol Imaging 2020;47(12):2826–35.
18. Molinari F, Raghavendra U, Gudigar A, et al. An efficient data mining framework for the characterization of symp-
31. Apweiler R, Bairoch A, Wu CH, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2004;32(suppl_1):D115–9.
32. Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST