

Briefings in Bioinformatics, 00(00), 2021, 1–10

https://doi.org/10.1093/bib/bbab277
Problem Solving Protocol



Addressing data imbalance problems in
ligand-binding site prediction using a variational
autoencoder and a convolutional neural network
Trinh-Trung-Duong Nguyen, Duc-Khanh Nguyen and Yu-Yen Ou
Corresponding author: Yu-Yen Ou, Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 32003, Taiwan.
Tel.: +886-3-4638800 #2185; E-mail: yien@saturn.yzu.edu.tw

Abstract
Since 2015, a fast-growing number of deep learning–based methods have been proposed for protein–ligand binding site
prediction and many have achieved promising performance. These methods, however, neglect the imbalanced nature of
binding site prediction problems. Traditional data-based approaches for handling data imbalance employ linear
interpolation of minority class samples. Such approaches may not be fully exploited by deep neural networks on
downstream tasks. We present a novel technique for balancing input classes by developing a deep neural network–based
variational autoencoder (VAE) that aims to learn important attributes of the minority classes concerning nonlinear
combinations. After learning, the trained VAE was used to generate new minority class samples that were later added to the
original data to create a balanced dataset. Finally, a convolutional neural network was used for classification, for which we
assumed that the nonlinearity could be fully integrated. As a case study, we applied our method to the identification of FAD-
and FMN-binding sites of electron transport proteins. Compared with the best classifiers that use traditional machine
learning algorithms, our models obtained a great improvement on sensitivity while maintaining similar or higher levels of
accuracy and specificity. We also demonstrate that our method is better than other data imbalance handling techniques,
such as SMOTE, ADASYN, and class weight adjustment. Additionally, our models outperform existing predictors in
predicting the same binding types. Our method is general and can be applied to other data types for prediction problems
with moderate-to-heavy data imbalances.

Key words: variational autoencoder; convolutional neural network; protein–ligand binding site prediction; data imbalance handling; electron transport proteins

Trinh-Trung-Duong Nguyen is a postdoctoral researcher in the Computer Science Department of Yuan Ze University, Taiwan. She is interested in the application of machine (deep) learning in protein bioinformatics.
Duc-Khanh Nguyen received an M.S. degree from the Department of Information Management, Yuan Ze University, Taiwan. He is now working toward a Ph.D. degree in the same department. His current research interests include applying artificial intelligence, machine learning and deep learning in smart manufacturing and healthcare.
Yu-Yen Ou is an Associate Professor in the Department of Computer Science and Engineering, Graduate Program in Biomedical Informatics, Yuan Ze University, Taiwan. He received a B.S. degree from the Department of Math and Computer Science Education, Taipei Municipal Teachers College, and a Ph.D. degree from the Department of Computer Science and Information Engineering, National Taiwan University, Taiwan. His fields of professional interest are bioinformatics, machine learning and data mining.
Submitted: 31 March 2021; Received (in revised form): 29 June 2021
© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Introduction

Proteins perform their biological functions through interactions with other molecules. Accurately locating interaction sites, or ligand-binding sites (LBSs), on proteins is therefore an initial step in understanding the biological mechanisms underlying protein activities. In the areas of molecular docking, pharmaceutical interaction, compound design, prediction of ligand affinity and even molecular dynamics, LBSs have received considerable attention. Identifying LBSs not only helps explore intermolecular mechanisms, but it also effectively helps explain the
pathogenesis of diseases, providing insights into drug discovery and design. However, in the majority of cases, experimental details related to protein–ligand interactions are lacking. The latest version (5 February 2021) of the BioLiP database [1] shows that only 299 051 of over 529 047 (56.5%) collected entries have regular ligand information, which means that addressing the ligand-binding site problem is an important and urgent task for further related work. Given the time-consuming and inefficient process of biochemical experiments for identifying LBSs on proteins, an effective computational approach is essential.

Over the last 5 years, deep learning has demonstrated its power in protein–ligand binding prediction problems due to its outstanding performance compared to conventional machine learning methods. Many groups have exploited deep neural networks, either in a single architecture or in hybrid architectures and ensemble approaches, to extract discriminative features for classifying residues of interest into binding or nonbinding types. The ligand types of interest are metals (Ca2+, Fe3+, Mg2+, Mn2+, Na+, Zn2+, etc.), biologically relevant molecules (ADP, ATP, FMN, FAD, GTP, heme, NAD, PO43−, SO42−, etc.) and nucleic acids. In Supplementary Table S1, available online at http://bib.oxfordjournals.org/, of the Supplementary data, we list 18 studies published from 2016 to the beginning of 2021 that use deep learning–based models to predict protein–ligand binding sites. We can see from the list that only 6 of the 18 studies (one-third) employed at least one data imbalance handling technique.

Binary classification on imbalanced datasets usually suffers from biased decisions in which most standard classification algorithms favor the majority class, leading to poor accuracy in minority class prediction. Data imbalance handling techniques mainly fall into one of two categories: a data-based approach or an algorithm-based approach [2]. The data-based, or sampling, approach, which uses the techniques of undersampling the majority class or oversampling the minority class, is more common. As the first technique discards important information from the majority class, the second technique may be the better choice. With oversampling techniques, if new positive samples are created merely by copying samples of the minority class, high false-positive rates may be seen. Improved techniques for generating new positive samples instead create synthetic samples from minority class samples, adding more meaningful information than the duplication process does. SMOTE [3], ADASYN [4] and BorderlineSMOTE [5] are representatives of this approach. In biomedical research, which has a broader scope than ligand-binding site prediction, SMOTE-based and ADASYN-based algorithms have been widely applied in dealing with imbalanced data [6–18]. Despite the popularity of these techniques, it is important to note that the generated synthetic samples are linear combinations of the features of the minority class samples. However, it is well known that one power of deep neural networks is their ability to learn from nonlinearly separable data. Given the fact that most real-world datasets are nonlinearly separable, we assumed that if new samples were generated with the addition of nonlinearity, the downstream tasks (i.e. binary classification using deep neural networks) would be able to fully integrate this nonlinearity and thus learn optimally from the generated data. Based on this idea, we propose a new approach using a variational autoencoder (VAE) to produce minority class samples. To generate new samples and balance the datasets, positive samples representing the input are passed through the encoder and mapped to a continuous latent space, Z, and then a latent representation sampled from Z is decoded by a decoder. The whole process is optimized for two tasks: reconstructing the input from Z and learning the probability distribution of the input. As members of a diverse class of generative deep learning architectures, VAE-based generative models have achieved good results generating images [19], speech [20], sentences [21] and de novo molecular designs [22], to name a few. With the VAE, we hoped to generate new high-quality minority class samples for classification. For the prediction tasks, we employed deep convolutional neural networks to optimize prediction performance.

As a case study, we applied our method to the identification of flavin adenine dinucleotide (FAD) and flavin mononucleotide (FMN) binding sites of electron transport proteins. In electron transport chains (ETCs), composed of four protein complexes (I, II, III and IV) and adenosine triphosphate (ATP) synthase, FMN is one part of Complex I, while FAD is involved in Complex II activities (Figure 1). In both the Krebs cycle and oxidative phosphorylation, FAD acts as an electron carrier: it accepts electrons to form FADH2, which then transfers them to Complex II. Several ATP molecules are generated for each pair of electrons from FADH2 passing through the ETC. FAD also affects enzymes in charge of synthesizing other vital coenzymes such as nicotinamide adenine dinucleotide. Serious riboflavin deficiency may result in insufficient coenzyme levels, poor energy metabolism and consequent energy depletion. Abnormal energy metabolism is known to be associated with many diseases, including Huntington's disease [23], diabetes [24], neuromuscular and neurological disorders and cancer [25]. Moreover, flavoprotein deficiencies occurring in metabolic pathways can cause human disorders such as glutaric acidemia and Leigh syndrome [26]. The identification of the FMN group in Complex I provides a pharmacologically accessible target for delaying aging and treating neurodegenerative diseases such as Parkinson's [27]. The improvement in mitochondrial respiration with FAD supplementation may reduce frataxin deficiency [28]. In the drug design industry, the FMN riboswitch is an emerging target for the development of novel RNA-targeting antibiotics [29], while the generation of FAD analogs is a useful strategy for inhibiting bacterial infection [30]. Given these important roles and effects of FAD- and FMN-binding functions in electron transport proteins, an efficient computational approach for predicting them is essential for biologists and other researchers.

Our contributions are as follows. First, we propose a novel approach based on a deep neural network for handling data imbalances. To the best of our knowledge, we are the first group to apply a variational autoencoder to this long-standing problem. Although we carried out our experiments for the prediction of FAD- and FMN-binding sites of electron transport proteins, the method is general and can be applied to other types of data. Second, we analyzed the effect of varying sample vector dimensions in the latent space Z and the effect of different distributions of the reconstruction and latent losses on the final performance. Third, we benchmarked our proposed approach against conventional techniques for handling imbalanced data (SMOTE, BorderlineSMOTE, ADASYN and class weight adjustment). Finally, we analyzed the effectiveness of our prediction models by comparing them with traditional machine learning algorithms and existing studies on the same binding types.

Materials and methods

Our workflow, depicted in Figure 2, is as follows: (i) First, the raw protein sequences are converted into position-specific scoring matrices (PSSMs) of size 15 × 20 (part A). (ii) Subsequently, only the PSSMs of the positive training samples are flattened and put into the variational autoencoder. The end result of this process is newly generated positive samples (part B). (iii) Next, the generated positive samples, along with the original intact negative training samples, are input into a convolutional neural network for training the prediction models (part C). (iv) Finally, the trained models from step (iii) are used for predicting unseen samples as an independent test (part D).
Figure 1. The roles of FAD and FMN in an electron transport chain.
Benchmark dataset

We collected raw sequences of electron transport proteins from the Universal Protein Resource (UniProt) [31] database (release 2021-01). To create high-quality, unbiased datasets, we adopted rigorous criteria for selecting data, as follows: (i) The query was structured so that the resulting sequences had been reviewed and were not fragmented. After this exclusion, 475 FAD- and 466 FMN-binding electron transport protein sequences were retrieved. (ii) Proteins lacking FAD- and FMN-binding sites were also excluded, which reduced the number of FAD- and FMN-binding sequences to 228 and 105, respectively. (iii) PSI-BLAST [32] was applied to discard sequences with a similarity level greater than 20%. After this step, 32 FAD- and 11 FMN-binding sequences remained and were used in this study. (iv) Each dataset was divided into two parts: a cross-validation portion (26 FAD- and 8 FMN-binding sequences) for model construction and an independent test portion (6 FAD- and 3 FMN-binding sequences) for model evaluation. (v) The description files with binding positions were used to determine bound and unbound residues, using only annotations determined by experts' experiments. (Evidence code ECO 0000250 was not applied.) Supplementary Table S2, available online at http://bib.oxfordjournals.org/, of the Supplementary data displays the number of binding sites in each part.

Representations of FAD- and FMN-binding electron transport proteins

PSSM is a matrix containing positive and negative integers that represent motifs in biological sequences. Large positive scores often indicate functional residues such as active site residues or internal interaction sites. Therefore, in binding site prediction problems, the PSSM has been an indispensable part of feature extraction processes. Chen et al. [33, 34] discovered that excluding PSSM profiles caused a greater decrease in prediction performance than excluding other feature types.

We used a nonredundant protein database and the PSI-BLAST [32] software with three iterations to convert raw sequences into PSSMs. Then, for each PSSM profile, we slid a window of size S along the matrix to generate N smaller matrices of size S × 20, with N being the length of the PSSM. If the center position is a binding residue, the corresponding matrix is regarded as a positive sample, and as a negative sample otherwise. In this way, the prediction for the central amino acid is supported by the information of the surrounding amino acids. Many groups perform a preliminary experiment to find the best value for S among common choices of odd numbers, i.e. from 15 to 23. We adopted this approach using a quick and simple Random Forest classification algorithm with computationally efficient window sizes from 11 to 19. We found that the area under the receiver operating characteristic curve performance on both datasets was highest with a window size of 15. (The performance details are presented in Supplementary Figure S1, available online at http://bib.oxfordjournals.org/, of the Supplementary data.) We accordingly chose a window size of 15 for further experiments.
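As an illustration of this windowing step, the following minimal Python/NumPy sketch extracts labeled windows from a single PSSM. The zero-padding of terminal residues is our assumption for illustration; the text above does not specify how sequence ends are handled.

```python
import numpy as np

def extract_windows(pssm, binding_positions, window_size=15):
    """Slide a window of `window_size` rows over an (N, 20) PSSM.

    Returns one (window_size, 20) sample per residue, labeled 1 if the
    central residue is annotated as binding and 0 otherwise. Terminal
    residues are handled by zero-padding (an assumption, not stated
    in the paper).
    """
    half = window_size // 2
    n = pssm.shape[0]
    padded = np.vstack([np.zeros((half, 20)), pssm, np.zeros((half, 20))])
    samples, labels = [], []
    for i in range(n):
        samples.append(padded[i:i + window_size])          # window centered at residue i
        labels.append(1 if i in binding_positions else 0)  # center residue sets the label
    return np.array(samples), np.array(labels)

# Example: a toy 50-residue PSSM with binding residues at positions 10 and 30
X, y = extract_windows(np.random.randn(50, 20), binding_positions={10, 30})
print(X.shape, int(y.sum()))  # (50, 15, 20) 2
```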
Imbalanced data handling with a variational autoencoder

An autoencoder is a self-supervised neural network that learns to encode the input x into a low-dimensional vector z, then decodes and reconstructs the data so that the output d(z) is as close to the input as possible. Dimension reduction for visualization and noise reduction are common applications of autoencoders. In an autoencoder, the cost of the task is the reconstruction loss, and the low-dimensional encoded values are referred to as the latent representation or bottleneck.
Figure 2. The flowchart of our study.

Extending the idea of the autoencoder, a VAE regularizes the encoding distribution during training so that the latent space has good properties for enabling new data generation. By sampling different points from the latent space and decoding them, many new data points bearing similar characteristics to the input can be created for use in downstream tasks. In order to generate new high-quality data points, a constraint is imposed on learning the latent space so that it stores the latent attributes as a probability distribution. Therefore, the loss of this task consists of two parts: the reconstruction loss, which is similar to that of the autoencoder, and the latent loss. Figure 3 describes the information flow of the two architectures for comparison.

We constructed a VAE with encoder and decoder convolutional neural networks to generate new samples for the minority classes (FAD- and FMN-binding sites). Part B of Figure 2 illustrates this process. The representation based on a position-specific scoring matrix input was flattened to a vector and then inputted into the encoder. We assumed that the latent distribution is a Gaussian distribution, and we trained the VAE to learn the mean and covariance of the distribution jointly with the goal of reconstructing the output with minimal reconstruction error. After training, our encoder outputted the means and covariances from which we sampled new latent vectors for generating new samples by passing them through the decoder. The number of generated samples for each binding site prediction problem was calculated so that we obtained a balanced input dataset for training. We kept the independent datasets intact, which means that they remained imbalanced.
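The generation step described above can be sketched in a few lines. The sketch below assumes a trained Keras-style decoder and the per-sample encoder outputs (means and log-variances); it illustrates the procedure, not the authors' released implementation.

```python
import numpy as np

def generate_positive_samples(decoder, z_mean, z_log_var, n_to_generate):
    """Sample new latent vectors from the learned Gaussians and decode them
    into synthetic positive samples (flattened 15 x 20 PSSM windows).

    `decoder`, `z_mean` and `z_log_var` are assumed to come from a trained
    Keras-style VAE; names and shapes are our assumptions.
    """
    idx = np.random.randint(0, len(z_mean), size=n_to_generate)
    eps = np.random.normal(size=z_mean[idx].shape)
    z = z_mean[idx] + np.exp(0.5 * z_log_var[idx]) * eps   # reparameterized draw
    return decoder.predict(z)                              # decode to new positives

# Balance the training set: generate as many positives as needed to match the
# negatives (the independent test set is left imbalanced, as described above):
# new_pos = generate_positive_samples(decoder, mu, log_var, n_neg - n_pos)
```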
Variational autoencoder structure
Figure 3. The information flow of a variational autoencoder compared to an autoencoder.

The structure of the convolutional variational autoencoder contains the basic building blocks of convolutional neural networks: a convolutional layer, a max pooling layer (or upsampling layer), a batch normalization layer and an activation layer. The input layer is a vector of size 15 × 20 that connects to the first 1D convolutional layer of the encoder and is followed by a batch normalization layer and a max pooling layer. We used the Leaky Rectified Linear Unit (LeakyReLU) activation function in all layers except the final layer. We also carried out experiments with different numbers of intermediate convolutional layers having 1024, 512 and 256 nodes, interspersed with batch normalization layers and max pooling layers. The output of the encoder is an m-dimensional latent layer, with m being a hyperparameter. The decoder was designed to be symmetric with the encoder. In training our VAE, the loss function was composed of a reconstruction term, which makes the encoding–decoding scheme efficient, and a regularization term, which makes the latent space regular. The first term is specified as the mean squared error and the second term is the Kullback–Leibler (KL) divergence [35], which measures the difference between two distributions. The loss function is then formulated as

VAE_loss = ||x − d(z)||^2 + KL[N(μ_x, σ_x), N(0, 1)]    (1)

where N(μ_x, σ_x) and N(0, 1) are the learned latent distribution and the standard normal distribution, respectively.
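Loss (1) has a standard closed form when the encoder outputs a mean and a log-variance. The following TensorFlow sketch is our generic formulation under that assumption, not the authors' released code (available in the VAE_CNN_ETCBinder repository cited below):

```python
import tensorflow as tf

def reparameterize(z_mean, z_log_var):
    """Draw z = mu + sigma * eps so gradients can flow through the encoder."""
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

def vae_loss(x, x_decoded, z_mean, z_log_var):
    """Equation (1): squared reconstruction error plus the closed-form KL
    divergence between N(mu_x, sigma_x) and N(0, 1) for a diagonal Gaussian."""
    reconstruction = tf.reduce_sum(tf.square(x - x_decoded), axis=-1)
    kl = -0.5 * tf.reduce_sum(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
    return tf.reduce_mean(reconstruction + kl)

# The weighted variant explored later in the paper (Equation 2) simply
# rebalances the two terms: alpha * reconstruction + (1 - alpha) * kl.
```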
transform samples of 300 dimensions into samples of 2 dimen-
sions and plotted them in a two-dimensional map. Supplemen-
tary Figures S3 and S4, available online at http://bib.oxfordjou
where N(μx, σ x) and N(0,1) are the learning latent distribution
rnals.org/, in the Supplementary data present the projection
and standard normal distribution, respectively.
of original positive samples and generated positive samples
from VAE for FAD- and FMN-binding datasets, respectively. We
Convolutional neural network structure can observe from the figures that the distribution of generated
positive samples is similar to those of real positive samples.
Our 1D convolutional neural networks (1DCNN) serve as
This means that the generated samples bear the characteristics
the classifiers with the input being the balanced datasets
of the original data and can be used efficiently to balance the
during the training processes. In general, the 1DCNN scheme
datasets.
is Input-(1DConvolution-Dropout-MaxPooling-) × k-Flatten-
FullyConnected-Dropout-FullyConnected-Softmax with k being
learnable and less than 4. We also used the LeakyReLu
activation function in all layers except the final layer. The
Hyperparameter tuning processes
Softmax layer contains two nodes for binding and nonbinding As our process involves two different deep neural networks, a
classes. In training the CNN classifiers, we used the binary VAE for generating new samples and a CNN for classification,
cross-entropy loss. much effort was put into choosing and tuning hyperparameters
6 Nguyen et al.
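The projection step itself is compact in code. Here is a sketch using scikit-learn's t-SNE, with random arrays standing in for the real and VAE-generated positives (the actual experiment used the 300-dimensional flattened windows):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project real and generated positives (flattened 15 x 20 = 300-D) to 2-D,
# as done for Supplementary Figures S3 and S4. Random data stands in here.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 300))
generated = rng.normal(loc=0.1, size=(200, 300))

embedding = TSNE(n_components=2, random_state=0).fit_transform(
    np.vstack([real, generated]))
plt.scatter(*embedding[:200].T, s=8, label="real positives")
plt.scatter(*embedding[200:].T, s=8, label="VAE-generated positives")
plt.legend()
plt.show()
```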


Figure 4. (A) G-mean scores on an FAD-binding independent test over different latent vector dimensions. (B) G-mean scores on an FMN-binding independent test over
different latent vector dimensions.

Hyperparameter tuning processes

As our process involves two different deep neural networks, a VAE for generating new samples and a CNN for classification, much effort was put into choosing and tuning the hyperparameters of the two networks. However, both networks employ convolutional neural networks as their components. Basically, we used the training data for training and tuning the hyperparameters and the independent test data for evaluating and comparing the optimal learned models. We tried different architectures (described above), and a grid search over the hyperparameters was performed for each architecture. For the VAEs, we considered three hyperparameters, namely, the number of epochs, the batch size and the learning rate. For the CNNs, apart from the hyperparameters used in the VAEs, we also considered dropout values and weight decays. The optimization algorithm for both neural networks was fixed as the Adam optimizer [44]. Moreover, we present in the Supplementary data the search ranges for all hyperparameters, the best hyperparameter sets (Supplementary Tables S3 and S4, available online at http://bib.oxfordjournals.org/) and the best architectures for the variational autoencoder and the 1DCNN-based classifier (Supplementary Tables S5 and S6, available online at http://bib.oxfordjournals.org/).
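Tying the classifier description and the tuning together, here is a minimal Keras sketch of the 1DCNN scheme with the tunable pieces (k, dropout, learning rate) exposed as arguments. The filter counts and dense width are illustrative assumptions, not the tuned values reported in Supplementary Tables S5 and S6:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_1dcnn(window_size=15, k=2, dropout=0.3, learning_rate=1e-3):
    """1DCNN scheme from 'Convolutional neural network structure':
    Input-(1DConvolution-Dropout-MaxPooling-)xk-Flatten-FullyConnected-
    Dropout-FullyConnected-Softmax, with LeakyReLU activations.
    Filter counts and dense width are illustrative assumptions."""
    inputs = keras.Input(shape=(window_size, 20))       # one PSSM window per sample
    x = inputs
    for _ in range(k):                                  # k is tuned and kept below 4
        x = layers.Conv1D(64, kernel_size=3, padding="same")(x)
        x = layers.LeakyReLU()(x)
        x = layers.Dropout(dropout)(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128)(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(2, activation="softmax")(x)  # binding / nonbinding
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="binary_crossentropy",           # one-hot two-class labels
                  metrics=["accuracy"])
    return model
```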
Figure 5. G-mean scores on FAD- and FMN-binding independent tests over different distributions of reconstruction loss and latent loss.
Effectiveness of varying sample sizes in the latent space

After obtaining the optimal models, we were interested in analyzing the effect of varying the vector dimensions of the samples in Z on the prediction of unseen samples using the independent datasets. We constructed the VAE with various dimensions, denoted as dim(z), from 2 to 64. For each dimension, we repeated the prediction experiments 10 times and calculated the average, maximum, minimum and standard deviation of the G-mean performance. (Please see Figure 4A and B for the FAD- and FMN-binding site predictions on the independent datasets, respectively.) From these figures, we observe that the FAD-binding site prediction model achieves the highest maximum and highest average G-mean when dim(z) is equal to 16. Regarding FMN-binding site prediction, the models with dim(z) = 16 also obtained the highest average score and the lowest standard deviation. Considering the robustness of the models, we chose the models with the highest average G-means (dim(z) = 16 in both cases) for the next analysis.

Effectiveness of various contributions of reconstruction loss and latent loss

In this analysis, we varied the contributions of the reconstruction loss and the latent loss to the total loss of the VAE models. We specified the total loss of the VAE as
Total loss = α ∗ reconstruction_loss + (1 − α) ∗ latent_loss    (2)
with α chosen from the list 0.1, 0.2, 0.3, 0.4 and 0.5. The best models from the previous analysis were chosen for this analysis on the independent datasets. We report the best G-mean scores over five experiments for each value of α in Figure 5, where it can be seen that the distributions of reconstruction loss and latent loss have an impact on the prediction performance. The G-mean scores range from 0.72 to 0.85 for FAD-binding site prediction and from 0.67 to 0.71 for FMN-binding site prediction. However, for both datasets, the trends are not clear.

Performance comparison with other data imbalance handling techniques

In this section, we report the results of our analysis of the effectiveness of using a variational autoencoder for handling data imbalances. We employed a variety of data imbalance handling algorithms, namely SMOTE, BorderlineSMOTE, ADASYN and class weight adjustment, to generate balanced datasets and classified them with the convolutional neural networks. We also carried out experiments where no data imbalance handling technique was used, and we again searched for the optimal hyperparameters for the convolutional neural networks. To make our analysis more robust and reliable, we repeated each experiment 10 times and averaged the results. Table 1 displays the performance comparison for the independent test, with the best sensitivity and MCC values highlighted in bold. All standard deviation (STD) values are also presented.

From Table 1, we observe that the combination of VAE and CNN yields the best G-mean, sensitivity and MCC performance on both the FAD and FMN independent tests. For the FAD dataset, our model obtained the best G-mean (0.72), which is far better than that of the second-best technique (CNN + Classweight, with G-mean = 0.64). For the FMN dataset, our method is also better than the best competing approach (CNN + BorderlineSMOTE, with G-mean = 0.68). The reason for the smaller improvement may be the smaller number of samples in the FMN training dataset compared to the FAD training dataset.

Performance comparison with traditional machine learning algorithms

We further compared the effectiveness of our method with several traditional machine learning algorithms, namely, Random Forest, Support Vector Machine, XGBoost and AdaBoost (previously called AdaBoost.M1). For each of these algorithms, we searched for the best hyperparameters as well as the best data imbalance handling algorithm using a grid search and recorded the best G-mean performance. Similar to the experiments reported in Table 1, we repeated each experiment 10 times and averaged the best results. Table 2 shows the independent test performance on five standard metrics for the FAD and FMN datasets, with the highest scores highlighted in bold. Our approach, with G-mean values of 0.76 and 0.73 and MCC values of 0.48 and 0.51 on the FAD and FMN independent test data, respectively, outperforms all surveyed algorithms in predicting FAD- and FMN-binding sites. Furthermore, we obtained sensitivity increases of 6–15% on FAD and 17–21% on FMN while maintaining similar levels of accuracy and specificity. However, we found that the results of the traditional machine learning algorithms are more stable, with standard deviations equal to 0. We suspect that the reason is the much greater number of parameters in the VAEs and CNNs compared to those of the traditional machine learning algorithms.

Table 1. Performance comparison among different data imbalance handling techniques on independent tests. (Each experiment was repeated
10 times and average results are reported in format: m ± d, where m is the average and d is the standard deviation across the 10 experiments)

FAD



Model Acc Spec Sen MCC G-mean

CNN+ Adasyn 0.98 ± 0.0 1.00 ± 0.0 0.40 ± 0.09 0.52 ± 0.04 0.62 ± 0.07
CNN+ Smote 0.98 ± 0.0 1.00 ± 0.0 0.36 ± 0.07 0.49 ± 0.03 0.60 ± 0.06
CNN+ BorderlineSmote 0.98 ± 0.0 1.00 ± 0.0 0.40 ± 0.06 0.53 ± 0.04 0.63 ± 0.05
CNN+ Classweight 0.95 ± 0.01 0.96 ± 0.01 0.44 ± 0.11 0.26 ± 0.07 0.64 ± 0.08
CNN+ NoBalance 0.98 ± 0.0 1.00 ± 0.0 0.37 ± 0.08 0.53 ± 0.06 0.61 ± 0.06
CNN+ VAE (Proposed model) 0.98 ± 0.01 0.99 ± 0.01 0.52 ± 0.01 0.48 ± 0.05 0.72 ± 0.07
FMN

Acc Spec Sen MCC G-mean


CNN+ Adasyn 0.88 ± 0.01 0.95 ± 0.01 0.48 ± 0.06 0.47 ± 0.04 0.67 ± 0.04
CNN+ Smote 0.88 ± 0.01 0.95 ± 0.01 0.48 ± 0.06 0.47 ± 0.04 0.67 ± 0.04
CNN+ BorderlineSmote 0.88 ± 0.01 0.94 ± 0.01 0.50 ± 0.08 0.48 ± 0.06 0.68 ± 0.06
CNN+ Classweight 0.84 ± 0.02 0.93 ± 0.03 0.31 ± 0.09 0.27 ± 0.04 0.53 ± 0.07
CNN+ NoBalance 0.86 ± 0.01 0.95 ± 0.01 0.28 ± 0.05 0.29 ± 0.04 0.51 ± 0.04
CNN+ VAE (Proposed model) 0.88 ± 0.01 0.94 ± 0.01 0.52 ± 0.05 0.49 ± 0.04 0.70 ± 0.03
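The sampling baselines in Table 1 are available in the imbalanced-learn package. The following minimal sketch shows how such balanced datasets can be produced, with toy arrays standing in for the flattened PSSM windows:

```python
from collections import Counter
import numpy as np
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

# Toy imbalanced data standing in for flattened 15 x 20 PSSM windows
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))
y = np.array([0] * 950 + [1] * 50)

for sampler in (SMOTE(), BorderlineSMOTE(), ADASYN()):
    X_res, y_res = sampler.fit_resample(X, y)   # oversample the minority class
    print(type(sampler).__name__, Counter(y_res))
```

Class weight adjustment, by contrast, leaves the data unchanged and reweights the loss inside the classifier (e.g. the class_weight argument in Keras or scikit-learn).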

Table 2. Comparison between the proposed method and traditional machine learning algorithms for FAD- and FMN-binding site prediction (Each
experiment was repeated 10 times and average results are reported in format: m ± d, where m is the average and d is the standard deviation
across the 10 experiments)

FAD

Predictor Acc Spec Sen MCC G-mean

Random Forest 0.98 ± 0.0 0.99 ± 0.0 0.44 ± 0.0 0.43 ± 0.0 0.66 ± 0.0
XGBoost 0.98 ± 0.0 0.99 ± 0.0 0.42 ± 0.0 0.43 ± 0.0 0.65 ± 0.0
SVM 0.98 ± 0.0 0.99 ± 0.0 0.53 ± 0.0 0.46 ± 0.0 0.72 ± 0.0
AdaBoost 0.89 ± 0.0 0.90 ± 0.0 0.56 ± 0.0 0.20 ± 0.0 0.71 ± 0.0
Proposed model 0.97 ± 0.01 0.98 ± 0.01 0.59 ± 0.1 0.48 ± 0.06 0.76 ± 0.06
FMN
Random Forest 0.84 ± 0.0 0.92 ± 0.0 0.36 ± 0.0 0.29 ± 0.0 0.57 ± 0.0
XGBoost 0.85 ± 0.0 0.93 ± 0.0 0.40 ± 0.0 0.36 ± 0.0 0.61 ± 0.0
SVM 0.69 ± 0.0 0.69 ± 0.0 0.69 ± 0.0 0.28 ± 0.0 0.69 ± 0.0
AdaBoost 0.66 ± 0.0 0.67 ± 0.0 0.64 ± 0.0 0.22 ± 0.0 0.65 ± 0.0
Proposed model 0.88 ± 0.01 0.94 ± 0.01 0.57 ± 0.03 0.51 ± 0.02 0.73 ± 0.02

(Optimal hyperparameters: FAD: Random Forest: num features = 50, num trees = 500, imbalance data handling algorithm (IDHA) = BorderlineSMOTE; XGBoost: learning
rate = 0.1, n_estimators =100, max_depth = 10, early_stopping_rounds = 10, IDHA = Adasyn; SVM: c = 0.01, g = 0.01, IDHA = BorderlineSMOTE; AdaBoost: num features = 500,
IDHA = BorderlineSMOTE; FMN: Random Forest: num features = 100, num trees = 200, IDHA = BorderlineSMOTE or SMOTE; XGBoost: learning rate = 0.1, n_estimators
=1000, max_depth = 5, early_stopping_rounds = 10, IDHA = Adasyn; SVM: c = 0.001, g = 0.01, IDHA = Adasyn; AdaBoost: num features = 100, IDHA = Adasyn).
The highest scores for each metric are highlighted in bold.
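For reference, the G-mean reported in Tables 1 and 2 is the geometric mean of sensitivity and specificity. A minimal sketch:

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (minority-class accuracy) and
    specificity (majority-class accuracy), the tuning metric used above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return np.sqrt(sensitivity * specificity)

# Check against Table 2's FMN row: sensitivity 0.57, specificity 0.94
print(round(np.sqrt(0.57 * 0.94), 2))  # 0.73, matching the reported G-mean
```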

Comparison with existing FAD- and FMN-binding site predictors

Supplementary Table S7, available online at http://bib.oxfordjournals.org/, in the Supplementary data lists previous studies that predicted FAD- and FMN-binding sites in electron transport proteins and in general proteins. To make a fair comparison with respect to prediction performance, we need to use either the same datasets or the same methods. However, the same table shows that some of the datasets and working prediction servers were unavailable when we performed the comparison. Therefore, we inputted our independent datasets into FADPred by Mishra et al. [45] and LPIcom by Singh et al. [46] and report the performance comparison in Supplementary Table S8, available online at http://bib.oxfordjournals.org/. We found that our approach outperformed the previous methods in almost all metrics.

Reproducing the results

To help reproduce the results, we have freely provided the VAE_CNN_ETCBinder source code and data at https://github.com/khucnam/VAE_CNN_ETCBinder. Using our source code, biologists and other researchers can rebuild the model on their own local machines for prediction. In the near future, we hope to provide users lacking programming skills with a web-based prediction server.

Conclusion
In this study, we integrated a novel data imbalance handling technique based on a VAE with a CNN to improve the identification of binding sites in electron transport proteins. Using variational autoencoders and deep neural networks, we were able to generate new samples that hold important minority class characteristics, thanks to the strong learning ability of the VAE. This helped us create a high-quality, balanced dataset for training the model. Besides the variational autoencoders, the convolutional neural network employed for the classification tasks performed well on our generated data. We validated our model in a rigorous manner and found consistent and significant improvements over all competing approaches for handling data imbalances. We also greatly improved the ability to identify binding sites (by 6–21%) while obtaining a similar or higher level of accuracy and specificity. Finally, our models show better results than existing predictors for the same binding types, which further demonstrates the effectiveness of our method. As the input to our VAEs is position-specific scoring matrices, this method can be applied to other protein prediction problems with imbalanced data. In order to obtain good performance with the neural networks, the basic principles for tuning network hyperparameters should be followed (i.e. random or grid search over hyperparameters, using a learning rate finder or learning rate schedule, using Keras Tuner and applying early stopping) with the aim of reducing the reconstruction loss and the latent loss.

Key Points

• Less than 30% of deep learning–based models for ligand-binding site prediction address the imbalanced nature of the problem.
• Existing algorithms for handling imbalanced data only perform linear combinations of minority class samples.
• This study proposes a new approach for producing minority class samples by using a variational autoencoder, assuming that the added nonlinearity can be fully exploited by deep neural networks for prediction tasks.
• The effectiveness has been proved by comparison with conventional data imbalance techniques, traditional classification algorithms and existing predictors on the same binding types.

Supplementary data

Supplementary data are available online at Briefings in Bioinformatics.

Data availability

The surveyed datasets can be downloaded at https://github.com/khucnam/VAE_CNN_ETCBinder.

Acknowledgment

We would like to thank Ha Phong Nguyen, a Ph.D. candidate at Oulu University, for his helpful comments about our methodology.

Funding

Ministry of Science and Technology, Taiwan, R.O.C. (MOST 109-2221-E-155-045, MOST 109-2811-E-155-505).

References

1. Yang J, Roy A, Zhang Y. BioLiP: a semi-manually curated database for biologically relevant ligand–protein interactions. Nucleic Acids Res 2012;41(D1):D1096–103.
2. Lin W-J, Chen JJ. Class-imbalanced classifiers for high-dimensional data. Brief Bioinform 2013;14(1):13–26.
3. Chawla NV, Bowyer KW, Hall LO, et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002;16:321–57.
4. He H, Bai Y, Garcia EA, et al. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). Hong Kong, China: IEEE, 2008.
5. Han H, Wang W-Y, Mao B-H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: International Conference on Intelligent Computing. Hefei, China: Springer, 2005.
6. Wang Y, Simon M, Bonde P, et al. Prognosis of right ventricular failure in patients with left ventricular assist device based on decision tree with SMOTE. IEEE Trans Inf Technol Biomed 2012;16(3):383–90.
7. Nakamura M, Kajiwara Y, Otsuka A, et al. LVQ-SMOTE: learning vector quantization based synthetic minority over-sampling technique for biomedical data. BioData Min 2013;6(1):1–10.
8. Zeng M, Zou B, Wei F, et al. Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data. In: 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS). Chongqing, China: IEEE, 2016.
9. Mirza S, Mittal S, Zaman M. Decision support predictive model for prognosis of diabetes using SMOTE and decision tree. Int J Appl Eng Res 2018;13(11):9277–82.
10. Ma L, Fan S. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinformatics 2017;18(1):1–18.
11. Ishwaran H, O'Brien R. Commentary: the problem of class imbalance in biomedical data. J Thorac Cardiovasc Surg 2020;1:2.
12. Gao R, Peng J, Nguyen L, et al. Classification of non-tumorous facial pigmentation disorders using deep learning and SMOTE. In: 2019 IEEE International Symposium on Circuits and Systems (ISCAS). Sapporo, Hokkaido, Japan: IEEE, 2019.
13. Xu Z, Shen D, Nie T, et al. A hybrid sampling algorithm combining M-SMOTE and ENN based on Random Forest for medical imbalanced data. J Biomed Inform 2020;107:103465.
14. Wang K-J, Makond B, Chen K-H, et al. A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. Appl Soft Comput 2014;20:15–24.
15. Abraham B, Nair MS. Computer-aided diagnosis of clinically significant prostate cancer from MRI images using sparse autoencoder and random forest classifier. Biocybern Biomed Eng 2018;38(3):733–44.
16. Kurniawati YE, Permanasari AE, Fauziati S. Adaptive synthetic-nominal (ADASYN-N) and adaptive synthetic-KNN (ADASYN-KNN) for multiclass imbalance learning on laboratory test data. In: 2018 4th International Conference on Science and Technology (ICST). Yogyakarta, Indonesia: IEEE, 2018.
17. Xie C, Du R, Ho JWK, et al. Effect of machine learning re-sampling techniques for imbalanced datasets in 18F-FDG PET-based radiomics model on prognostication performance in cohorts of head and neck cancer patients. Eur J Nucl Med Mol Imaging 2020;47(12):2826–35.
18. Molinari F, Raghavendra U, Gudigar A, et al. An efficient data mining framework for the characterization of symptomatic and asymptomatic carotid plaque using bidimensional empirical mode decomposition technique. Med Biol Eng Comput 2018;56(9):1579–93.
19. Gulrajani I, Kumar K, Ahmed F, et al. PixelVAE: a latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
20. Fraccaro M, Sønderby SK, Paquet U, et al. Sequential neural models with stochastic layers. arXiv preprint arXiv:1605.07571, 2016.
21. Liu D, Liu G. A transformer-based variational autoencoder for sentence generation. In: 2019 International Joint Conference on Neural Networks (IJCNN). Budapest, Hungary: IEEE, 2019.
22. Blaschke T, Olivecrona M, Engkvist O, et al. Application of generative autoencoder in de novo molecular design. Mol Inform 2018;37(1–2):1700123.
23. Mochel F, Haller RG. Energy deficit in Huntington disease: why it matters. J Clin Invest 2011;121(2):493–9.
24. Ritov VB, Menshikova EV, Azuma K, et al. Deficiency of electron transport chain in human skeletal muscle mitochondria in type 2 diabetes mellitus and obesity. Am J Physiol Endocrinol Metab 2010;298(1):E49–58.
25. Barile M, Giancaspero TA, Brizio C, et al. Biosynthesis of flavin cofactors in man: implications in health and disease. Curr Pharm Des 2013;19(14):2649–75.
26. Lienhart W-D, Gudipati V, Macheroux P. The human flavoproteome. Arch Biochem Biophys 2013;535(2):150–62.
27. Liu Y, Fiskum G, Schubert D. Generation of reactive oxygen species by the mitochondrial electron transport chain. J Neurochem 2002;80(5):780–7.
28. Gonzalez-Cabo P, Ros S, Palau F. Flavin adenine dinucleotide rescues the phenotype of frataxin deficiency. PLoS One 2010;5(1):e8872.
29. Vicens Q, Mondragón E, Reyes FE, et al. Structure–activity relationship of flavin analogues that target the flavin mononucleotide riboswitch. ACS Chem Biol 2018;13(10):2908–19.
30. Kuppuraj G, Kruise D, Yura K. Conformational behavior of flavin adenine dinucleotide: conserved stereochemistry in bound and free states. J Phys Chem B 2014;118(47):13486–97.
31. Apweiler R, Bairoch A, Wu CH, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2004;32(suppl_1):D115–9.
32. Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25(17):3389–402.
33. Chen K, Mizianty MJ, Kurgan L. Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors. Bioinformatics 2012;28(3):331–41.
34. Chen K, Mizianty MJ, Kurgan L. ATPsite: sequence-based prediction of ATP-binding residues. In: Proteome Science. Hong Kong, China: BioMed Central, 2011.
35. Lovric M. International Encyclopedia of Statistical Science. Berlin, Heidelberg: Springer, 2011.
36. Wang S, Yao X. Using class imbalance learning for software defect prediction. IEEE Trans Reliab 2013;62(2):434–43.
37. Tang Y, Zhang Y-Q, Chawla NV, et al. SVMs modeling for highly imbalanced classification. IEEE Trans Syst Man Cybern B Cybern 2008;39(1):281–8.
38. Gong J, Kim H. RHSBoost: improving classification performance in imbalance data. Comput Stat Data Anal 2017;111:1–13.
39. Guo H, Liu H, Wu C, et al. Logistic discrimination based on G-mean and F-measure for imbalanced problem. J Intell Fuzzy Syst 2016;31(3):1155–66.
40. Aurelio YS, de Almeida GM, de Castro CL, et al. Learning from imbalanced data sets with weighted cross-entropy function. Neural Process Lett 2019;50(2):1937–49.
41. Liu X-Y, Wu J, Zhou Z-H. Exploratory undersampling for class-imbalance learning. IEEE Trans Syst Man Cybern B (Cybern) 2008;39(2):539–50.
42. Oh S-H. Error back-propagation algorithm for classification of imbalanced data. Neurocomputing 2011;74(6):1058–61.
43. Wang S, Minku LL, Yao X. Dealing with multiple classes in online class imbalance learning. In: IJCAI. New York, NY, USA, 2016.
44. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
45. Mishra NK, Raghava GP. Prediction of FAD interacting residues in a protein from its primary sequence using evolutionary information. BMC Bioinformatics 2010;11(1):1–6.
46. Singh H, Srivastava HK, Raghava GP. A web server for analysis, comparison and prediction of protein ligand binding sites. Biol Direct 2016;11(1):1–14.