doi: 10.1093/bioinformatics/bty302
Advance Access Publication Date: 14 April 2018
Original Paper
Sequence analysis
Abstract
Motivation: Efflux proteins play a key role in pumping xenobiotics out of cells. Predicting the
efflux family proteins involved in the transport of compounds is crucial for understanding
family structures, functions and energy dependencies. Many methods have been proposed to
classify efflux pump transporters without considering the pump-specific characteristics of efflux
protein families. In other words, efflux proteins protect cells by extruding foreign chemicals.
Moreover, almost all efflux protein families share the same structure based on the analysis of
significant motifs. Motif sequences consisting of the same number of residues have a high degree
of residue similarity and thus affect the classification process. Consequently, it is challenging but
vital to recognize the structures and determine the energy dependencies of efflux protein families.
To efficiently identify efflux protein families while taking pump specificity into account, we
developed a 2D convolutional neural network (2D CNN) model called DeepEfflux. DeepEfflux
attempts to capture the motifs of sequences around hidden target residues and use them as hidden
features of families. In addition, the 2D CNN model uses a position-specific scoring matrix (PSSM)
as input. Three different datasets, one for each family of efflux protein, were fed into DeepEfflux,
and 5-fold cross-validation was used to evaluate the training performance.
Results: The model evaluation results show that DeepEfflux outperforms traditional machine
learning algorithms. Furthermore, accuracies of 96.02%, 94.89% and 90.34% for classes A, B and C,
respectively, on the independent test show that our model performs well and can be used
as a reliable tool for identifying families of efflux proteins in transporters.
Availability and implementation: The online version of DeepEfflux is available at
http://deepefflux.irit.fr. The source code of DeepEfflux is available both on the DeepEfflux
website and at http://140.138.155.216/deepefflux/.
Contact: yien@saturn.yzu.edu.tw
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
S.W.Taju et al.
membrane proteins are layers acting as barriers against undesirable compounds from outside the
cells (Ranaweera et al., 2015). Although residing in membrane proteins, efflux proteins have
specific biological functions. Active efflux transport proteins skip membrane barriers and
transport waste substances through a specific efflux pump. In general, efflux proteins are
classified into five different families that represent five different pumps of efflux transporters.
They are the major facilitator superfamily (MFS) (Pao et al., 1998; Yan, 2013; Ranaweera et al.,
2015), the ATP-binding cassette superfamily (ABC) (Schneider and Hunke, 1998), the small
multidrug resistance family (SMR) (Chung and Saier Jr, 2001), the resistance–nodulation–cell
division superfamily (RND) (Nikaido and Takatsuka, 2009) and the multiantimicrobial extrusion
protein family (MATE) (Kuroda and Tsuchiya, 2009). Since efflux proteins

represents each pump specific of efflux proteins. In this study, we propose an approach that
combines a 2D CNN model with PSSM profiles as a reliable tool to find the hidden features of the
dataset.
As shown in a series of recent publications (Chen et al., 2013; Lin et al., 2014; Liu et al., 2016;
Chen et al., 2016; Jia et al., 2016; Liu et al., 2015; Cheng et al., 2017; Feng et al., 2017; Liu et al.,
2017), to develop a really useful sequence-based statistical predictor for a biological system, one
should observe the five-step rule (Chou, 2011); i.e. making the following five steps very clear:
(i) how to construct or select a valid benchmark dataset to train and test the predictor; (ii) how to
formulate the biological sequence samples with an effective mathematical expression that can
truly reflect their intrinsic correlation with the target to be predicted; (iii) how to introduce or
develop a powerful algorithm (or engine) to operate the prediction;
Fig. 1. Efflux protein families in the cell membrane. Efflux proteins are active transporters and localized in the cytoplasmic membrane of all kinds of cells. There
No. (Super) family Original data Identity < 20% Training data Testing data Name
electron transport protein identification (Le et al., 2017), and their results show significant
improvements. Therefore, in this study, PSI-BLAST (Altschul et al., 1997) and the nonredundant
protein database were used to generate PSSM profiles from our FASTA files of efflux proteins.
Supplementary Figure S1 shows the way we generated

flattening layer to flatten the input. Our first fully connected layer contains 500 hidden nodes.
After this step, a sigmoid activation function was used to decide whether each neuron can be
activated or not. Subsequently, we applied another fully connected layer with two hidden nodes
for binary classification of the DeepEfflux model. For
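
The classification head described above (a flattening step, a 500-node fully connected layer with sigmoid activation, then a two-node fully connected layer) can be sketched in plain Python. This is an illustration only, not the authors' implementation: the 4 x 20 feature-map size, the random weight initialization, the helper names and the final softmax are all assumptions that the text does not state.

```python
import math
import random


def flatten(matrix):
    """Turn a 2D feature map (list of rows) into a flat vector."""
    return [value for row in matrix for value in row]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    # Assumption: the text does not name the output activation.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Hypothetical initializer: small uniform weights, zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def classifier_head(feature_map, params):
    """flatten -> dense(500) + sigmoid -> dense(2) -> softmax."""
    (w1, b1), (w2, b2) = params
    hidden = sigmoid(dense(flatten(feature_map), w1, b1))
    return softmax(dense(hidden, w2, b2))


rng = random.Random(0)
# Hypothetical 4 x 20 feature map standing in for the convolutional output.
feature_map = [[rng.random() for _ in range(20)] for _ in range(4)]
params = (init_layer(rng, 500, 4 * 20), init_layer(rng, 2, 500))
probs = classifier_head(feature_map, params)
print(probs)  # two class probabilities that sum to 1
```

With untrained random weights the two probabilities carry no meaning; the sketch only shows how the layer sizes named in the text fit together.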
acids are dominant in classes A, B, and C. Regarding bigrams and trigrams, bigram LL and trigram
LLL are more abundant in the protein sequences under study. These residues are hydrophobic and
polar in nature. Supplementary Figure S3 shows unigram, bigram, trigram, four-gram and five-gram
word cloud (left to right) created from efflux protein motif sequences. From the figure, we can see
the n-gram with highest frequencies of each family.

3.2 Different optimizers with PSSM features
Table 2 shows the predictive performance of DeepEfflux with five different optimizers on PSSM
dataset. Best optimizer results for each class are highlighted in bold.
We can see from the table that by using Adam optimizer, class A can achieve best performance at
92.00%, 96.83%, 95.45% and

Table 2. Performance comparison of CNN model with different optimizers on PSSM data

Class     Optimizer   Sen      Spec     Acc      MCC
Class A   Adam        92.00%   96.83%   95.45%   88.83%
          Adadelta    75.00%   96.21%   90.91%   74.94%
          AdaGrad     93.55%   85.96%   88.64%   76.94%
          RMSProp     92.45%   94.31%   93.75%   85.45%
          SGD         91.23%   89.92%   90.34%   78.93%
Class B   Adam        53.85%   94.48%   91.48%   43.97%
          Adadelta    54.55%   96.36%   93.75%   48.89%
          AdaGrad     68.75%   93.13%   90.91%   53.79%
          RMSProp     45.45%   95.76%   92.61%   39.58%
          SGD         93.75%   83.33%   85.23%   64.47%
Class C   Adam        93.22%   82.76%   89.77%   76.68%
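
The Sen, Spec, Acc and MCC columns of Table 2 are standard binary-classification metrics. As a minimal sketch, assuming the usual confusion-matrix definitions (the paper's own formulas do not appear in this excerpt), they can be computed as shown below. The counts in the usage lines are illustrative and are not reported in the source; they were chosen because they reproduce the class A / Adam row.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and MCC as fractions."""
    def ratio(num, den):
        # Returning 0.0 for an empty denominator is a convention chosen here,
        # not something defined in the source.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sen": ratio(tp, tp + fn),                 # true positive rate
        "spec": ratio(tn, tn + fp),                # true negative rate
        "acc": ratio(tp + tn, tp + fp + tn + fn),  # overall accuracy
        "mcc": ratio(tp * tn - fp * fn, mcc_den),  # Matthews correlation
    }


# Illustrative counts (NOT from the paper): 50 positives, 126 negatives.
metrics = binary_metrics(tp=46, fp=4, tn=122, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

For these counts the script prints 92.00%, 96.83%, 95.45% and 88.83%. MCC ranges from -1 to 1, and the table reports it on a percentage scale like the other three metrics.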
The 2D CNN is an important type of deep learning model consisting of convolution kernels
(filters) that act as a set of motif detectors learned from the data. By utilizing a 2D CNN, we
tried to capture meaningful motif features and even hidden features by scanning sequences
around the target residue. The result of the independent test shows that the performance of our
prediction model exceeds that of traditional machine learning algorithms, with accuracies of
96.02%, 94.89% and 90.34% for classes A, B and C, respectively. These results were further
enhanced by combining PSSM and AAindex properties. This indicates that DeepEfflux can capture
important motifs of efflux protein sequences in particular, and possibly of other types of protein
in general. For this reason, our proposed model can serve as a reliable tool for predicting efflux
protein families, which helps biologists understand more about their structures as

Table 4. Performance comparison of our model with QuickRBF classifier

Method   Data                     Sen       Spec     Acc      MCC
QuickRBF classifier (5-fold cross validation)
         PSSM Class A             95.14%    96.54%   95.92%   91.88%
         PSSM Class B             94.99%    86.63%   91.13%   82.25%
         PSSM Class C             92.96%    94.27%   93.68%   87.40%
         PSSM+AAindex Class A     97.14%    95.11%   96.18%   92.40%
         PSSM+AAindex Class B     91.60%    91.65%   91.67%   83.36%
         PSSM+AAindex Class C     95.04%    94.77%   94.95%   89.92%
2D CNN model (5-fold cross validation)
         PSSM Class A             99.48%    96.44%   98.03%   96.10%
         PSSM Class B             100.00%   91.02%   97.34%   93.64%
         PSSM Class C             89.90%    99.19%   94.97%   90.14%
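
The idea of a convolution kernel acting as a motif detector that scans a PSSM can be illustrated with a small, dependency-free example. This is a sketch under stated assumptions rather than the DeepEfflux code: the toy matrix has 4 columns instead of the 20 amino-acid columns of a real PSSM, the kernel values are hand-picked rather than learned, and the function name is hypothetical. As in most deep learning frameworks, the operation is a cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(matrix, kernel):
    """Slide `kernel` over `matrix` without padding and sum the products."""
    rows, cols = len(matrix), len(matrix[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for i in range(rows - k_rows + 1):
        out_row = []
        for j in range(cols - k_cols + 1):
            total = 0.0
            for di in range(k_rows):
                for dj in range(k_cols):
                    total += matrix[i + di][j + dj] * kernel[di][dj]
            out_row.append(total)
        output.append(out_row)
    return output


# Toy PSSM-like matrix: 5 sequence positions x 4 residue columns (assumed size).
toy_pssm = [
    [1, 0, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]

# Hand-picked 3 x 4 kernel that responds to a diagonal three-residue pattern.
motif_kernel = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]

response = conv2d_valid(toy_pssm, motif_kernel)
print(response)  # one score per window start position
```

Because the kernel spans the full width of the matrix, the output has one column, with one score per window of three consecutive positions. Here the window starting at position 0 scores 6.0 and the other two score 0.0, so the detector fires only where the pattern occurs.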
Chou,K.C. and Zhang,C.T. (1995) Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol., 30, 275–349.
Chung,Y. and Saier,M. Jr. (2001) SMR-type multidrug resistance pumps. Curr. Opin. Drug Discov. Devel., 4, 237–245.
Dauphin,Y.N. et al. (2015) RMSProp and equilibrated adaptive learning rates for non-convex optimization. CoRR abs/1502.04390.
Dehzangi,A. et al. (2015) Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou's general PseAAC. J. Theor. Biol., 364, 284–294.
Duchi,J. et al. (2011) Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 2121–2159.
DuPont,F. (1992) Salt-induced changes in ion transport: regulation of primary pumps and secondary transporters. In: Cooke,D. and Clarkson,D. (eds) Transport and Receptor Proteins of Plant Membranes. Plenum Press, New York, NY, pp. 91–100.
Feng,P.M. et al. (2013) iHSP-PseRAAAC: identifying the heat shock protein
Meher,P.K. et al. (2017) Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou's general PseAAC. Sci. Rep., 7, 42362.
Nikaido,H. and Takatsuka,Y. (2009) Mechanisms of RND multidrug efflux pumps. Biochim. Biophys. Acta, 1794, 769–781.
Ou,Y.Y. (2005) QuickRBF: a package for efficient radial basis function networks. http://csie.org/yien/quickrbf/.
Ou,Y.-Y. and Shu-An,C. (2009) Using efficient RBF networks to classify transport proteins based on PSSM profiles and biochemical properties. International Work-Conference on Artificial Neural Networks. Springer, Berlin, Heidelberg.
Ou,Y.Y. et al. (2008) TMBETADISC-RBF: discrimination of β-barrel membrane proteins using RBF networks and PSSM profiles. Comput. Biol. Chem., 32, 227–231.
Ou,Y.Y. et al. (2013) Identification of efflux proteins using efficient radial basis function networks with position-specific scoring matrices and bio-