Professional Documents
Culture Documents
Shufu Que1,2,#, Yongfei Wang2,#, Peixiang Chen2, Yu-Rong Tang3, Ziding Zhang3, and
Huaqin He1,2,*
1
Key Laboratory of Ministry of Education for Genetic, Breeding and Multiple Utilization of Crops, Fujian Agriculture
and Forestry University, Fuzhou 350002, China; 2College of Life Science, Fujian Agriculture and Forestry University,
Fuzhou 350002, China; 3State Key Laboratory of AgroBiotechnology, College of Biological Sciences, China Agricul-
tural University, Beijing 100193, China
Abstract: A series of elegant phosphorylation site prediction methods have been developed, which are playing an increas-
ingly important role in accelerating the experimental characterization of phosphorylation sites in phosphoproteins. In this
study, we selected six recently published methods (DISPHOS, NetPhosK, PPSP, KinasePhos, Scansite and PredPhospho)
to evaluate their performance. First, we compiled three testing datasets containing experimentally verified phosphoryla-
tion sites for mammalian, Arabidopsis and rice proteins. Then, we present the prediction performance of the tested meth-
ods on these three independent datasets. Rather than quantitatively ranking the performance of these methods, we focused
on providing an understanding of the overall performance of the predictors. Based on this evaluation, we found the fol-
lowing results: i) current phosphorylation site predictors are not effective for practical use and there is substantial need to
improve phosphorylation site prediction; ii) current predictors perform poorly when used to predict phosphorylation sites
in plant phosphoproteins, suggesting that a rice-specific predictor will be required to obtain confident computational anno-
tation of phosphorylation sites in rice proteomics research; and iii) the tested predictors are complementary to some ex-
tent, implying that establishment of a meta-server might be a promising approach to developing an improved prediction
system.
Keywords: Protein, Phosphorylation site, Prediction, Evaluation.
kinase families in which the number of known phosphoryla- with a quantitative understanding of the overall performance
tion sites is large. Therefore, these kinase-family-specific of the tested predictors. Moreover, we also point out some
prediction systems are inevitably limited to only a few kinase existing problems regarding protein phosphorylation site
families [18]. There are currently two important directions in prediction.
the field of protein phosphorylation site prediction. With the
increase of experimentally validated phosphorylation sites, MATERIALS AND METHODS
the first direction involves development of organism-specific
Datasets
predictors. Development of these predictors has been initi-
ated in some model organisms, specifically yeast [19] and We generated three phosphorylation site datasets consist-
Arabidopsis [20]. The second direction involves combining ing of experimentally verified mammalian, Arabidopsis and
phosphorylation site prediction with a network context of rice phosphoproteins, respectively. All the mammalian phos-
kinases and phosphoproteins. In addition to predicting phoproteins in Phospho.ELM (release version 7.0), which
kinase-family-specific phosphorylation sites, such a method contains 4,026 protein entries covering 16,428 phosphoryla-
would allow for the identification of specific kinases that are tion sites, were downloaded [22]. Three hundred mammalian
responsible in vivo for the predicted phosphorylation events. phosphoproteins were randomly selected from this total set
Existing phosphorylation site predictors play an increas- to compile the initial mammalian phosphoprotein dataset. To
ingly important role in complementing the experimental remove sequence redundancy, the initial mammalian phos-
characterization of protein phosphorylation. Generally, a phoproteins were filtered using a 30% sequence identity fil-
newly developed predictor must be intensively evaluated ter. Briefly, we performed pairwise BLAST searches for any
before its acceptance for publication in a peer-reviewed sequence pairs in the initial mammalian phosphoprotein
journal. Owing to the availability of many phosphorylation dataset. For sequence pairs with sequence identity > 30%,
site datasets, the accuracy of different predictors reported by only one sequence was left in the dataset. Thus, the phos-
the developers may be overestimated, because these methods phoprotein dataset size was reduced to 231 total proteins.
were optimized predominantly for the corresponding training The Arabidopsis and rice phosphorylation site datasets were
and testing data. Moreover, the current protein phosphoryla- obtained from recent literature, the feature table of the Swiss-
tion site predictors are derived primarily from mammalian Prot database (release 42.10) [23], and unpublished rice
phosphorylation site data and they may show lower predic- phosphorylation data from our laboratory. As in the mam-
tion accuracy for proteins from other organisms [19, 20]. To malian dataset, we filtered the Arabidopsis and rice phos-
our knowledge, different protein phosphorylation site predic- phoproteins using a sequence identity of 30% and compiled
tors have not been intensively benchmarked via some inde- them into the testing datasets for Arabidopsis and rice.
pendent datasets in the literature. Sikder and Zomaya (2009) The mammalian dataset consists of a total of 754 phos-
developed an independent dataset, collected from Phos- phoserine (pSer) sites, 135 phosphothreonine (pThr) sites
pho.ELM, to compare the performances of KinasePhos, Net- and 45 phosphotyrosine (pTyr) from 231 mammalian pro-
Phosk, DISPHOS, Scansite and PPSP [21]. However, this teins (Table 1). The Arabidopsis dataset contains a total of
independent dataset contained mainly mammalian proteins. 153 pSer sites, 47 pThr sites and 11 pTyr sites from 98
To explore the current phosphorylation site predictors’ appli- Arabidopsis proteins. The rice dataset consists of a total of
cation to plant proteomes, we decided to quantitatively in- 359 pSer sites, 263 pThr sites and 1 pTyr site from 105 rice
vestigate their performance when applied to Arabidopsis or proteins (Table 1). Since all of these phosphorylation sites
rice phosphoproteins. To this end, we constructed three in- are experimentally verified, they are regarded as (+) sites.
dependent phosphorylation site datasets and evaluated the The Ser, Thr and Tyr residues that are not annotated as phos-
performance of six phosphorylation site predictors on these phorylation sites within the above three datasets are regarded
datasets. We believe that this work will provide scientists
Table 1. The Number of Positive and Negative Phosphorylation Sites in the Three Phosphorylation Site Datasets
as ( ) sites (i.e., non-phosphorylation sites). The number of kinase families. In this evaluation, prediction was made by
( ) sites in each dataset is also listed in Table 1. We have considering all kinase groups and families.
made the three phosphorylation site datasets freely available
at http://210.34.95.200/hlx/, to allow other researchers to Performance Assessment
benchmark their methods against our results.
Four measurements—Sensitivity (Sn), Specificity (Sp),
Accuracy (Ac) and the Matthews correlation coefficient
Six Tested Phosphorylation Site Predictors
(MCC) — were employed to evaluate the performance of the
Each sequence from the three independent datasets was tested predictors (definitions below):
individually run through all six methods (KinasePhos, Net-
TP ,
PhosK, DISPHOS, Scansite, PPSP and PredPhospho) using Sn =
the web site servers provided by the developers. Aside from TP + FN
DISPHOS, all the tested predictors belong to kinase-specific TN ,
programs. To allow for a fair comparison of these six predic- Sp =
TN + FP
tors, we evaluate only whether a candidate site is correctly
predicted. The correctness of the predicted kinase family TP + TN ,
Ac =
information is not considered in our analysis. The algorithms TP + FP + TN + FN
and parameter settings of the evaluated predictors are briefly
described below. and
Table 2. The Overall Predictive Performance for the Phosphorylation Site Datasets
phosphorylated sites, inducing an imbalance of ( ) sites and Phos, PredPhospho, DISPHOS and NetPhosK showed the
(+) sites. Such a phenomenon is further exemplified in the highest MCC on the rice dataset, while the other two predic-
prediction performance of PredPhospho. When tested in tors still demonstrated their highest MCC values on the
mammalian proteins, PredPhospho performed reasonably mammalian dataset. Since the number of pTyr sites in the
well (Sn = 78.88% and Sp = 77.65%). However, the ratio of rice dataset is very limited, we are unable to conclude that
(+) sites to ( ) sites in the mammalian dataset is approxi- these four methods are favorable for pTyr site prediction.
mately 1:32, indicating that the prediction result of Pred-
Phospho indeed contains too many false positives, resulting The Tested Predictors Exhibited Better Performance on
in a relatively low MCC value (0.1971). For a machine Mammalian Proteins than Plant Proteins
learning method-based phosphorylation site predictor, it is
Each predictor exhibited the best performance (i.e., the
important to note that an optimized ratio of (+) sites to ( )
sites in the training dataset can often result in a better predic- highest MCC) on the mammalian dataset and the poorest
performance (i.e., the lowest MCC) on either the rice or
tion model.
Arabidopsis dataset (Table 2). It is likely that this is caused
by the predictors having been derived primarily from mam-
The Prediction Performance of pSer, pThr and pTyr
malian phosphorylation sites. We conclude that the tested
Sites
predictors might not adapt well to predicting plant protein
The tested predictors demonstrated different performance phosphorylation sites. Therefore, a plant-specific predictor
on pSer, pThr and pTyr site predictions (Fig. 1). Generally, may yield better performance when applied to plant proteins
the predictors have similar performance on pSer and pThr than a generic predictor. For instance, a plant-specific pre-
sites and better predictive performance on pTyr site. Because dictor called PhosPhAt has been developed to predict pSer
the number of pTyr sites is limited in all the testing datasets, sites in Arabidopsis. PhosPhAt was benchmarked to reveal
we can not conclude that the prediction of pTyr sites is easier better performance than some generic predictors [20]. With
than the prediction of pSer and pThr sites. In terms of pSer the accumulation of experimentally validated phosphoryla-
and pThr site prediction on the three datasets, all the tested tion sites in rice, it is hoped that a rice-specific predictor will
predictors indicated better performance (MCC) on the mam- be available in the near future. Such a tool will inevitably
malian dataset than on either plant (Arabidopsis and rice) facilitate more reliable computational annotation of phos-
dataset (Fig. 1). However, for pTyr site prediction, Kinase- phorylation sites in rice phosphoproteins.
68 Protein & Peptide Letters, 2010, Vol. 17, No. 1 Que et al.
FUTURE PERSPECTIVES
In statistical predictions, three cross-validation methods
are often used to examine a predictor for its effectiveness in
a practical application: the independent dataset test, the sub-
sampling test, and the jackknife test [25]. However, as dem-
onstrated by Chou and Shen [26, 27], the jackknife test is
deemed the most objective of these three tests and can al-
ways yield a unique result for a given benchmark dataset.
For these reasons, the jackknife test has been increasingly
used by investigators to examine the accuracy of various
predictors [28-35]. However, a feasible way to examine the
quality of the existing web-site predictors is the independent
dataset test, as performed in this study.
In this study, we constructed three phosphorylation site
datasets (i.e., the datasets of mammal, Arabidopsis and rice
proteins) to benchmark the prediction performance of six
phosphorylation site predictors. It is worth mentioning that
the three testing datasets can not be regarded as the gold
standard. In particular, the unreliable (-) sites may impact the
evaluation. Some phosphorylation sites that have not been
experimentally validated are likely to be represented as (-)
sites in this study, contributing to the “noise” of our evalua-
tion. Moreover, the size of these three datasets may not be
large enough to represent the complete sequence space of
phosphoproteins. Despite these shortcomings, this evaluation
provides a quantitative understanding of the overall perform-
ance of current phosphorylation-site predictors.
We expect that both the developers of phosphorylation
site predictors and the interested experimental scientists will
benefit from our findings in this work. We would like to em-
phasize the following findings we have obtained. First, the
current phosphorylation site predictors are not effective for
practical use. In particular, the current predictors often yield
overprediction at the whole-sequence level (i.e., a lower frac-
tion of the predicted sites are actually true positives). There-
fore, reduction of overprediction remains an important direc-
tion for development of new phosphorylation-site predictors.
Second, the current predictors perform less well when they
are applied to predicting phosphorylation sites in Arabidop-
sis and rice phosphoproteins. Therefore, it is important to
develop organism-specific phosphorylation site predictors.
With the accumulated phosphorylation site data available in
many model organisms, such predictors have recently been
Figure 1. The predictive performance (MCC) of the six tested pre-
developed. Finally, we have found that the tested predictors
dictors on pSer (A), pThr (B) and pTyr (C) sites.
are complementary to some extent, suggesting that estab-
lishment of a meta-server may result in better predictive per-
The Complementarity Between Different Predictors
formance.
We found that different predictors are often complemen-
tary to some extent. For example, the prediction performance ACKNOWLEDGEMENTS
of KinasePhos was 98.93% (Sn) and 1.62% (Sp) on the
This study was supported in part by grant 20070019050
mammalian dataset and outperformed Scansite (Sn = 42.08%
of the State Education Ministry, China, the Key Science Pro-
and Sp = 85.08%) with a 56.85% higher sensitivity and an
gram of Fujian (no. 2009N0011), the National Natural Sci-
83.46% lower specificity. This suggested that an effective
ence Foundation of China (no. 30020170), and the Key Pro-
combination of KinasePhos and Scansite may result in a new gram of Ecology, Fujian Province (0608507).
prediction system with increased performance. Indeed, such
an effort has recently been initiated. Wan et al. (2008) used a
Evaluation of Protein Phosphorylation Site Predictors Protein & Peptide Letters, 2010, Vol. 17, No. 1 69
REFERENCES [19] Ingrell, C.R.; Miller, M.L.; Jensen, O.N.; Blom, N. NetPhosYeast:
prediction of protein phosphorylation sites in yeast. Bioinformatics,
[1] Peck, S.C. Early phosphorylation events in biotic stress. Curr. 2007, 23(7), 895-897.
Opin. Plant Biol., 2003, 6(4), 334-338. [20] Heazlewood, J.L.; Durek, P.; Hummel, J.; Selbig, J.; Weckwerth,
[2] Blom, N.; Gammeltoft, S.; Brunak, S. Sequence and structure- W.; Walther, D.; Schulze, W.X. PhosPhAt: a database of
based prediction of eukaryotic protein phosphorylation sites. J. phosphorylation sites in Arabidopsis thaliana and a plant-specific
Mol. Biol., 1999, 294(5), 1351-1362. phosphorylation site predictor. Nucleic Acids Res., 2008,
[3] Chou, K.C.; Watenpaugh, K.D.; Heinrikson, R.L. A model of the 36(Database issue), D1015-1021.
complex between cyclin-dependent kinase 5(Cdk5) and the [21] Sikder, A.R.; Zomaya, A.Y. Analysis of protein phosphorylation
activation domain of neuronal Cdk5 activator. Biochem. Biophys. site predictors with an independent dataset. Int. J. Bioinform. Res.
Res. Commun., 1999, 259, 420-428. Appl., 2009, 5(1), 20-37
[4] Chou, K.C. Review: Structural bioinformatics and its impact to [22] Diella, F.; Cameron, S.; Gemund, C.; Linding, R.; Via, A.; Kuster,
biomedical science. Curr. Med. Chem., 2004, 11, 2105-2134. B.; Sicheritz-Ponten, T.; Blom, N.; Gibson, T.J. Phospho.ELM: a
[5] Hubbard, M.J.; Cohen, P. On target with a new mechanism for the database of experimentally verified phosphorylation sites in
regulation of protein phosphorylation. Trends Biochem Sci., 1993, eukaryotic proteins. BMC Bioinformatics, 2004, 5, 79.
18(5), 172-177. [23] Boutet, E.; Lieberherr, D.; Tognolli, M.; Schneider, M.; Bairoch,
[6] Khan, M.M.; Jan, A.; Karibe, H.; Komatsu, S. Identification of A. UniProtKB/Swiss-Prot. Methods Mol. Biol., 2007, 406, 89-112.
phosphoproteins regulated by gibberellin in rice leaf sheath. Plant [24] Wan, J.; Kang, S.; Tang, C.; Yan, J.; Ren, Y.; Liu, J.; Gao, X.;
Mol. Biol., 2005, 58(1), 27-40. Banerjee, A.; Ellis, L.B.; Li, T. Meta-prediction of phosphorylation
[7] Ficarro, S.B.; McCleland, M.L.; Stukenberg, P.T.; Burke, D.J.; sites with weighted voting and restricted grid search parameter
Ross, M.M.; Shabanowitz, J.; Hunt, D.F.; White, F.M. selection. Nucleic Acids Res., 2008, 36(4), e22.
Phosphoproteome analysis by mass spectrometry and its [25] Chou, K.C.; Zhang, C.T. Review: Prediction of protein structural
application to Saccharomyces cerevisiae. Nat. Biotechnol., 2002, classes. Crit. Rev. Biochem. Mol. Biol., 1995, 30, 275-349.
20(3), 301-305. [26] Chou, K.C.; Shen, H.B. Cell-PLoc: A package of web-servers for
[8] Ballif, B.A.; Villen, J.; Beausoleil, S.A.; Schwartz, D.; Gygi, S.P. predicting subcellular localization of proteins in various organisms.
Phosphoproteomic analysis of the developing mouse brain. Mol. Nat. Protocols, 2008, 3, 153-162.
Cell Proteomics, 2004, 3(11), 1093-1101. [27] Chou, K.C.; Shen, H.B. Review: Recent progresses in protein
[9] Lim, Y.P.; Diong, L.S.; Qi, R.; Druker, B.J.; Epstein, R.J. subcellular location prediction. Analy. Biochem., 2007, 370, 1-16.
Phosphoproteomic fingerprinting of epidermal growth factor [28] Xiao, X.; Chou, K.C. Digital coding of amino acids based on
signaling and anticancer drug action in human tumor cells. Mol. hydrophobic index. Protein Pept. Lett., 2007, 14, 871-875.
Cancer Ther., 2003, 2(12), 1369-1377. [29] Munteanu, C.B.; Gonzalez-Diaz, H.; Magalhaes, A.L. Enzymes/
[10] Nuhse, T.S.; Stensballe, A.; Jensen, O.N.; Peck, S.C. Phospho- non-enzymes classification model complexity based on
proteomics of the Arabidopsis plasma membrane and a new composition, sequence, 3D and topological indices. J. Theor. Biol.,
phosphorylation site database. Plant Cell, 2004, 16(9), 2394-2405. 2008, 254, 476-482.
[11] Hjerrild, M.; Gammeltoft, S. Phosphoproteomics toolbox: compu- [30] Munteanu, C.R.; Gonzalez-Diaz, H.; Borges, F.; de Magalhaes,
tational biology, protein chemistry and mass spectrometry. FEBS A.L. Natural/random protein classification models based on star
Lett., 2006, 580(20), 4764-4770. network topological indices. J. Theor. Biol., 2008, 254, 775-783.
[12] Blom, N.; Sicheritz-Ponten, T.; Gupta, R.; Gammeltoft, S.; Brunak, [31] Rezaei, M.A.; Abdolmaleki, P.; Karami, Z.; Asadabadi, E.B.;
S. Prediction of post-translational glycosylation and phospho- Sherafat, M.A.; Abrishami-Moghaddam, H.; Fadaie, M.;
rylation of proteins from the amino acid sequence. Proteomics, Forouzanfar, M. Prediction of membrane protein types by means of
2004, 4(6), 1633-1649. wavelet analysis and cascaded neural networks. J. Theor. Biol.,
[13] Huang, H.D.; Lee, T.Y.; Tzeng, S.W.; Horng, J.T. KinasePhos: a 2008, 254, 817-820.
web tool for identifying protein kinase-specific phosphorylation [32] Kannan, S.; Hauth, A.M.; Burger, G. Function prediction of
sites. Nucleic Acids Res., 2005, 33(Web Server issue), W226-229. hypothetical proteins without sequence similarity to proteins of
[14] Iakoucheva, L.M.; Radivojac, P.; Brown, C.J.; O'Connor, T.R.; known function. Protein Pept. Lett., 2008, 15, 1107-1116.
Sikes, J.G.; Obradovic, Z.; Dunker, A.K. The importance of [33] Chen, C.; Chen, L.; Zou, X.; Cai, P. Prediction of protein
intrinsic disorder for protein phosphorylation. Nucleic Acids Res., secondary structure content by using the concept of chou's pseudo
2004, 32(3), 1037-1049. amino acid composition and support vector machine. Protein Pept.
[15] Obenauer, J.C.; Cantley, L.C.; Yaffe, M.B. Scansite 2.0: Proteome- Lett., 2009, 16, 27-31.
wide prediction of cell signaling interactions using short sequence [34] Shen, H.B.; Chou, K.C. QuatIdent: A web server for identifying
motifs. Nucleic Acids Res., 2003, 31(13), 3635-3641. protein quaternary structural attribute by fusing functional domain
[16] Xue, Y.; Li, A.; Wang, L.; Feng, H.; Yao, X. PPSP: prediction of and sequential evolution information. J. Proteome Res., 2009, 8,
PK-specific phosphorylation site with Bayesian decision theory. 1577-1584.
BMC Bioinformatics, 2006, 7, 163. [35] Xiao, X.; Wang, P.; Chou, K.C. Predicting protein quaternary
[17] Xue, Y.; Zhou, F.; Zhu, M.; Ahmed, K.; Chen, G.; Yao, X. GPS: a structural attribute by hybridizing functional domain composition
comprehensive www server for phosphorylation sites prediction. and pseudo amino acid composition. J. Appl. Crystallogr., 2009,
Nucleic Acids Res., 2005, 33(Web Server issue), W184-187. 42, 169-173.
[18] Kim, J.H.; Lee, J.; Oh, B.; Kimm, K.; Koh, I. Prediction of
phosphorylation sites using SVMs. Bioinformatics, 2004, 20(17),
3179-3184.
Received: March 05, 2009 Revised: May 23, 2009 Accepted: May 27, 2009