
70 Protein & Peptide Letters, 2012, 19, 70-78

A Novel Sequence-Based Method for Phosphorylation Site Prediction with Feature Selection and Analysis

Zhi-Song He1,*, Xiao-He Shi2, Xiang-Ying Kong2,4, Yu-Bei Zhu3 and Kuo-Chen Chou5

1 CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), 320 Yueyang Road, Shanghai 200031, China
2 Institute of Health Sciences, SIBS, CAS and Shanghai Jiaotong University School of Medicine (SJTUSM), China
3 Department of Chemistry, College of Sciences, Shanghai University, 99 Shang-Da Road, Shanghai 200444, China
4 State Key Laboratory of Medical Genomics, Ruijin Hospital, Shanghai Jiaotong University, 197 Rui Jin Road II, Shanghai 200025, People’s Republic of China
5 Gordon Life Science Institute, 13784 Torrey Del Mar, San Diego, California 92130, USA

Abstract: Phosphorylation is one of the most important post-translational modifications, and identifying protein phosphorylation sites is particularly important for studies of disease diagnosis. However, experimental detection of phosphorylation sites is labor-intensive, so computational methods that provide an additional reference for candidate sites would be beneficial. Here we developed a novel sequence-based method for predicting serine, threonine, and tyrosine phosphorylation sites, with the Nearest Neighbor algorithm as the prediction engine. Peptides with a fixed length of thirteen amino acid residues around the phosphorylation sites were extracted via a sliding window along the protein chains concerned. Each peptide was coded into a vector of 6,072 features derived from the Amino Acid Index (AAIndex) database for classification. Incremental Feature Selection, a feature selection algorithm based on the Maximum Relevancy Minimum Redundancy (mRMR) method, was used to select a compact feature set that further improved classification performance. Three predictors were established for the three types of phosphorylation sites, achieving overall accuracies of 66.64%, 66.11% and 66.69%, respectively. These rates were obtained by rigorous jackknife cross-validation tests.
Keywords: Data mining, phosphorylation, AAIndex, mRMR, machine learning approach.

*Address correspondence to this author at the CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences (SIBS), Chinese Academy of Sciences (CAS), 320 Yueyang Road, Shanghai 200031, China; Tel: 86-21-54920498; Fax: 86-21-54920451; E-mail: jfsamery@gmail.com

1875-5305/12 $58.00+.00 © 2012 Bentham Science Publishers

INTRODUCTION

Phosphorylation is one of the most important post-translational modifications and plays a fundamental role in most cellular regulatory pathways [1-3]. Phosphorylation of a protein is catalyzed by protein kinases (PKs), which exhibit different preferences in the modification of substrate residues. Phosphorylation of the hydroxyl group in the side chains of serine, threonine, and tyrosine plays extensive roles in signal transduction cascades in living cells [4, 5]. Identifying substrates together with their phosphorylation sites in large-scale phosphoproteomes is in demand, because it may aid researchers in disease analysis and drug design [6].

Phosphorylation sites can be identified by experimental and computational methods. In the past few years, much effort was made to analyze bacterial phosphoproteomes with 2D gels followed by mass spectrometric identification of [32P]-labeled spots [7], and with titanium oxide chromatography followed by ion trap-Fourier transform ion cyclotron resonance mass spectrometry [8]. These studies significantly increased the number of bacterial proteins known to be phosphorylated on Ser/Thr/Tyr residues, and they also provide evidence that protein phosphorylation is a general and fundamental regulatory process. In yeast, the use of a proteome chip was reported to provide a first-generation phosphorylation map [9]. High-throughput studies in eukaryotes, such as a recent large-scale analysis of the human phosphoproteome by quantitative mass spectrometry, in which the time courses of more than 6,600 protein phosphorylation sites in HeLa cells were measured in response to epidermal growth factor (EGF) stimulation, enable researchers to study biological systems from a global perspective [10].

However, experimental detection of protein phosphorylation sites is time-consuming and often limited by the availability and optimization of enzymatic reactions. Predicting phosphorylation sites and their specific kinases from primary sequences with computational approaches could help in this regard, because computational methods provide fast, automatic annotations that can in turn serve as a reference and guideline for conducting experiments and for interpreting phosphoproteomic data. In view of this, many in silico algorithms have been proposed. Meanwhile, web servers that let users submit and analyze their own data have also been developed to facilitate biological experiments. For example, consensus sequences, motifs, function modules and specific residues have been adopted in phosphorylation site prediction [11]; other predictors include KinasePhos, based on the profile hidden Markov model [12], PPSP adopting the

Bayesian Discriminant Method [13], PredPhospho and PHOSIDA based on Support Vector Machine (SVM) [14, 15], information-Entropy based Phosphorylation Prediction for the automatic detection of potential phosphorylation sites [16], and NetworKIN, which is a database of predicted kinase-substrate relationships based on the latest human phosphoproteome and protein association network [17].

In computational biology and bioinformatics, machine learning and data mining methods have been widely used by many researchers, who have made great efforts to develop useful algorithms and software for investigating different biological problems such as protein post-translational modification, protein subcellular location, protein-DNA interaction and so on [18-29].

In this study, we developed a novel sequence-based method for predicting phosphorylation sites based on a machine learning approach (NNA, Nearest Neighbor Algorithm) combined with feature selection (IFS and FFS based on mRMR [30]) using an independent dataset. Three independent classifiers were developed for the three types of phosphorylation sites: serine, threonine and tyrosine, respectively. Feature selection was used in our study not only to improve the predictor's performance, but also to analyze the factors affecting the appearance of phosphorylation sites. Finally, we obtained overall success rates of 66.64%, 66.11% and 66.69% for serine, threonine and tyrosine phosphorylation site predictions, respectively. Our analysis shows that some particular positions (including the phosphorylation site) play important roles during the phosphorylation process. It also shows that hydrophobicity and the secondary structure of amino acids in the flanking sequences are important for the phosphorylation process.

MATERIALS AND METHODS

Benchmark Datasets

In this paper, the dataset was built from the UniProtKB/Swiss-Prot database (Release 55.4). How phosphorylated sites are annotated in UniProtKB is described in the UniProt knowledgebase user manual. According to the manual, phosphorylation information is listed in the MOD_RES feature key lines of the FT field. The information contains the name of the phosphorylated residue and its site number. For example, a record of “FT MOD_RES 198 Phosphoserine” means that the 198th residue in the sequence was marked as a phosphorylated site by experiments and that the residue is serine. However, a record carrying a “Potential” or “By similarity” mark was not accepted as a phosphorylated site in this study, because these marks indicate that the result was not proved by experiments but obtained by some sort of deduction. In a sequence, the following residues can be phosphorylated: Serine, Threonine, Tyrosine, Aspartate, Histidine, or Cysteine, and, more rarely, Arginine. In this paper, only serine (S), threonine (T) and tyrosine (Y) were considered, because the other phosphorylated residues were too few to have statistical significance.

At first, 366,226 sequences were extracted from the database with their accession numbers. If a protein had more than one accession number, only the first accession number was accepted. After removing redundancy, 71,637 proteins were obtained, and these were further refined by selecting only those that had the phosphorylated sites concerned. Eventually, we obtained the following three datasets: (1) set-S containing 4,652 proteins with phosphorylated site S; (2) set-T containing 1,817 proteins with phosphorylated site T; (3) set-Y containing 771 proteins with phosphorylated site Y.

From the three protein sets, the corresponding peptide fragment sets were generated according to the following procedure. If the protein set was set-S, all peptides containing S were picked out through a sliding window [31] along each of the protein sequences concerned. Considering the computational complexity, the peptide fragments were composed of 13 residues, with S located at the center and 6 residues upstream and 6 residues downstream of it. Similar operations were performed for T and Y, respectively. Thus, we obtained three peptide fragment datasets, denoted pep-S, pep-T and pep-Y, respectively. The confirmed phosphorylation peptides were assigned as positive samples, while the rest were assigned as negative ones.

Since the negative samples thus obtained far outnumbered the positive samples, the robustness of the prediction model could be affected by the imbalance between the two classes. To avoid this situation, we kept all the positive samples and randomly selected negative samples from the original data set until the number of negative samples was twice that of the positive ones. Finally, we obtained (1) pep-S containing 13,855 positive samples (Online Supporting Information 1A) and 27,710 negative samples (Online Supporting Information 1B), (2) pep-T containing 2,812 positive samples (Online Supporting Information 2A) and 5,624 negative samples (Online Supporting Information 2B), and (3) pep-Y containing 1,259 positive samples (Online Supporting Information 3A) and 2,518 negative samples (Online Supporting Information 3B).

Feature Vector Construction

AAIndex [32, 33] contains hundreds of physicochemical or biological amino acid properties. Each index represents one kind of amino acid property and is presented in the form of a numeric matrix. These features cover different aspects of amino acid characteristics, including alpha and turn propensities, beta propensity, composition, hydrophobicity, physicochemical properties and so on. AAIndex consists of two sections: AAIndex1 for the collection of published amino acid indices and AAIndex2 for the collection of published amino acid mutation matrices. Here the AAIndex1 section was used. Several previous studies have already used AAIndex properties as part of the features they considered when applying machine learning approaches to biological studies [34, 35]. AAIndex release 8.0, containing 562 indices, was used to represent our peptide samples. Of these, 506 were selected, because indices containing missing values or ambiguous annotations were excluded. Hence a 12-residue peptide sample could be encoded by a 506 × 12 = 6072-D (dimensional) vector, as formulated by

$$P = (f_0, f_1, \ldots, f_i, \ldots, f_n) \qquad (i = 0, 1, \ldots, 6071) \tag{1}$$
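As a rough illustration of the sample construction and of Eq. (1), the following Python sketch extracts 13-residue windows, labels them from a set of annotated positions, downsamples negatives to a 2:1 ratio, and concatenates per-residue index values into one vector. It is a minimal sketch under stated assumptions, not the authors' implementation: all function names and the two toy indices are hypothetical, the toy values are not real AAIndex entries, windows that would run past a chain end are simply skipped, and the position-by-position feature order is an arbitrary choice. The text gives 506 × 12 = 6072 for a 12-residue sample while the window holds 13 residues; the sketch does not resolve this and encodes whichever residues it is passed.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical stand-ins for AAIndex1 indices: each maps a residue to a number.
# The values are invented for illustration and are NOT real AAIndex entries.
TOY_INDEX_A = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}
TOY_INDEX_B = {aa: float(i % 2) for i, aa in enumerate(AMINO_ACIDS)}


def extract_windows(sequence, residue, flank=6):
    """Return (0-based center, peptide) for each `residue` with a full window.

    Assumption: centers closer than `flank` residues to either chain end are
    skipped; the source does not say how such cases were handled.
    """
    windows = []
    for i, aa in enumerate(sequence):
        if aa == residue and i >= flank and i + flank < len(sequence):
            windows.append((i, sequence[i - flank:i + flank + 1]))
    return windows


def label_windows(windows, phospho_positions):
    """Split windows by whether the 1-based center position is annotated."""
    positives = [p for i, p in windows if i + 1 in phospho_positions]
    negatives = [p for i, p in windows if i + 1 not in phospho_positions]
    return positives, negatives


def downsample_negatives(positives, negatives, ratio=2, seed=0):
    """Randomly keep at most `ratio` negatives per positive sample."""
    k = min(len(negatives), ratio * len(positives))
    return random.Random(seed).sample(negatives, k)


def encode_peptide(peptide, indices):
    """Concatenate, residue by residue, the value of every index."""
    return [index[aa] for aa in peptide for index in indices]


if __name__ == "__main__":
    seq = "AAAAAASAAAAAASAAAAAA"          # toy sequence, not a real protein
    pos, neg = label_windows(extract_windows(seq, "S"), {7})
    neg = downsample_negatives(pos, neg)
    vectors = [encode_peptide(p, [TOY_INDEX_A, TOY_INDEX_B]) for p in pos + neg]
    print(len(vectors), len(vectors[0]))
```

With 506 indices and 12 encoded residues, `encode_peptide` would return 506 × 12 = 6072 values, matching the dimension stated in the text.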
