
17th IFAC Symposium on System Identification

Beijing International Convention Center
October 19-21, 2015. Beijing, China

Available online at www.sciencedirect.com
ScienceDirect
IFAC-PapersOnLine 48-28 (2015) 017–021

A Novel Kinase-substrate Relation Prediction Method Based on Substrate Sequence Similarity and Phosphorylation Network

Haichun Li*, Xiaoyi Xu*, Huanqing Feng* and Minghui Wang**

*School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
**Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, China (E-mail: mhwang@ustc.edu.cn)

Abstract: Protein phosphorylation catalyzed by kinases plays essential roles in various intracellular processes. Therefore, the identification of potential relations between kinases and substrates is one of the key areas in post-translational modifications. Although a number of computational approaches have been designed, most existing kinase-substrate relation (KSR) prediction methods focus only on protein sequence information without considering the kinase-substrate network. In this paper, we propose a novel KSR prediction method called HeteSim-S, based on both substrate sequence similarity and the phosphorylation heterogeneous network, through the HeteSim algorithm, which has been used in previous studies of similarity search. Experimental results on the kinase-substrate heterogeneous network show that our method can effectively predict kinase-substrate relations, with the AUC reaching 0.842. Moreover, the AUC on specific kinases is up to 0.971. The results demonstrate that HeteSim-S can remarkably improve identification accuracy by incorporating substrate sequence similarity information into kinase-substrate heterogeneous networks.

© 2015, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

Keywords: phosphorylation, kinase-substrate relation, heterogeneous networks, HeteSim


1. INTRODUCTION

Phosphorylation catalyzed by protein kinases is one of the most crucial and widespread post-translational modifications of cellular proteins. Protein kinases influence cellular responses by phosphorylating specific substrates. The phosphorylation process modifies nearly one third of all proteins and plays essential roles in regulating numerous biological and physiological processes, such as the cell cycle and signal transduction (Olsen et al., 2006; Ubersax et al., 2007; Manning et al., 2002). If the corresponding kinases or substrates are aberrant, normal biochemical reactions are disrupted and signaling pathways may be rewired, which may result in a variety of diseases (Lahiry et al., 2010). The identification of potential kinase-substrate relations (KSRs) may therefore increase our understanding of these molecular processes.

To this end, many experimental efforts have been made to identify kinases and substrates together with the corresponding phosphorylation sites. Some existing methods, such as GPS (Xue et al., 2008), PPSP (Xue et al., 2006), KinasePhos (Wong et al., 2007), and Musite (Gao et al., 2010), can predict potential kinase-specific phosphorylation sites (P-sites), but these approaches mainly focus on identifying new P-sites and consequently achieve relatively poor performance for KSR prediction.

To address this limitation, Linding et al. proposed a method called NetworKIN (Linding et al., 2008), which identifies KSRs based on sequence similarities, using sequence motifs derived from Scansite (Obenauer et al., 2003) and NetPhosK (Yin et al., 2012). Xu et al. presented a novel SVM approach (Xu et al., 2014) based on protein-protein interactions and substrate structure features, and Song et al. introduced the IGPS method (Song et al., 2012), which adopts the prediction tool in GPS. Although these methods have succeeded in identifying new KSRs, most of them use only protein sequence information and ignore the kinase-substrate heterogeneous network, which may hamper the prediction performance for KSRs.

In this paper, we propose a novel KSR prediction method called HeteSim-S. It is based on the HeteSim algorithm for relevance search (Shi et al., 2014) and incorporates substrate sequence similarity information into the kinase-substrate heterogeneous network. It extends the HeteSim algorithm by treating substrate similarities as a new kind of edge in heterogeneous networks. To demonstrate the performance of the method, we compare it with the original HeteSim method and the LapRLS method (Chen et al., 2013). The experimental results show that substrate sequence similarity helps to achieve better identification performance in heterogeneous networks.

2. METHODS

2.1 Heterogeneous Information Network

A heterogeneous network is a special type of network because it includes various types of objects and various types of relations. Here we give the definition of a heterogeneous information network (Shi et al., 2014) as follows.

2405-8963 © 2015, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Peer review under responsibility of International Federation of Automatic Control.
10.1016/j.ifacol.2015.12.093

First, we introduce a schema $S_P = (\mathcal{A}, \mathcal{R})$, which includes a set of object types $\mathcal{A} = \{A\}$ and a set of relation types $\mathcal{R} = \{R\}$. A directed graph $G = \{V, E\}$ is then defined as an information network, where each object $v$ is related to a specific object type by a mapping function $\Phi(v) \in \mathcal{A}$ and each link $e$ is related to a specific relation by a mapping function $\varphi(e) \in \mathcal{R}$. The network is called a heterogeneous information network if the number of object types $|\mathcal{A}| > 1$ or the number of relation types $|\mathcal{R}| > 1$.

2.2 Search Path

A search path is a pathway that connects objects of different types through the associations between them. The search path $P$ is defined on the schema $S_P = (\mathcal{A}, \mathcal{R})$ as follows:

$$A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1} \tag{1}$$

The path $P$ represents a composite relation between types $A_1$ and $A_{l+1}$:

$$R = R_1 \circ R_2 \circ \cdots \circ R_l \tag{2}$$

where $\circ$ denotes the composition operator that links different relations, and $l$ is the length of $P$, i.e., the number of relations in the search path $P$.

Different search paths in a heterogeneous network have different semantics; therefore, the relation between two objects differs across search paths. Taking the kinase-substrate heterogeneous network in Fig. 1(a) as an example, substrates and kinases can be connected via the “Substrate-Substrate-Kinase” (SSK) path, the “Substrate-Substrate-Substrate-Kinase” (SSSK) path, and so on. The SSK path means that kinases can catalyze similar substrates, while the SSSK path means that kinases can catalyze substrates that are both similar to a certain substrate.

Fig. 1. (a) Kinase-substrate heterogeneous network based on substrate sequence similarities. The blue lines represent known kinase-substrate relations, and the green lines reflect the substrate sequence similarity score (from 0 to 1) between different substrates. (b) A simple example of a substrate similarity matrix.

2.3 HeteSim

For a given path $P = R_1 \circ R_2 \circ \cdots \circ R_l$, the HeteSim score (Shi et al., 2014) of two objects $s$ and $t$ (where $s \in R_1$ is the source object and $t \in R_l$ is the target object) is:

$$\mathrm{HeteSim}(s, t \mid R_1 \circ R_2 \circ \cdots \circ R_l) = \frac{1}{|O(s|R_1)|\,|I(t|R_l)|} \sum_{i=1}^{|O(s|R_1)|} \sum_{j=1}^{|I(t|R_l)|} \mathrm{HeteSim}\big(O_i(s|R_1),\, I_j(t|R_l) \mid R_2 \circ \cdots \circ R_{l-1}\big) \tag{3}$$

where $\mathrm{HeteSim}(s, t \mid P)$ is the similarity of objects $s$ and $t$, $O(s|R_1)$ is the set of out-neighbors of object $s$ along the path $P$, and $I(t|R_l)$ is the set of in-neighbors of object $t$ along $P$ in the opposite direction.

The value of the formula above ranges from 0 to 1. The higher the HeteSim value, the more similar the two objects are.

To optimize the calculation, the HeteSim formula can be written in the following normalized form:

$$\mathrm{NormHeteSim}(s, t \mid P) = \frac{PM_{P_L}(s,:)\; PM_{P_R^{-1}}(t,:)^{\top}}{\lVert PM_{P_L}(s,:) \rVert \, \lVert PM_{P_R^{-1}}(t,:) \rVert} \tag{4}$$

where $PM$ is the probability matrix defined below:

$$PM_P = U_{A_1,A_2} \cdots U_{A_{l-1},A_l} \tag{5}$$

where $U_{A_{l-1},A_l}$ represents the adjacency between the types $A_{l-1}$ and $A_l$, $PM_P(a,b)$ is the probability of reaching object $b \in A_l$ from object $a \in A_1$ along the path $P$, and $PM_P(s,:)$ denotes the $s$-th row of $PM_P$. In addition, the paths $P_L$ and $P_R$ are obtained by decomposing the path $P$ at the middle type, that is, $P_L = (A_1 A_2 \cdots A_{mid})$ and $P_R = (A_{mid} A_{mid+1} \cdots A_l)$.

2.4 HeteSim-S

HeteSim provides an innovative framework for kinase-substrate relation prediction in heterogeneous networks, but it focuses only on the transition matrices between different types of objects and does not consider protein sequence information, which may hamper the prediction performance for KSRs. In this paper, we therefore propose a new method called HeteSim-S, which utilizes both substrate sequence similarities and heterogeneous network information for kinase-substrate relation prediction.

As shown in Fig. 1(a), HeteSim-S extends the HeteSim algorithm by treating substrate sequence similarities as a new kind of link in KSR heterogeneous networks. As the substrate sequence similarity ranges from 0 to 1, HeteSim-S is defined as follows via the pre-defined SSK path:

$$\text{HeteSim-S}(s_i, k_j \mid SS \circ SK) = \frac{1}{|I(k_j|SK)| \sum_{a=1}^{|O(s_i|SS)|} r_{ia}\, O_a(s_i|SS)} \sum_{a=1}^{|O(s_i|SS)|} \sum_{b=1}^{|I(k_j|SK)|} r_{ia}\, \text{HeteSim-S}\big(O_a(s_i|SS),\, I_b(k_j|SK)\big) \tag{6}$$

where $r_{ia}$ represents the similarity score between substrates $s_i$ and $s_a$. For two local sequences $seq_i$ and $seq_a$ surrounding the phosphorylation sites in substrates $s_i$ and $s_a$, $r_{ia}$ is calculated by the following formula:

$$r_{ia}(seq_i, seq_a) = \sum_{j=1}^{15} \mathrm{Blosum62}\big(seq_i(j),\, seq_a(j)\big) \tag{7}$$

where Blosum62 denotes the BLOSUM62 matrix widely used for sequence alignment. For objects of the same type, the HeteSim-S value is defined as follows:

$$\mathrm{HeteSim}(s, t) = \begin{cases} 1, & s \text{ is the same as } t \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

Suppose the substrate similarity matrix is given as in Fig. 1(b). Then we have $O(s_1|SS) = \{s_1, s_2, s_3, s_4\}$ and $I(k_3|SK) = \{s_2, s_4\}$, and the score of substrate $s_1$ and kinase $k_3$ based on the SSK path can be calculated as follows:

$$\text{HeteSim-S}(s_1, k_3 \mid SS \circ SK) = \frac{1}{2 \sum_{a=1}^{4} r_{1a}\, O_a(s_1|SS)} \sum_{a=1}^{4} \sum_{b=1}^{2} r_{1a}\, \text{HeteSim-S}\big(O_a(s_1|SS),\, I_b(k_3|SK)\big)$$

which leads to a value of 0.241. After normalization by (4), the final HeteSim-S value between $s_1$ and $k_3$ is 0.559.

Fig. 2. The box graph shows the kinase-substrate relations of the 15 kinases, and the pie graph shows the proportion of each kinase's known relations among all known relations of the 15 kinases.
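
The un-normalized HeteSim-S score of Eq. (6) along the SSK path can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `hetesim_s_ssk`, the list-of-lists inputs `sim` and `known`, and the example values are all hypothetical, and the denominator term $\sum_a r_{ia} O_a(s_i|SS)$ is read here as the sum of similarity weights $\sum_a r_{ia}$. The made-up similarity values are not those of Fig. 1(b), so the sketch is not expected to reproduce the value 0.241.

```python
def hetesim_s_ssk(i, j, sim, known):
    """Un-normalized HeteSim-S score of substrate i and kinase j on the SSK path.

    sim[i][a]   : substrate similarity r_ia in [0, 1] (hypothetical input)
    known[a][j] : 1 if substrate a is a known substrate of kinase j, else 0
    """
    # O(s_i|SS): substrates linked to s_i by a non-zero similarity edge.
    out_nbrs = [a for a, r in enumerate(sim[i]) if r > 0]
    # I(k_j|SK): substrates with a known relation to kinase k_j.
    in_nbrs = [b for b, row in enumerate(known) if row[j]]
    if not out_nbrs or not in_nbrs:
        return 0.0
    # Assumption: sum_a r_ia * O_a(s_i|SS) in Eq. (6) is read as sum_a r_ia.
    weight_total = sum(sim[i][a] for a in out_nbrs)
    # Eq. (8): the inner term is 1 only when O_a and I_b are the same substrate.
    matched = sum(sim[i][a] for a in out_nbrs for b in in_nbrs if a == b)
    return matched / (len(in_nbrs) * weight_total)


# Hypothetical 4-substrate, 3-kinase toy network (values are made up).
sim = [
    [1.0, 0.5, 0.2, 0.3],
    [0.5, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.6],
    [0.3, 0.4, 0.6, 1.0],
]
known = [
    [1, 0, 0],  # s1 is a known substrate of k1
    [0, 0, 1],  # s2 is a known substrate of k3
    [0, 1, 0],  # s3 is a known substrate of k2
    [0, 0, 1],  # s4 is a known substrate of k3
]
score_s1_k3 = hetesim_s_ssk(0, 2, sim, known)
print(score_s1_k3)
```

In this toy network $O(s_1|SS)$ contains all four substrates and $I(k_3|SK) = \{s_2, s_4\}$, mirroring the structure of the worked example; only the similarity weights of the substrates shared by both sets contribute to the numerator.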

3. MATERIAL

3.1 Kinase-Substrate Relation Data

In this study, Phospho.ELM (9.0) (Dinkel et al., 2011) was selected as the phosphorylation dataset. It includes 37,145 experimentally identified human phosphorylation sites, of which 3,599 are annotated with the corresponding kinase information. As we focus on kinase-substrate relations, the data were pre-processed carefully. First, redundant phosphorylation entries were removed. Then the protein sequences were clustered by Blastclust (Dondoshansky et al., 2002) with a sequence similarity threshold of 70% to avoid overestimation. After that, 2,138 known kinase-substrate relations (1,734 substrates and 210 kinases) were finally obtained.

Moreover, to ensure reliable results, 15 common kinases were selected to test the performance in our work. Fig. 2 shows the detailed information of these 15 kinases.

3.2 Evaluation Settings

To evaluate the prediction performance, several measurements are calculated as benchmarks. Sensitivity (Sn) and specificity (Sp) are defined as the proportions of positive and negative relations, respectively, that are correctly predicted; precision (Pre) reflects the percentage of true positives among the predicted positives; accuracy (Acc) indicates the proportion of correct predictions in the test dataset; and the Matthews correlation coefficient (MCC) is regarded as a balanced measure that reflects the correlation between the observed and predicted binary classifications. The detailed definitions are shown below:

$$Acc = \frac{TN + TP}{TN + TP + FN + FP}$$

$$Sn = \frac{TP}{FN + TP}$$

$$Sp = \frac{TN}{TN + FP}$$

$$Pre = \frac{TP}{FP + TP}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TN + FN) \times (TN + FP) \times (TP + FN) \times (TP + FP)}}$$

where TN, TP, FN, and FP represent true negatives, true positives, false negatives, and false positives, respectively. In addition, to further evaluate the performance, the receiver operating characteristic (ROC) curve is plotted and the area under the ROC curve (AUC) is calculated.
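
The five threshold-based measures defined above follow directly from the four confusion-matrix counts. The following Python sketch shows one way to compute them; the function name `classification_metrics`, the zero-denominator guard, and the example counts are illustrative assumptions and do not come from the paper.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return Acc, Sn, Sp, Pre and MCC from confusion-matrix counts."""

    def ratio(num, den):
        # Assumption: a measure with an empty denominator is reported as 0.0;
        # the paper does not say how such cases are handled.
        return num / den if den else 0.0

    acc = ratio(tp + tn, tp + tn + fp + fn)
    sn = ratio(tp, tp + fn)    # sensitivity: recovered positives
    sp = ratio(tn, tn + fp)    # specificity: recovered negatives
    pre = ratio(tp, tp + fp)   # precision: true positives among predictions
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp)))
    return {"Acc": acc, "Sn": sn, "Sp": sp, "Pre": pre, "MCC": mcc}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because known kinase-substrate relations are far rarer than unknown pairs, MCC is the most informative of these measures here: unlike Acc, it stays low when a predictor simply labels almost everything negative.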


4. RESULTS

Leave-one-out cross validation (LOOCV) is performed on the known, experimentally verified kinase-substrate relations to evaluate the performance of HeteSim-S. In each run of LOOCV, one known kinase-substrate relation is left out by setting its entry in the adjacency matrix to 0. This KSR is regarded as the test sample, while all other kinase-substrate relations are regarded as training samples.

In this article, the proposed HeteSim-S method is compared with different algorithms in two ways, overall performance and performance for specific kinases, based on different benchmarks.

4.1 Overall Performance

We first compare the overall performance for all kinases and substrates (1,734 substrates and 210 kinases) between our method and the HeteSim algorithm, which adopts protein-protein interactions in the heterogeneous information network instead of substrate sequence similarities. We choose the SSK path for both HeteSim and HeteSim-S in this work, and we use random prediction as a control.

Fig. 3 shows the ROC curves and AUC values of these three prediction methods. HeteSim-S is consistently better than the other two methods: its ROC curve always lies above those of HeteSim and random prediction, and its AUC value is 0.842, which improves on random prediction and HeteSim by 34.9% and 3.5%, respectively.

Fig. 3. ROC curves of HeteSim, HeteSim-S and random prediction under the LOOCV scheme.

4.2 Performance Evaluation for Specific Kinases

To further evaluate the performance of our method, we test each of the 15 selected kinases separately. We first calculate the AUC of HeteSim-S and HeteSim for each of these 15 kinases. As shown in Table 1, HeteSim-S performs better than HeteSim for every kinase except CDK2 and PLK1, and the average AUC of HeteSim-S reaches 0.875 (87.5%), which is 19.7 percentage points higher than that of HeteSim. Moreover, for the kinase ATM, the AUC of our method is up to 0.971.

Table 1. AUC values of HeteSim and HeteSim-S

Kinase          HeteSim   HeteSim-S
CDK1            0.647     0.908
PKC_alpha       0.581     0.894
CK2_alpha       0.546     0.962
SRC             0.664     0.883
MAPK1           0.550     0.884
MAPK3           0.565     0.879
ATM             0.873     0.971
CDK2            0.876     0.865
MAPK14          0.851     0.888
GSK3_beta       0.690     0.824
Lck             0.670     0.793
PLK1            0.739     0.706
EGFR            0.904     0.937
SYK             0.850     0.851
Fyn             0.807     0.876
Average value   0.678     0.875

We then introduce another algorithm, LapRLS, which, like HeteSim-S, uses substrate sequence similarities and known kinase-substrate relations. We choose four kinases to compare HeteSim-S, HeteSim and LapRLS based on all the benchmarks mentioned above.

As shown in Fig. 4, HeteSim-S performs consistently better than the other two algorithms. For instance, for the kinase MAPK3 our method achieves an AUC of 0.879, which is 4.3 and 31.4 percentage points higher than the results of LapRLS and HeteSim, respectively.

Fig. 4. ROC curves and AUC values of different algorithms.

In addition, a comparison based on Acc, Pre, Sn, Sp, and MCC has also been carried out. Taking PKC_alpha as an example, Table 2 shows the performance at fixed specificities of Sp=0.99 and Sp=0.95. With HeteSim-S, the Acc, Sn, Pre and MCC increase by 0.5, 8.9, 6.7 and 6.8 percentage points (Sp=0.95, compared with LapRLS); by 2.2, 36.6, 27.3 and 31.8 percentage points (Sp=0.95, compared with HeteSim); by 0.2, 3.4, 5.3 and 4.7 percentage points (Sp=0.99, compared with LapRLS); and by 1.1, 17.7, 50.9 and 30.4 percentage points (Sp=0.99, compared with HeteSim). In general, our method largely outperforms the other two methods.
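
Evaluating at a fixed specificity, as in the Sp=0.99 and Sp=0.95 comparison above, amounts to choosing a score threshold such that the required fraction of negative samples falls at or below it, and then reading off the other measures at that threshold. The Python sketch below illustrates the idea; the function name `sensitivity_at_specificity`, the strict "score above threshold" decision rule, and the example scores are assumptions made for illustration, since the paper does not describe how its thresholds were selected.

```python
def sensitivity_at_specificity(pos_scores, neg_scores, target_sp):
    """Return (threshold, specificity, sensitivity) at the lowest threshold
    whose specificity is at least target_sp.

    A sample is predicted positive when its score is strictly above the
    threshold (an assumed decision rule).
    """
    # Candidate thresholds: every observed score, tried from low to high.
    for thr in sorted(set(pos_scores) | set(neg_scores)):
        tn = sum(1 for s in neg_scores if s <= thr)
        sp = tn / len(neg_scores)
        if sp >= target_sp:
            tp = sum(1 for s in pos_scores if s > thr)
            return thr, sp, tp / len(pos_scores)
    raise ValueError("no threshold reaches the requested specificity")


# Hypothetical prediction scores for known (positive) and unknown (negative) pairs.
pos = [0.9, 0.8, 0.4, 0.3]
neg = [0.05, 0.1, 0.12, 0.15, 0.2, 0.22, 0.25, 0.3, 0.32, 0.5]

thr, sp, sn = sensitivity_at_specificity(pos, neg, target_sp=0.9)
print(thr, sp, sn)
```

Raising the target specificity pushes the threshold upward and can only keep or lower the sensitivity, which is the same trade-off visible between the Sp=0.95 and Sp=0.99 settings.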


Table 2. Comparison for PKC_alpha

               Sp=0.99                          Sp=0.95
       LapRLS  HeteSim  HeteSim-S       LapRLS  HeteSim  HeteSim-S
Acc    0.936   0.927    0.938           0.915   0.898    0.920
Sn     0.152   0.009    0.186           0.402   0.125    0.491
Pre    0.515   0.059    0.568           0.357   0.151    0.424
MCC    0.255   -0.002   0.302           0.333   0.083    0.401

5. CONCLUSION

Traditional experimental identification of KSRs by verifying every possible relation has proved ineffective, as it is a labor-intensive, expensive and time-consuming process. Using computational approaches that exploit known information to predict the kinase-substrate relations most likely to exist is therefore one of the key areas in KSR prediction. Although some computational algorithms have been developed to identify potential kinase-substrate relations, most of these methods use only protein sequence information without considering the phosphorylation network. In this study, we proposed a novel KSR prediction method named HeteSim-S, which is based on both substrate sequence similarity and the heterogeneous network through the HeteSim framework.

In this paper, we compared HeteSim-S with different algorithms in different ways, based on different benchmarks. Experimental results show that our method predicts kinase-substrate relations more effectively than HeteSim and LapRLS. In particular, protein sequence information helps to achieve better identification performance in kinase-substrate heterogeneous networks. Further efforts could be made to find more effective methods for KSR prediction in the future.

ACKNOWLEDGEMENT

This work is supported by National Natural Science Foundation of China (No. 61471331 and No. 61101061).

REFERENCES

Chen, X., & Yan, G. Y. (2013). Novel human lncRNA–disease association inference based on lncRNA expression profiles. Bioinformatics, btt426.

Dinkel, H., Chica, C., Via, A., Gould, C. M., Jensen, L. J., Gibson, T. J., & Diella, F. (2011). Phospho.ELM: a database of phosphorylation sites - update 2011. Nucleic acids research, 39(suppl 1), D261-D267.

Dondoshansky, I., & Wolf, Y. (2002). Blastclust (ncbi software development toolkit). NCBI, Bethesda, Md.

Gao, J., Thelen, J. J., Dunker, A. K., & Xu, D. (2010). Musite, a tool for global prediction of general and kinase-specific phosphorylation sites. Molecular & Cellular Proteomics, 9(12), 2586-2600.

Lahiry, P., Torkamani, A., Schork, N. J., & Hegele, R. A. (2010). Kinase mutations in human disease: interpreting genotype–phenotype relationships. Nature Reviews Genetics, 11(1), 60-74.

Linding, R., Jensen, L. J., Pasculescu, A., Olhovsky, M., Colwill, K., Bork, P., ... & Pawson, T. (2008). NetworKIN: a resource for exploring cellular phosphorylation networks. Nucleic acids research, 36(suppl 1), D695-D699.

Manning, G., Whyte, D. B., Martinez, R., Hunter, T., & Sudarsanam, S. (2002). The protein kinase complement of the human genome. Science, 298(5600), 1912-1934.

Obenauer, J. C., Cantley, L. C., & Yaffe, M. B. (2003). Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic acids research, 31(13), 3635-3641.

Olsen, J. V., Blagoev, B., Gnad, F., Macek, B., Kumar, C., Mortensen, P., & Mann, M. (2006). Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell, 127(3), 635-648.

Shi, C., Kong, X., Huang, Y., Philip, S. Y., & Wu, B. (2014). HeteSim: A General Framework for Relevance Measure in Heterogeneous Networks. IEEE Transactions on Knowledge & Data Engineering, (10), 2479-2492.

Song, C., Ye, M., Liu, Z., Cheng, H., Jiang, X., Han, G., ... & Zou, H. (2012). Systematic analysis of protein phosphorylation networks from phosphoproteomic data. Molecular & Cellular Proteomics, 11(10), 1070-1083.

Ubersax, J. A., & Ferrell Jr, J. E. (2007). Mechanisms of specificity in protein phosphorylation. Nature reviews Molecular cell biology, 8(7), 530-541.

Wong, Y. H., Lee, T. Y., Liang, H. K., Huang, C. M., Wang, T. Y., Yang, Y. H., ... & Hwang, J. K. (2007). KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic acids research, 35(suppl 2), W588-W594.

Xu, X., Li, A., Zou, L., Shen, Y., Fan, W., & Wang, M. (2014). Improving the performance of protein kinase identification via high dimensional protein–protein interactions and substrate structure data. Molecular BioSystems, 10(3), 694-702.

Xue, Y., Li, A., Wang, L., Feng, H., & Yao, X. (2006). PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC bioinformatics, 7(1), 163.

Xue, Y., Ren, J., Gao, X., Jin, C., Wen, L., & Yao, X. (2008). GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy. Molecular & Cellular Proteomics, 7(9), 1598-1608.

Yin, Z., & Tan, J. (2012, August). New encoding schemes for prediction of protein Phosphorylation sites. In Systems Biology (ISB), 2012 IEEE 6th International Conference on (pp. 56-62). IEEE.
