

Journal of Protein Chemistry, Vol. 22, No. 4, May 2003 (© 2003)

Application of Pseudo Amino Acid Composition for Predicting Protein Subcellular Location: Stochastic Signal Processing Approach

Yu-Xi Pan,1,3 Zhi-Zhou Zhang,1 Zong-Ming Guo,1 Guo-Yin Feng,1 Zhen-De Huang,1
and Lin He1,2

January 27, 2003

The function of a protein is closely correlated with its subcellular location. With the success of the human genome project and the rapid increase in the number of newly found protein sequences entering data banks, it is highly desirable to develop an automated method for predicting the subcellular location of proteins. Such a predictor would expedite the functional determination of newly found proteins and the process of prioritizing genes and proteins identified by genomics efforts as potential molecular targets for drug design. Based on the concept of pseudo amino acid composition originally proposed by K. C. Chou (Proteins: Struct. Funct. Genet. 43: 246–255, 2001), the digital signal processing approach has been introduced to partially incorporate the sequence order effect. One of the remarkable merits of doing so is that many existing tools in mathematics and engineering can be straightforwardly used in predicting protein subcellular location. The results thus obtained are quite encouraging. It is anticipated that digital signal processing may serve as a useful vehicle for many other areas of protein science as well.

KEY WORDS: Quasi-sequence order effect; covariant-discriminant algorithm; Mahalanobis distance; Chou’s
invariance theorem; low/high-pass Butterworth filter; bioinformatics; proteomics.

1 Bio-X Life Science Research Center, Shanghai Jiao Tong University, Shanghai, China.
2 Shanghai Institutes for Biological Sciences, Chinese Academy of Science, Shanghai, China.
3 To whom correspondence should be addressed at Bio-X Life Science Research Center, Shanghai Jiao Tong University, Shanghai, 200030, China. E-mail: wzhong@chartermi.net

0277-8033/03/0500-0395/0 © 2003 Plenum Publishing Corporation

1. INTRODUCTION

As a result of the development of high-throughput sequencing technology, with its improving efficiency and decreasing cost, the data in various biological databases have been increasing at an unprecedented speed in recent years (Chou and Elrod, 1999a, 1999b). The explosion of biological data challenges the ability of biologists and computer scientists to analyze these data quickly. The function of a protein is closely correlated with its subcellular location (Chou, 2000b). Although the subcellular location of a protein can be determined by conducting various locational determination experiments, it is time-consuming and costly to acquire this kind of knowledge solely from experimental measurements. Confronted with the enormous data sets in the areas of genome and proteome, it is highly desirable to develop an efficient method to predict the subcellular location of a new protein so as to expedite the process of deducing its function. Many efforts have been made in this regard (Cai and Chou, 2000; Cai et al., 2000, 2002a, 2002b; Cedano et al., 1997; Chou, 2000a, 2000b; Chou and Cai, 2002; Chou and Elrod, 1998, 1999a, 1999b; Nakashima and Nishikawa, 1994; Reinhardt and Hubbard, 1998; Zhou and Doctor, 2003). Indeed, a new branch of proteomics, the so-called “Prediction of Protein Cellular Attributes” (Chou, 2002), has emerged. It should be pointed out that most of the existing algorithms were based on the amino acid composition alone (Cedano et al., 1997; Chou and Elrod, 1998, 1999a, 1999b; Nakashima and Nishikawa, 1994; Reinhardt and Hubbard, 1998; Zhou and Doctor, 2003). Although this is a reasonable approximate method and did yield some
encouraging results (Chou, 2000b), the sequence order information should also be taken into account as a logical step for further development in this area. However, owing to the extreme variation in both sequence order and sequence length, it is very difficult to formulate a prediction algorithm that can incorporate the sequence order effects as well. An important advance in this regard is the concept of pseudo amino acid composition recently proposed by Chou (2001). The essence of the pseudo amino acid composition is that, on one hand, it bears the main feature of the amino acid composition and, on the other, it also contains some elements that reflect the sequence order effects.

This study was initiated in an attempt to use the concept of pseudo amino acid composition and the stochastic signal processing approach to develop a new method for predicting protein subcellular location.

2. METHOD

A protein sequence is composed of a series of amino acids represented by the characters A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. As linguistic symbols, these characters cannot take part in any mathematical computation, because the elements of the sequence have no numerical values. To describe a protein sequence quantitatively, what kind of code should be used to represent a given amino acid? Actually, any numerical code would work as long as it can distinguish one amino acid from another. For simplicity, let us just assign: A = 10, C = 20, D = 30, E = 40, F = 50, G = 60, H = 70, I = 80, K = 90, L = 100, M = 110, N = 120, P = 130, Q = 140, R = 150, S = 160, T = 170, V = 180, W = 190, and Y = 200. Through this encoding procedure a protein sequence is transformed into a series of digital signals. Thus, all the existing tools in the area of digital signal processing (DSP) can be used straightforwardly for the current study.

DSP is defined as a process of sampling, transforming, synthesizing, estimating, and recognizing signals through numerical computing with computers and other special equipment so as to extract useful information from the signals. Consider a protein sequence

$$R_1 R_2 R_3 R_4 R_5 R_6 R_7 \cdots R_N \tag{1}$$

where $R_1$ is the 1st residue, $R_2$ the 2nd, and so forth. To utilize the technique of DSP, let us first represent the protein sequence by a stochastic digital signal, i.e.,

$$\{x_i\}, \quad 1 \le i \le N \tag{2}$$

where $x_i$ is the numerical code for $R_i$ (i = 1, 2, ..., N). Thus, we can immediately generate five characteristic parameters as formulated below. The first one is the signal mean value, given by

$$p_1 = M(x) = \frac{1}{N} \sum_{i=1}^{N} x(i) \tag{3}$$

The second parameter is called the standard deviation, given by

$$p_2 = SD(x) = \frac{1}{N} \sum_{i=1}^{N} \left| x(i) - M(x) \right| \tag{4}$$

The SD reflects how a signal is dispersed around its mean value. The smaller the SD is, the more closely the signal is concentrated around its mean value. The third parameter is the variance, defined by

$$p_3 = \mathrm{var}(x) = \frac{1}{N} \sum_{i=1}^{N} \left[ x(i) - M(x) \right]^2 \tag{5}$$

The fourth and fifth characteristic parameters are the low-pass and high-pass Butterworth filter contents (Candy, 1988; Jones, 1982; Tretter, 1990). The fourth is given by

$$p_4 = \mathrm{corrcoef}(x, y_L) = \frac{\mathrm{cov}(x, y_L)}{\mathrm{std}(x) \cdot \mathrm{std}(y_L)} \tag{6}$$

where $y_L$ is the output signal of the low-pass Butterworth filtering process with the input signal $\{x_i\}$. Thus, $y_L$ represents the low-frequency component of the digital signal $\{x_i\}$. The low-pass filtering process is given by the following expression

$$y_L(n) = -\sum_{k=2}^{26} a(k)\, y_L(n-k+1) + \sum_{k=1}^{26} b(k)\, x(n-k+1) \tag{7}$$

where {a(i)} and {b(i)} are two sets of parameters of the low-pass digital filter, with their values listed in Table 1. The filtering process expressed by Eq. 7 and the way the filter parameters {a(i)} and {b(i)} are determined can be found in the aforementioned reference books. In Eq. 6, corrcoef(x, $y_L$) is the correlation coefficient of the two digital signals x and $y_L$; it reflects the extent to which the two signals are correlated.
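The encoding step and the first three characteristic parameters (Eqs. 2–5) are simple enough to sketch directly. The following is a minimal illustration, not the authors' code: the function names are hypothetical, the four-residue sequence is an invented example, and `sd` follows Eq. 4 exactly as printed (a mean absolute deviation).

```python
# Numerical codes from the text: A = 10, C = 20, ..., Y = 200.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: 10 * (i + 1) for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map a protein sequence to the digital signal {x_i} of Eq. 2."""
    return [CODE[aa] for aa in sequence]


def mean(x):
    """p1: signal mean value M(x), Eq. 3."""
    return sum(x) / len(x)


def sd(x):
    """p2: dispersion around the mean, implemented as printed in Eq. 4."""
    m = mean(x)
    return sum(abs(v - m) for v in x) / len(x)


def variance(x):
    """p3: variance, Eq. 5 (normalised by N)."""
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / len(x)


signal = encode("ACDA")   # hypothetical four-residue example
print(signal)             # [10, 20, 30, 10]
print(mean(signal), sd(signal), variance(signal))   # 17.5 7.5 68.75
```

Because the codes are arbitrary labels, the numerical values of these parameters depend on the particular assignment chosen in the text.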

Table 1. Parameters of {a(i)} and {b(i)} for the Low-Pass Butterworth Filter in this Work (cf. Eq. 7)

Index: i 1 2 3 4 5 6 7 8 9 10 11 12 13
Values of {a(i)} 1.00 5.00 15.11 32.49 54.80 75.51 87.53 86.78 74.49 55.78 36.63 21.16 10.77
Values of {b(i)} 1.68e-5 4.21e-4 5.05e-3 3.87e-2 0.21 0.89 2.98 8.09 18.2 34.38 55.01 75.01 87.51

Index: i 14 15 16 17 18 19 20 21 22 23 24 25 26

Values of {a(i)} 4.82 1.90 0.65 0.20 0.05 1.13e-2 2.11e-3 3.27e-4 4.08e-5 3.95e-6 2.79e-7 1.27e-8 2.83e-10
Values of {b(i)} 87.51 75.01 55.01 34.38 18.20 8.09 2.98 0.89 0.21 0.04 5.05e-3 4.21e-4 1.68e-5
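The difference equation of Eq. 7 and the correlation coefficient of Eq. 6 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: samples before the start of the signal are taken as zero, std(·) is assumed to use the same N − 1 normalisation as the covariance, and the two-tap filter in the example is a hypothetical toy, not the 26-coefficient filter of Table 1. The coefficients in Table 1 are printed to only a few digits, so a filter built from them may behave differently from the authors' full-precision filter.

```python
import math


def iir_filter(b, a, x):
    """Difference equation of Eq. 7 with 0-based lists; assumes a[0] == 1.

    Samples before the start of the signal are taken as zero
    (an assumption; the paper does not state initial conditions).
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(b)):        # feed-forward part: sum of b(k) x(n-k)
            if n - k >= 0:
                acc += b[k] * x[n - k]
        for k in range(1, len(a)):     # feedback part: minus sum of a(k) y(n-k)
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def corrcoef(x, y):
    """Correlation coefficient cov(x, y) / (std(x) * std(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y)) / (n - 1)
    # std(.) is assumed to use the N - 1 normalisation as well.
    sx = math.sqrt(sum((u - mx) ** 2 for u in x) / (n - 1))
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / (n - 1))
    return cov / (sx * sy)


# Toy two-tap averaging filter (hypothetical coefficients, not Table 1).
b, a = [0.5, 0.5], [1.0, 0.0]
x = [10, 20, 30, 10, 200, 50]
y_low = iir_filter(b, a, x)
print(y_low)                 # [5.0, 15.0, 25.0, 20.0, 105.0, 125.0]
print(corrcoef(x, y_low))    # plays the role of p4 for this toy filter
```

The same two functions give the fifth parameter when the high-pass coefficients {c(i)} and {d(i)} are passed in place of {a(i)} and {b(i)}.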

In Eq. 6, cov(x, $y_L$) represents the covariance of the two signals x and $y_L$, given by

$$\mathrm{cov}(x, y_L) = \frac{1}{N-1} \sum_{i=1}^{N} \bigl( x(i) - M(x) \bigr) \bigl( y_L(i) - M(y_L) \bigr) \tag{8}$$

where N is the length of the amino acid sequence. It can be seen from the above equations that the characteristic parameter $p_4$ incorporates some sequence order effect. Similarly, we have

$$p_5 = \mathrm{corrcoef}(x, y_H) = \frac{\mathrm{cov}(x, y_H)}{\mathrm{std}(x) \cdot \mathrm{std}(y_H)} \tag{9}$$

where

$$y_H(n) = -\sum_{k=2}^{26} c(k)\, y_H(n-k+1) + \sum_{k=1}^{26} d(k)\, x(n-k+1) \tag{10}$$

and

$$\mathrm{cov}(x, y_H) = \frac{1}{N-1} \sum_{i=1}^{N} \bigl( x(i) - M(x) \bigr) \bigl( y_H(i) - M(y_H) \bigr) \tag{11}$$

where $y_H$ is the output signal of the high-pass Butterworth filtering process with the input signal $\{x_i\}$. Similarly, {c(i)} and {d(i)} are the parameters of the high-pass digital filter, and their values are given in Table 2. Again, the filtering process expressed by Eq. 10 and the way the filter parameters {c(i)} and {d(i)} are determined can be found in the same reference books.

Thus, by following exactly the same procedure as described by Chou (2001), a protein can be expressed by a vector or a point in a 25-D (dimensional) space, i.e.,

$$X = \begin{pmatrix} x_1 \\ \vdots \\ x_{20} \\ x_{20+1} \\ \vdots \\ x_{20+5} \end{pmatrix} \tag{12}$$

where

$$x_k = \begin{cases} \dfrac{f_k}{\sum_{i=1}^{20} f_i + \sum_{j=1}^{5} w_j p_j}, & 1 \le k \le 20 \\[2ex] \dfrac{w_{k-20}\, p_{k-20}}{\sum_{i=1}^{20} f_i + \sum_{j=1}^{5} w_j p_j}, & 21 \le k \le 25 \end{cases} \tag{13}$$

where $f_k$ is the normalized occurrence frequency of the kth of the 20 amino acids in the protein X, $p_i$ is the ith additional characteristic parameter derived from digital signal processing theory (see Eqs. 2–11), and $w_i$ is the weight factor for the ith parameter $p_i$.
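The construction of the 25 components in Eqs. 12–13 can be sketched as below. The function name is hypothetical, the five parameter values in the example are invented for illustration, and the weights are passed in as an argument (the example uses the values the paper lists in Eq. 14).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseudo_composition(sequence, params, weights):
    """Build the 25 components x_1..x_25 of Eq. 13.

    params  : the five parameters p_1..p_5 (assumed already computed)
    weights : the five weight factors w_1..w_5
    """
    n = len(sequence)
    freqs = [sequence.count(aa) / n for aa in AMINO_ACIDS]   # f_1..f_20
    weighted = [w * p for w, p in zip(weights, params)]      # w_j * p_j
    denom = sum(freqs) + sum(weighted)                       # shared denominator
    return [v / denom for v in freqs + weighted]


# Hypothetical parameter values, for illustration only.
vec = pseudo_composition("ACDA", [17.5, 7.5, 68.75, 0.9, -0.2],
                         [0.5, 0.015, 0.005, 0.5, 0.5])
print(len(vec))   # 25
print(sum(vec))   # close to 1 by construction
```

Since every component shares the same denominator, the 25 values always sum to 1, which is the normalization condition that makes them linearly dependent.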

Table 2. Parameters of {c(i)} and {d(i)} for the High-Pass Butterworth Filter in this Work (cf. Eq. 10)

Index: i 1 2 3 4 5 6 7 8 9 10 11 12 13

Values of {c(i)} 1.00 −5.00 15.11 −32.49 54.80 −75.51 87.53 −86.78 74.49 −55.78 36.63 −21.16 10.77
Values of {d(i)} 1.68e-5 −4.21e-4 5.05e-3 −3.87e-2 0.21 −0.89 2.98 −8.09 18.20 −34.38 55.01 −75.01 87.51

Index: i 14 15 16 17 18 19 20 21 22 23 24 25 26

Values of {c(i)} −4.82 1.90 −0.65 0.20 −0.05 0.01 −2.11e-3 3.27e-4 −4.08e-5 3.95e-6 −2.79e-7 1.27e-8 −2.83e-10
Values of {d(i)} −87.51 75.01 −55.01 34.38 −18.20 8.09 −2.98 0.89 −0.21 3.87e-2 −5.05e-3 4.21e-4 −1.68e-5
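Comparing Table 2 with Table 1 entry by entry suggests that the printed high-pass coefficients are the low-pass ones with the sign of every second entry flipped, i.e. c(i) = (−1)^(i−1) a(i) and d(i) = (−1)^(i−1) b(i), up to rounding. This is an observation about the printed values rather than a statement made by the authors; it is consistent with the usual way of mirroring a digital filter's frequency response by replacing z with −z. A minimal check on the first six entries of {a(i)} and {c(i)}, with a hypothetical helper name:

```python
def alternate_signs(coeffs):
    """Flip the sign of every second coefficient (2nd, 4th, ... entry)."""
    return [c if i % 2 == 0 else -c for i, c in enumerate(coeffs)]


# First six values of {a(i)} and {c(i)} as printed in Tables 1 and 2.
a_head = [1.00, 5.00, 15.11, 32.49, 54.80, 75.51]
c_head = [1.00, -5.00, 15.11, -32.49, 54.80, -75.51]
print(alternate_signs(a_head) == c_head)   # True
```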

In this work, we chose the weight factors $w_i$ for the five characteristic parameters $p_i$ as

$$\begin{pmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \\ w_5 \end{pmatrix} = \begin{pmatrix} 0.5 \\ 0.015 \\ 0.005 \\ 0.5 \\ 0.5 \end{pmatrix} \tag{14}$$

The purpose of using the weight factors is to rescale each of these five characteristic parameters and limit their magnitudes to the same range as that of the 20 amino acid composition components of the protein. According to our preliminary computed results, the values of the five characteristic parameters ($p_1$, $p_2$, $p_3$, $p_4$, $p_5$) are distributed in quite different ranges. Among the 25 pseudo amino acid components in Eq. 12, a component with an oversized magnitude might hide the contribution from all the others. Therefore, before knowing exactly how strong a role each of the five characteristic parameters plays, it is wise to use the weight factors shown in Eq. 14 to adjust their magnitudes so that they fall within the same range as the 20 amino acid composition components. As we can see from Eqs. 12–13, the 25-D vector of a protein contains information about not only the amino acid composition but also some of the sequence order effects.

Now we can directly use the augmented covariant-discriminant algorithm developed by Chou (2000a) to conduct the prediction. The covariant-discriminant algorithm is a combination of the Mahalanobis distance (Mahalanobis, 1936; Pillai, 1985) and Chou’s invariance theorem (Chou, 1995; Zhou and Doctor, 2003). The details of the algorithm can be found in some recent papers (Chou, 2000a, 2001), and hence there is no need to repeat them here. It is instructive, however, to point out that, because of the normalization condition imposed by Eq. 13, the 20 + 5 components of the pseudo amino acid composition are not independent. Therefore, a dimension-reducing operation, i.e., leaving out one of the components so that the rest are completely independent, is needed when using the augmented covariant-discriminant algorithm; that is, a protein should be defined in a 24-D space instead of a 25-D space. Otherwise, a divergence difficulty will occur. However, which one of the 25 components should be removed? Any one. The reason is that, according to Chou’s invariance theorem (Chou, 1995), the value of the covariant discriminant function remains the same regardless of which one of the 25 components is left out.

3. RESULTS AND DISCUSSION

For testing our approach, the data set constructed by Chou and Elrod (1999b) was adopted, since it covers 12 subcellular locations and is much more complete and practical than those constructed by other investigators. However, as mentioned in Chou (2001), owing to changes of code names, the sequences of some proteins could no longer be retrieved from the SWISS-PROT databank. Of the 2319 proteins originally listed in Appendix A of Chou and Elrod (1999b), 2191 protein sequences were retrieved. They form the data set S12, which consists of 145 chloroplast, 571 cytoplasm, 34 cytoskeleton, 49 endoplasmic reticulum, 224 extracellular, 25 Golgi apparatus, 37 lysosome, 84 mitochondria, 272 nucleus, 27 peroxisome, 699 plasma membrane, and 24 vacuole proteins. For the convenience of readers who are not trained as cellular biologists, a schematic illustration is given in Fig. 1 to show the 12 different subcellular locations of proteins.

The prediction quality was examined by two methods, the self-consistency test and the jackknife test. In the self-consistency test (also called the re-substitution test), the subcellular location of each protein in a given data set was in turn identified using the rules derived from the same data set, the so-called training data set. The prediction rates thus obtained for the 12 subcellular locations of the 2191 proteins are summarized in Table 3, from which we can see that 1785 proteins were correctly predicted for their subcellular locations and 406 proteins were incorrectly predicted. The overall success rate is 81.5%, indicating quite high self-consistency. However, it should be pointed out that, during the re-substitution test, the rule parameters derived from the training data set include the information of the query protein that is later plugged back in the test. This will certainly underestimate the error and enhance the success rate, because the same proteins are used both to derive the rule parameters and to test them. Accordingly, the success rate thus obtained represents an optimistic estimate (Cai, 2001; Chou, 1995; Chou and Zhang, 1994; Zhou, 1998; Zhou and Assa-Munt, 2001; Zhou and Doctor, 2003). Nevertheless, the re-substitution test is absolutely necessary because it reflects the self-consistency of a prediction method, especially of its algorithm part. A prediction algorithm certainly cannot be deemed a good one if its self-consistency is poor. In other words, the re-substitution test is necessary but not sufficient for evaluating a prediction method. As a complement, a cross-validation test on an independent testing data set is needed because it can reflect the effectiveness of a prediction method in practical application. This is important
especially for checking the validity of a training data set: whether it contains sufficient information to reflect all the important features concerned so as to yield a high success rate in practical application.

Fig. 1. Schematic illustration to show the twelve subcellular locations of proteins: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracell, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole. Note that the vacuole and chloroplast proteins exist only in plants. [Reproduced from Fig. 2 of Chou (2001) with permission.]

As is well known, the independent data set test, the sub-sampling test, and the jackknife test are the three methods often used for cross-validation in statistical prediction. Among these three, however, the jackknife test is deemed the most effective and objective one; see, e.g., a relevant review (Chou and Zhang, 1995) for a comprehensive discussion and a monograph (Mardia et al., 1979) for the mathematical principle. During jackknifing, each protein in the data set is in turn singled out as a tested protein and all the rule parameters are calculated based on the remaining proteins. In other words, the subcellular location of each protein is identified by the rules derived using all the other proteins except the one that is being identified. During the process of jackknifing, both the training data set and the testing data set are actually open, and a protein will in turn move from one to the other. The results of the jackknife test thus obtained for the 2191 proteins are also given in Table 3, from which we can see that 1483 proteins were correctly predicted for their subcellular locations whereas 708 proteins were incorrectly predicted. The overall success rate is 67.7%.

Moreover, as a demonstration of practical application, predictions were also conducted for an independent data set using the rule parameters derived from the 2191 proteins in the training data set.
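The three evaluation protocols described in this section (self-consistency, jackknife, and independent data set) differ only in which proteins are used to derive the rule parameters. The sketch below makes that difference explicit. It is not the authors' method: a nearest-centroid rule with Euclidean distance stands in for the augmented covariant-discriminant algorithm, and the five 2-D points are hypothetical toy data.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train(samples):
    """samples: list of (vector, label). Returns one centroid per label."""
    by_label = {}
    for vec, label in samples:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vs) for label, vs in by_label.items()}


def predict(model, vec):
    """Stand-in rule: nearest centroid by squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(model, key=lambda label: dist(model[label]))


def self_consistency(samples):
    model = train(samples)                     # rules derived from all proteins
    return sum(predict(model, v) == y for v, y in samples)


def jackknife(samples):
    correct = 0
    for i, (v, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]   # leave the query protein out
        correct += predict(train(rest), v) == y
    return correct


def independent(train_samples, test_samples):
    model = train(train_samples)               # test proteins never seen in training
    return sum(predict(model, v) == y for v, y in test_samples)


# Hypothetical 2-D toy data with two locations.
data = [([0.0, 0.0], "cytoplasm"), ([0.2, 0.1], "cytoplasm"),
        ([1.0, 1.0], "nucleus"), ([0.9, 1.2], "nucleus"),
        ([0.45, 0.5], "nucleus")]
print(self_consistency(data), jackknife(data))   # 5 4
```

On this toy set the borderline point is classified correctly only when it helps define its own class centroid, which is the optimistic bias of the self-consistency test that the jackknife test removes.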

Table 3. Overall Success Prediction Rates for the 12 Subcellular Locations of Proteins by Different Algorithms and Test Methods

                                                                                          Test method
Algorithm                                  Input                                    Self-consistency (a)   Jackknife (a)        Independent data set (b)

Least Hamming distance                     Conventional amino acid composition      1067/2191 = 48.7%      1033/2191 = 47.2%    1151/2494 = 46.2%
(Chou, 1989)
Least Euclidean distance                   Conventional amino acid composition      1096/2191 = 50.0%      1063/2191 = 48.5%    1197/2494 = 48.0%
(Nakashima et al., 1986)
ProtLock (Cedano et al., 1997)             Conventional amino acid composition      1023/2191 = 46.7%       971/2191 = 44.3%    1018/2494 = 40.8%
Augmented covariant-discriminant           Pseudo amino acid composition            1785/2191 = 81.5%      1483/2191 = 67.7%    1842/2494 = 73.9%
(Chou, 2001)                               generated by digital signal processing

(a) Using the training data set taken from (Chou, 2001; Chou and Elrod, 1999b).
(b) Using the independent data set taken from (Chou, 2001; Chou and Elrod, 1999b).
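Each entry in Table 3 is a count of correctly predicted proteins divided by the size of the data set. As a small worked check, the percentages can be recomputed from the printed counts; the dictionary layout and the function name below are illustrative choices only.

```python
# (correct, total) counts copied from Table 3, in the column order
# self-consistency, jackknife, independent data set.
TABLE3 = {
    "Least Hamming distance": [(1067, 2191), (1033, 2191), (1151, 2494)],
    "Least Euclidean distance": [(1096, 2191), (1063, 2191), (1197, 2494)],
    "ProtLock": [(1023, 2191), (971, 2191), (1018, 2494)],
    "Augmented covariant-discriminant": [(1785, 2191), (1483, 2191), (1842, 2494)],
}


def rate(correct, total):
    """Overall success rate in percent, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


for name, cells in TABLE3.items():
    print(name, [rate(c, t) for c, t in cells])
```

Recomputed values can differ from the printed ones in the last digit; for instance, 1033/2191 evaluates to 47.1% rather than the 47.2% shown.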

The independent data set was also adopted from Chou and Elrod (1999b). However, for the same reason as mentioned above, of the 2591 independent proteins originally studied in Chou and Elrod (1999b), only 2494 proteins were retrieved. They are: 112 chloroplast, 761 cytoplasm, 19 cytoskeleton, 106 endoplasmic reticulum, 95 extracellular, 4 Golgi apparatus, 31 lysosome, 163 mitochondria, 418 nucleus, 23 peroxisome, and 762 plasma membrane proteins. None of these proteins occurs in the training data set. The predicted results thus obtained for the 2494 proteins in the independent data set are summarized in Table 3 as well, from which we can see that 1842 proteins were correctly predicted for their subcellular locations and only 652 proteins were incorrectly predicted. The overall success rate is 73.9%.

Furthermore, to facilitate comparison, the results obtained by some other algorithms using the conventional amino acid composition (Cedano et al., 1997; Chou, 1989; Nakashima et al., 1986) on the same data sets are also listed in Table 3. From the table we can see that the success rates obtained by the current approach are remarkably higher than those of the approaches based on the conventional amino acid composition alone, i.e., without taking into account the sequence order effects at all. Although the success rates obtained here are still lower than those obtained by Chou using different ranks of sequence-order-correlated factors to define the pseudo amino acid composition (Chou, 2001), the current approach has a unique feature that might be of use to complement Chou’s approach. The five pseudo amino acid components introduced here have a “global feature” regardless of the length of the proteins considered. In Chou’s approach, however, the sequence-order-correlated rank must be smaller than the length of the shortest protein chain in the data set considered. Therefore, when a protein data set contains many short chains, the advantage of the present approach will become more remarkable. Furthermore, as is well known to those working in the area of statistical prediction, an approach that has yielded a higher success rate than another for a given data set will not necessarily do so when applied to a different data set, particularly when its margin of superiority is not overwhelming (Chou and Zhang, 1995).

The goal of this study is not to determine the possible upper limit of the success rate for the prediction of protein subcellular location, but to propose a different approach to incorporate the sequence order effect. The results obtained in this study suggest that the stochastic signal processing approach is quite promising; at the least, it may play a complementary role to the other existing methods.

4. CONCLUSION

The development of an algorithm for predicting protein subcellular location generally consists of two cores: one is how to give a mathematical expression to effectively represent a protein and the other is how to
find an operational equation to effectively perform the prediction. The progression in expressing a protein from the classical 20-D amino acid composition vector (Chou, 1980; Chou and Zhang, 1993, 1994; Nakashima et al., 1986) to the (20 + λ)-D pseudo amino acid composition vector (Chou, 2000a, 2001, 2002) and to the functional domain approach (Chou and Cai, 2002) reflects the development of defining a protein in terms of different mathematical representations. The progression in conducting prediction from the simple geometric distance algorithm (Chou, 1980, 1989; Nakashima et al., 1986), to the Mahalanobis distance algorithm (Cedano et al., 1997; Chou, 1995; Chou and Zhang, 1994), to the covariant discriminant algorithm (Chou et al., 1998; Chou and Elrod, 1999a, 1999b; Liu and Chou, 1998; Zhou, 1998), and to the current SVM algorithm (Chou and Cai, 2002) reflects the development of computation by means of different mathematical operations. In this study, we have proposed a different approach, the digital signal processing approach, to define the pseudo amino acid composition of a protein, followed by using the augmented covariant discriminant algorithm to predict its subcellular location. Our results have further confirmed that the pseudo amino acid composition (Chou, 2001) is quite a promising vehicle for incorporating the sequence order effect, and that the augmented covariant discriminant algorithm (Chou, 2000a) is indeed a powerful technique for performing the prediction. One of the remarkable advantages of introducing the digital signal processing approach is that many well-established, sophisticated mathematical and engineering tools can be directly applied to various subareas of biology. For example, the current approach can be directly used to improve the prediction quality of various protein attributes, such as G-protein-coupled receptor types (Chou and Elrod, 2002; Elrod and Chou, 2002) and enzyme-family classes (Chou and Elrod, 2003).

ACKNOWLEDGMENTS

The authors would like to express their gratitude to the anonymous referees, whose constructive comments were very helpful for improving the presentation of this paper.

REFERENCES

Cai, Y. D. (2001). Is it a paradox or misinterpretation. Proteins: Struct. Funct. Genet. 43: 336–338.
Cai, Y. D., and Chou, K. C. (2000). Using neural networks for prediction of subcellular location of prokaryotic and eukaryotic proteins. Mol. Cell Biol. Res. Commun. 4: 172–173.
Cai, Y. D., Liu, X. J., Xu, X. B., and Chou, K. C. (2000). Support vector machines for prediction of protein subcellular location. Mol. Cell Biol. Res. Commun. 4: 230–233.
Cai, Y. D., Liu, X. J., Xu, X. B., and Chou, K. C. (2002a). Support vector machines for predicting membrane protein types by incorporating quasi-sequence-order effect. Internet Electron. J. Mol. Des. 1: 219–226.
Cai, Y. D., Liu, X. J., Xu, X. B., and Chou, K. C. (2002b). Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J. Cell. Biochem. 84: 343–348.
Candy, J. V. (1988). In: Signal Processing, McGraw-Hill, New York, pp. 21–98.
Cedano, J., Aloy, P., Pérez-Pons, J. A., and Querol, E. (1997). Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. 266: 594–600.
Chou, K. C. (1995). A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins: Struct. Funct. Genet. 21: 319–344.
Chou, K. C. (2000a). Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem. Biophys. Res. Commun. 278: 477–483.
Chou, K. C. (2000b). Review: Prediction of protein structural classes and subcellular locations. Curr. Protein Pept. Sci. 1: 171–208.
Chou, K. C. (2001). Prediction of protein cellular attributes using pseudo-amino-acid-composition. Proteins: Struct. Funct. Genet. 43: 246–255 (Erratum: Proteins: Struct. Funct. Genet. 44: 60, 2001).
Chou, K. C. (2002). A new branch of proteomics: Prediction of protein cellular attributes. In: Weinrer, P. W., and Lu, Q. (eds.), Gene Cloning and Expression Technologies (Chap. 4), Eaton Publishing, Westborough, MA, pp. 57–70.
Chou, K. C., and Cai, Y. D. (2002). Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277: 45765–45769.
Chou, K. C., and Elrod, D. W. (1998). Using discriminant function for prediction of subcellular location of prokaryotic proteins. Biochem. Biophys. Res. Commun. 252: 63–68.
Chou, K. C., and Elrod, D. W. (1999a). Prediction of membrane protein types and subcellular locations. Proteins: Struct. Funct. Genet. 34: 137–153.
Chou, K. C., and Elrod, D. W. (1999b). Protein subcellular location prediction. Protein Eng. 12: 107–118.
Chou, K. C., and Elrod, D. W. (2002). Bioinformatical analysis of G-protein-coupled receptors. J. Proteome Res. 1: 429–433.
Chou, K. C., and Elrod, D. W. (2003). Prediction of enzyme family classes. J. Proteome Res. 2: 183–190.
Chou, K. C., and Zhang, C. T. (1993). A new approach to predicting protein folding types. J. Protein Chem. 12: 169–178.
Chou, K. C., and Zhang, C. T. (1994). Predicting protein folding types by distance functions that make allowances for amino acid interactions. J. Biol. Chem. 269: 22014–22020.
Chou, K. C., and Zhang, C. T. (1995). Review: Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 30: 275–349.
Chou, K. C., Liu, W., Maggiora, G. M., and Zhang, C. T. (1998). Prediction and classification of domain structural classes. Proteins: Struct. Funct. Genet. 31: 97–103.
Chou, P. Y. (1980). Amino acid composition of four classes of proteins. Abstracts of Papers, Part I, Second Chemical Congress of the North American Continent, Las Vegas.
Chou, P. Y. (1989). Prediction of protein structural classes from amino acid composition. In: Fasman, G. D. (ed.), Prediction of Protein Structure and the Principles of Protein Conformation, Plenum Press, New York, pp. 549–586.
Elrod, D. W., and Chou, K. C. (2002). A study on the correlation of G-protein-coupled receptor types with amino acid composition. Protein Eng. 15: 713–715.
Jones, N. B. (1982). In: Digital Signal Processing, Peter Peregrinus Ltd., London, UK, pp. 139–161.
Liu, W., and Chou, K. C. (1998). Prediction of protein structural classes by modified Mahalanobis discriminant algorithm. J. Protein Chem. 17: 209–217.
Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proc. Natl. Inst. Sci. India 2: 49–55.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). In: Multivariate Analysis, Academic Press, London, pp. 322, 381.
Nakashima, H., and Nishikawa, K. (1994). Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol. 238: 54–61.
Nakashima, H., Nishikawa, K., and Ooi, T. (1986). The folding type of a protein is relevant to the amino acid composition. J. Biochem. 99: 152–162.
Pillai, K. C. S. (1985). Mahalanobis D². In: Kotz, S., and Johnson, N. L. (eds.), Encyclopedia of Statistical Sciences (Vol. 5), John Wiley & Sons, New York, pp. 176–181.
Reinhardt, A., and Hubbard, T. (1998). Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res. 26: 2230–2236.
Tretter, A. S. (1990). In: Introduction to Discrete-Time Signal Processing, John Wiley & Sons, pp. 276–280.
Zhou, G. P. (1998). An intriguing controversy over protein structural class prediction. J. Protein Chem. 17: 729–738.
Zhou, G. P., and Assa-Munt, N. (2001). Some insights into protein structural class prediction. Proteins: Struct. Funct. Genet. 44: 57–59.
Zhou, G. P., and Doctor, K. (2003). Subcellular location prediction of apoptosis proteins. Proteins: Struct. Funct. Genet. 50: 44–48.
