Abstract- Prediction of the subcellular localization of proteins is an important step in genome annotation and in the search for novel drug targets. Conducting experiments to extract information about protein subcellular localization is both time consuming and costly. Machine learning approaches, especially ensembles of classifiers, that provide an efficient and reliable mechanism for computational prediction are thus highly desired. In this context, we propose a modification to the approach proposed in [K. C. Chou, J. Cell. Biol. 99(2006)517]. We use a weighted polling method to fuse the outputs of individual Covariant Discriminant Classifiers. The individual classifiers are trained on features based on the pseudo-amino acid composition of proteins. Three methods of verification, namely the re-substitution, jackknife, and independent data set tests, have been employed and give overall accuracies of 87.13%, 71.15% and 74.90% respectively. The predicted accuracies are higher than those of the existing schemes.

I. INTRODUCTION

The cell is the structural and functional unit of all living organisms, and is sometimes called the "building block of life". Humans are multicellular organisms and are estimated to have $10^{14}$ (100 trillion) cells [1]. A cell contains approximately $10^{9}$ protein molecules in its different compartments or organelles [2], and each of these proteins has a role to play. The different organelles of the cell include cell wall, cell membrane, cytoplasm, carbohydrates, chloroplast, vacuoles, centrioles, endoplasmic reticulum, Golgi apparatus, mitochondria, lysosomes, peroxisomes, cytoskeletons and nucleus. Details about the structure and functions of these organelles can be found in [3]. The different organelles inside the cell are shown in Fig. 1.

Each organelle has a function to perform that contributes to the function of the cell. Many of the functions associated with these organelles are performed through proteins. If the subcellular localization (organelle) of a protein is predicted, then the function of that subcellular location can be predicted, which contributes to the overall function prediction of the cell. Accordingly, the significance of identifying the subcellular localization of an uncharacterized protein has become self-evident.

It is time consuming and costly to extract information about protein subcellular localization by conducting various biochemical experiments. During the last decade, the number of new protein sequences has increased more than 50-fold. For instance, the "Swiss-Prot" protein database [4] consisted of 3,939 protein sequences in 1986, but this had jumped to 201,594 sequences by December 2005. Owing to such a large number of newly found proteins, one needs to devise an automated method for the efficient and reliable annotation of the subcellular localization of uncharacterized proteins. The knowledge that evolves out of this process can prove handy in drug discovery.

A number of techniques have been employed in the past for the prediction of protein subcellular localization. For this purpose, many methods have been employed in [5-15], which have used techniques based on Support Vector Machines (SVM), the Covariant Discriminant Classifier (CDC), Neural Networks, etc. The method used in [16] employs a fusion of different CDCs, combining their results using a polling method. This method has shown good results compared to the results in [5-15]. The method described in the current study improves on the one proposed in [16] by introducing weighted polling, which helps in better utilization of the decision space. The rest of the paper is organized as follows: section II describes our proposed technique, results and discussion are presented in section III, and section IV concludes the paper by proposing some future work.

II. METHODS

A. Data Set

For the prediction of protein subcellular localization, the same training and testing data sets have been used as in [15]. These sets consist of 3799 and 4498 sequences respectively, belonging to 14 organelles inside the cell that contain protein molecules. The number of protein sequences belonging to each of the 14 organelles in the training and testing data sets is given in TABLE 1. The codes of the datasets are available at: http://www.interscience.wiley.com/jpages/0730-2312/suppmat.

B. Amphiphilic Pseudo Amino Acid Composition
Conventional amino acid composition uses only the frequency of occurrence of each amino acid in the protein sequence, or the dipeptide composition of proteins, to obtain a fixed-length vector for each protein on which a classifier can be trained or tested. Unlike the conventional composition, the amphiphilic pseudo amino acid (APAA) composition approach of [16-18] is adopted in the current study. Moreover, instead of using a fixed dimension for the APAA composition, multiple dimensions of APAA have been considered to achieve multi-classification.

Fig. 1. Schematic illustration to show the 14 subcellular locations of proteins: (1) cell wall, (2) centriole, (3) chloroplast, (4) cytoplasm, (5) cytoskeleton, (6) endoplasmic reticulum, (7) extracellular, (8) Golgi apparatus, (9) lysosome, (10) mitochondria, (11) nucleus, (12) peroxisome, (13) plasma membrane, and (14) vacuole. Note that the cell wall, chloroplast, and vacuole proteins exist only in a plant cell, while the centriole proteins exist only in an animal cell. Reproduced from [16] with permission.

TABLE 1. Number of Training & Testing Sequences belonging to each Subcellular Organelle

Organelle                 Training   Testing
Cell Membrane                   69        24
Cell Wall                       71        35
Centrioles                      65         4
Chloroplast                    316       855
Cytoplasm                     1113       186
Cytoskeletons                  249       131
Endoplasmic Reticulum          289       136
Extracellular                  393      1252
Golgi Apparatus                 90        41
Lysosomes                      123        57
Mitochondria                   389       762
Nucleus                        399       914
Peroxisomes                    147        84
Vacuoles                        86        17
Total                         3799      4498

A protein sequence P of L amino acid residues, where L is the length of the protein sequence, can be represented as:

$R_1 R_2 R_3 R_4 R_5 \cdots R_L$   (1)

where $R_1$ represents the amino acid at position 1 and $R_L$ the amino acid at position L. Its respective APAA composition is given in (2). For further details about the APAA composition, see Appendix A in [16].

$P = [\, p_1 \;\; p_2 \;\; \cdots \;\; p_{20} \;\; \cdots \;\; p_{\Lambda} \,]$   (2)

where $\Lambda = 20 + 2\lambda$ ($\lambda$ is the number of tiers used in the amphiphilic pseudo amino acid composition, with $\lambda = 0, 1, 2, \ldots, 21$). In our case a maximum of $\lambda = 21$ is used, which makes the dimension of P equal to $\Lambda = 20 + 2 \times 21 = 62$. The elements $p_1, p_2, \ldots, p_{20}$ are the frequencies of occurrence of the 20 amino acids. The elements $p_{21}, p_{22}, \ldots, p_{\Lambda}$ are the 1st- to $\lambda$th-tier correlation factors of the amino acid sequence in the protein chain, determined from hydrophobicity and hydrophilicity.

Many helices of proteins, which are formed by hydrophobic and hydrophilic order patterns, are demonstrated in the 'wenxiang' diagram in Fig. 2. Different proteins have different amphiphilic features corresponding to different hydrophobic and hydrophilic patterns. Fig. 2, a1 and a2, show the 1st-tier correlation factors that reflect the sequence-order correlations between all the most contiguous amino acids along a protein chain through hydrophobicity and hydrophilicity, respectively. Similarly, Fig. 2, b1 and b2, show the 2nd-tier correlation factors that reflect the sequence-order correlations between all the 2nd most contiguous amino acids along a protein chain through hydrophobicity and hydrophilicity, respectively, and so forth.

C. Covariant Discriminant Classifier

In our approach, the dimension of the APAA vector P is allowed to vary. As we are using a multiple-classification approach in the current study, a different dimension of the vector P is given to the classifier each time. For $\lambda = 0$, the classifier is trained on a 20-dimensional vector ($\Lambda = 20 + 2 \times 0 = 20$). For $\lambda = 1$, the classifier is trained on a 22-dimensional vector, and so forth up to $\lambda = 21$. This makes a total of 22 classifications of the same data, each with a different composition, on the same classifier; equivalently, 22 classifiers whose outputs can be fused together.

Given a dataset S of N proteins classified into M cellular attributes or classes, it can be generally formulated in terms of the union of M classes, that is:

$S = S_1 \cup S_2 \cup S_3 \cup S_4 \cup \cdots \cup S_M$   (3)

where each class $S_m$ ($m = 1, 2, \ldots, M$) is composed of proteins with the same cellular attribute and its size (the number of proteins therein) is $N_m$. Obviously, we have $N = N_1 + N_2 + \cdots + N_M$. According to (2), the kth protein in the class $S_m$ is formulated by:

$P_m^k = [\, p_{m,1}^k \;\; p_{m,2}^k \;\; \cdots \;\; p_{m,20}^k \;\; \cdots \;\; p_{m,\Lambda}^k \,]$   (4)
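The dimension schedule described in this subsection ($\Lambda = 20 + 2\lambda$ for $\lambda = 0, 1, \ldots, 21$) can be checked with a few lines of Python. This is only an illustrative sketch; the names `apaa_dimension`, `NUM_AMINO_ACIDS` and `MAX_TIER` are hypothetical and do not come from the paper.

```python
# Illustrative sketch: enumerate the APAA vector dimensions fed to the
# 22 individual classifiers. All names here are hypothetical.
NUM_AMINO_ACIDS = 20  # classical composition components p_1 .. p_20
MAX_TIER = 21         # largest lambda used in the text


def apaa_dimension(tier: int) -> int:
    """Return Lambda = 20 + 2 * lambda for a given number of tiers."""
    if not 0 <= tier <= MAX_TIER:
        raise ValueError(f"tier must be in 0..{MAX_TIER}, got {tier}")
    # Each tier adds one hydrophobicity and one hydrophilicity factor.
    return NUM_AMINO_ACIDS + 2 * tier


dimensions = [apaa_dimension(tier) for tier in range(MAX_TIER + 1)]
print(len(dimensions), dimensions[0], dimensions[1], dimensions[-1])
# prints: 22 20 22 62
```

Each entry of `dimensions` corresponds to one individual classifier, which is why the ensemble contains 22 members.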
The Mahalanobis distances are compared, and the class with the minimum distance is the target class $S_u$ of the query protein P:

$M(P, P_u) = \min\{\, M(P, P_1),\; M(P, P_2),\; \ldots,\; M(P, P_M) \,\}, \quad u = 1, 2, \ldots, M$   (11)

Fig. 3. A schematic diagram showing the prediction process for a protein using an ensemble of classifiers.
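The minimum-distance rule of (11) can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses the plain squared Mahalanobis distance to each class mean, whereas the discriminant function actually used may contain further terms that are not reproduced here. The pseudo-inverse is also an assumption, chosen because composition-type features can make the covariance matrix singular. The function names and the toy data are hypothetical.

```python
import numpy as np


def fit_class_stats(X, y):
    """Return {label: (mean vector, pseudo-inverse covariance)} per class."""
    stats = {}
    for label in np.unique(y):
        members = X[y == label]
        mean = members.mean(axis=0)
        cov = np.atleast_2d(np.cov(members, rowvar=False))
        # Assumption: pseudo-inverse, so a singular covariance is tolerated.
        stats[label.item()] = (mean, np.linalg.pinv(cov))
    return stats


def mahalanobis_sq(p, mean, inv_cov):
    """Squared Mahalanobis distance between vector p and a class mean."""
    diff = p - mean
    return float(diff @ inv_cov @ diff)


def predict(p, stats):
    """Pick the class whose distance to p is smallest, as in (11)."""
    return min(stats, key=lambda label: mahalanobis_sq(p, *stats[label]))


# Toy 2-D data standing in for APAA vectors (purely illustrative).
X = np.array([[0.0, 0.0], [1.0, 0.2], [0.2, 1.0], [1.0, 1.1],
              [5.0, 5.0], [6.0, 5.1], [5.2, 6.0], [6.1, 6.2]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
stats = fit_class_stats(X, y)
print(predict(np.array([0.5, 0.5]), stats))  # query near the first cluster
print(predict(np.array([5.5, 5.5]), stats))  # query near the second cluster
```

In the ensemble setting, one such classifier would be fitted per APAA dimension and the individual predictions fused afterwards.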
... the classifier's prediction rate for a particular class is not good, it still has the same contribution to the predicted result, $Y_j$ (16).

In [16], equal weight is given to the prediction of each classifier when fusing the outputs of the individual classifiers. This, however, does not efficiently exploit the decision space of each individual classifier. In contrast, in our approach we use the MCC to assign a weight to the prediction of each classifier for each of the classes. This gives more weight to a classifier trained on a more discriminative feature space. Therefore, during polling, each classifier participates in the polling process according to its MCC, which yields better overall performance. The results show that good performance is achieved by adopting this weighted-polling approach for each classifier.

As a final step, the query protein P is predicted to belong to the class for which its score using (16) is the highest; that is ...

... 3,316 are correctly predicted and only 483 protein sequences are mispredicted, giving an overall accuracy of 87.29 % for the 14 subcellular locations. The results are compared, in terms of the total number of correct predictions and accuracy, with the techniques adopted in [6, 10, 16, 18], and show that the accuracy of the proposed scheme improves over the results in [6, 10]. The improvement in accuracy over the results in [6, 10] is 44.29 % and 19.39 % respectively. The accuracy has been improved by 1.89 % and 0.89 % compared to that of the techniques used in [18] and [16] respectively. The results are shown in TABLE 2.

B. Results by Independent Dataset Test

The suggested approach is also validated on an independent data set. The testing data set, containing 4,498 protein sequences, has also been predicted using the rule parameters derived from the 3,799 protein sequences in the training ...
TABLE 2

Method                                  Input form                         Re-substitution        Jackknife              Independent data set
                                                                           Correct    Accuracy    Correct    Accuracy    Correct    Accuracy
ProtLock [5]                            Amino acid composition             1,655      43.60 %     1,614      42.50 %     1,829      40.70 %
Covariant discriminant [9]              Amino acid composition             2,580      67.90 %     2,339      61.60 %     2,751      61.20 %
Augmented covariant discriminant [23]   Amino acid composition a           3,245      85.40 %     2,574      67.80 %     3,246      72.20 %
Ensemble classifier [21]                Pseudo amino acid composition b    3,280      86.40 %     2,666      70.20 %     3,331      74.10 %
Weighted ensemble classifier            Pseudo amino acid composition b    3,316      87.30 %     2,704      71.15 %     3,377      75.07 %

a: The series mode [21] was used to calculate the pseudo amino acid composition with $\Lambda = 20 + \lambda + t = 20 + 13 + 13 = 46$.
b: The amphiphilic mode (Appendix A) was used to calculate the pseudo amino acid composition with $\{\Lambda\} = \{\Lambda_1, \Lambda_2, \ldots, \Lambda_{22}\} = \{20, 22, \ldots, 62\}$.
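The difference between equal-weight polling and the weighted polling compared in the table can be sketched as follows. This is a hedged illustration, not the paper's exact score (16), which is not reproduced here. It assumes that MCC denotes the Matthews correlation coefficient and that the score of a class is the sum of the per-class weights of the classifiers voting for it. The function names, class labels and weight values are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for one class (0.0 if undefined)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def weighted_poll(predictions, weights):
    """Fuse votes: predictions[i] is the class chosen by classifier i,
    weights[i][c] is the weight of classifier i for class c (e.g. its MCC)."""
    scores = defaultdict(float)
    for i, cls in enumerate(predictions):
        # Assumption: a vote counts with the voter's weight for that class.
        scores[cls] += weights[i].get(cls, 0.0)
    winner = max(scores, key=scores.get)
    return winner, dict(scores)


# Hypothetical example with three classifiers and made-up weights.
predictions = ["nucleus", "cytoplasm", "cytoplasm"]
weights = [
    {"nucleus": 0.9, "cytoplasm": 0.5},
    {"nucleus": 0.6, "cytoplasm": 0.3},
    {"nucleus": 0.7, "cytoplasm": 0.4},
]
winner, scores = weighted_poll(predictions, weights)
print(winner)  # prints: nucleus
```

With equal weights the two "cytoplasm" votes would win; here the single vote from a classifier that is reliable for "nucleus" outweighs them, which is the effect the weighted scheme aims for.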
TABLE 3. The Jackknife Success Rate Obtained by Each of the 22 Individual Classifiers; see (13)
Classifier a   Dimension b   Correct Predictions d   Accuracy      Classifier a   Dimension b   Correct Predictions d   Accuracy