Prediction of Protein Sub-Cellular: Localization Through Weighted Combination of Classifiers

Prediction of Protein Sub-Cellular Localization
through Weighted Combination of Classifiers

M. Fayyaz*, A. Mujahid*, A. Khan** and A. Bangash*
* Faculty of Computer Science & Engineering,
Ghulam Ishaq Khan (GIK) Institute of Engineering Science & Technology, Swabi, Pakistan
** Department of Mechatronics,
Gwangju Institute of Science and Technology, 1 Oryong-Dong, Buk-Gu, Gwangju 500-712, Republic of Korea
mudassir.fayyaz@gmail.com, asifullah@gist.ac.kr
Abstract- Prediction of the sub-cellular localization of proteins is an important step in genome annotation and in the search for novel drug targets. Conducting experiments to extract information about protein sub-cellular localization is both time consuming and costly. Machine learning approaches, especially ensembles of classifiers, that provide an efficient and reliable mechanism of computational prediction are thus highly desired. In this context, we propose a modification to the approach proposed in [K. C. Chou, J. Cell. Biochem. 99 (2006) 517]. We use a weighted polling method to fuse the outputs of individual Covariant Discriminant Classifiers. The individual classifiers are trained on features based on the pseudo amino acid composition of proteins. Three methods of verification (re-substitution, jackknife, and independent data set tests) have been employed and give overall accuracies of 87.13%, 71.15% and 74.90% respectively. The predicted accuracies are higher than those of the existing schemes.

I. INTRODUCTION

The cell is the structural and functional unit of all living organisms, and is sometimes called the "building block of life". Humans are multi-cellular organisms and are estimated to have $10^{14}$ (100 trillion) cells [1]. A cell contains approximately $10^{9}$ protein molecules in its different compartments or organelles [2], and each of these proteins has a role to play. The different organelles of the cell include cell wall, cell membrane, cytoplasm, carbohydrates, chloroplast, vacuoles, centrioles, endoplasmic reticulum, Golgi apparatus, mitochondria, lysosomes, peroxisomes, cytoskeletons and nucleus. The details about the structure and functions of these organelles can be found in [3]. The different organelles inside the cell are shown in Fig. 1.

Each organelle has a function to perform that contributes to the function of the cell. Many of the functions associated with these organelles are performed through proteins. If the sub-cellular localization (organelle) of a protein is predicted, then the function of that protein can be inferred, which contributes to the overall function prediction of the cell. Accordingly, the significance of identifying the sub-cellular localization of an uncharacterized protein has become self-evident.

It is time consuming and costly to extract information about protein sub-cellular localization by conducting various biochemical experiments. During the last decade, the number of new protein sequences has increased more than 50-fold. For instance, the "Swiss-Prot" protein database [4] consisted of 3,939 protein sequences in 1986, but this had jumped to 201,594 sequences by December 2005. Owing to such a large number of newly found proteins, one needs to devise an automated method for efficiently and reliably annotating the sub-cellular localization of uncharacterized proteins. The knowledge gained from this process can prove handy in drug discovery.

A number of techniques have been employed in the past for the prediction of protein sub-cellular localization. For this purpose, many methods have been employed in [5-15], which have used techniques based on Support Vector Machines (SVM), the Covariant Discriminant Classifier (CDC), Neural Networks, etc. The method used in [16] fuses different CDCs by combining their results using a polling method. This method has shown good results as compared to the results in [5-15]. The method described in the current study improves on the one proposed in [16] by introducing weighted polling, which helps in better utilization of the decision space. The rest of the paper is organized as follows: section II describes our proposed technique, results and discussion are presented in section III, and section IV concludes the paper by proposing some future work.

II. METHODS

A. Data Set

For the prediction of protein sub-cellular localization, the same training and testing data sets have been used as in [15]. These sets consist of 3,799 and 4,498 sequences respectively, belonging to 14 organelles inside the cell containing protein molecules. The number of protein sequences belonging to each of the 14 organelles in the training and testing data sets is given in TABLE 1. The codes of the datasets are available at:
http://www.interscience.wiley.com/jpages/0730-2312/suppmat.

B. Amphiphilic Pseudo Amino Acid Composition

Conventional amino acid composition approaches use only the
frequency of occurrence of each amino acid in the protein sequence, or use the dipeptide composition of proteins, to obtain a fixed-length vector for each protein with which to train or test a classifier. Unlike these conventional compositions, the amphiphilic pseudo amino acid (APAA) composition approach of [16-18] is adopted in the current study. Moreover, instead of using a fixed dimension for the APAA composition, multiple dimensions of APAA have been considered to achieve multi-classification.

TABLE 1. Number of Training & Testing Sequences belonging to each Sub-cellular Organelle

Organelle                 Training   Testing
Cell Membrane                   69         2
Cell Wall                       71        35
Centrioles                      65         4
Chloroplast                    316       855
Cytoplasm                    1,113       186
Cytoskeletons                  249       131
Endoplasmic Reticulum          289       136
Extracell                      393     1,252
Golgi Apparatus                 90        41
Lysosomes                      123        57
Mitochondria                   389       762
Nucleus                        399       914
Peroxisomes                    147        84
Vacuoles                        86        17
Total                        3,799     4,498

Fig. 1. Schematic illustration to show the 14 subcellular locations of proteins: (1) cell wall, (2) centriole, (3) chloroplast, (4) cytoplasm, (5) cytoskeleton, (6) endoplasmic reticulum, (7) extracellular, (8) Golgi apparatus, (9) lysosome, (10) mitochondria, (11) nucleus, (12) peroxisome, (13) plasma membrane, and (14) vacuole. Note that the cell wall, chloroplast, and vacuole proteins exist only in a plant cell, while the centriole proteins exist only in an animal cell. Reproduced from [16] with permission.

A protein P consisting of L amino acid residues, where L is the length of the protein sequence, can be represented as:

$$R_1 R_2 R_3 R_4 R_5 \cdots R_L \tag{1}$$

where $R_1$ represents the amino acid residue at position 1 and $R_L$ the residue at position L. Its APAA composition is given in (2). For further details about the APAA composition, see Appendix A in [16].

$$\mathbf{P} = [\,p_1 \;\; p_2 \;\; \cdots \;\; p_{20} \;\; \cdots \;\; p_{\Lambda}\,]^{T} \tag{2}$$

where $\Lambda = 20 + 2\lambda$ ($\lambda$ is the number of tiers used in the amphiphilic pseudo amino acid composition, with $\lambda = 0, 1, 2, \ldots, 21$). In our case the maximum value $\lambda = 21$ is used, which makes the dimension of P equal to $\Lambda = 20 + 2 \times 21 = 62$. The elements $p_1, p_2, \ldots, p_{20}$ are the frequencies of occurrence of the 20 amino acids. The elements $p_{21}, p_{22}, \ldots, p_{\Lambda}$ are the 1st-tier to $\lambda$-tier correlation factors of the amino acid sequence in the protein chain, determined on the basis of hydrophobicity and hydrophilicity.

Many helices of proteins are formed by hydrophobic and hydrophilic order patterns, as demonstrated in the 'wenxiang' diagram in Fig. 2. Different proteins have different amphiphilic features corresponding to different hydrophobic and hydrophilic patterns. Panels a1 and a2 of Fig. 2 show the 1st-tier correlation factors, which reflect the sequence-order correlations between all the most contiguous amino acids along a protein chain through hydrophobicity and hydrophilicity, respectively. Similarly, panels b1 and b2 of Fig. 2 show the 2nd-tier correlation factors, which reflect the sequence-order correlations between all the 2nd most contiguous amino acids along a protein chain through hydrophobicity and hydrophilicity, respectively, and so forth.

C. Covariant Discriminant Classifier

In our approach, the dimension of the APAA vector P is allowed to vary. As we are using a multiple classification approach in the current study, we give a different dimension of the vector P to the classifier each time. For $\lambda = 0$, the classifier is trained on a 20-dimension vector ($\Lambda = 20 + 2 \times 0 = 20$). For $\lambda = 1$, the classifier is trained on a 22-dimension vector, and so forth up to $\lambda = 21$. This makes a total of 22 classifications of the same data on the same classifier, each with a different composition, or simply 22 classifiers that can be fused together.

Given a dataset S of N proteins classified into M cellular attributes or classes, S can be formulated as the union of the M classes, that is:

$$S = S_1 \cup S_2 \cup S_3 \cup \cdots \cup S_M \tag{3}$$

where each class $S_m$ ($m = 1, 2, \ldots, M$) is composed of proteins with the same cellular attribute and its size (the number of proteins therein) is $N_m$. Obviously, we have $N = N_1 + N_2 + \cdots + N_M$. According to (2), the kth protein in the class $S_m$ is formulated by:

$$\mathbf{P}_m^k = [\,p_{m,1}^k \;\; p_{m,2}^k \;\; \cdots \;\; p_{m,20}^k \;\; \cdots \;\; p_{m,\Lambda}^k\,]^{T} \tag{4}$$
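The fixed part of the feature vector in (2) and the dimension schedule $\Lambda = 20 + 2\lambda$ can be sketched as follows. This is a minimal illustration only: the helper names `amino_acid_composition` and `apaa_dimensions`, the ordering of the 20 one-letter codes, and the toy sequence are assumptions made here, and the tier correlation factors $p_{21}, \ldots, p_{\Lambda}$ (defined in Appendix A of [16]) are not computed.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (this ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20 occurrence frequencies p_1..p_20 of a protein sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def apaa_dimensions(max_tier=21):
    """Return the vector sizes 20 + 2 * tier for tier = 0..max_tier."""
    return [20 + 2 * tier for tier in range(max_tier + 1)]


if __name__ == "__main__":
    freqs = amino_acid_composition("MKVLAAGIVGL")  # toy sequence, not from the data set
    dims = apaa_dimensions()
    print(len(freqs), len(dims), dims[0], dims[-1])
```

With the default `max_tier=21`, `apaa_dimensions()` yields the 22 vector sizes 20, 22, ..., 62, one per individual classifier.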
Fig. 2. A schematic drawing showing the amphiphilic correlation along a protein chain. The correlation via hydrophobicity is shown in red, while the correlation via hydrophilicity is shown in blue. Panel a1/a2 reflects the 1st-tier coupling mode between all the most contiguous residues, panel b1/b2 reflects the 2nd-tier coupling mode between all the second most contiguous residues, and panel c1/c2 reflects the 3rd-tier coupling mode between all the third most contiguous residues. [Color figure can be viewed in the online issue, which is available at www.interscience.wiley.com.]

The standard protein vector for class $S_m$ is given by:

$$\bar{\mathbf{P}}_m = [\,\bar{p}_{m,1} \;\; \bar{p}_{m,2} \;\; \cdots \;\; \bar{p}_{m,20} \;\; \cdots \;\; \bar{p}_{m,\Lambda}\,]^{T} \tag{5}$$

$$\bar{p}_{m,i} = \frac{1}{N_m} \sum_{k=1}^{N_m} p_{m,i}^k \qquad (i = 1, 2, \ldots, \Lambda) \tag{6}$$

The classifier used in the current study is the covariant discriminant classifier, which is based on the calculation of the Mahalanobis distance [20-22] from the mean of each class in feature space. The similarity between the query protein P and each standard protein vector $\bar{\mathbf{P}}_m$ in (5) is defined by the following covariant discriminant function:

$$M(\mathbf{P}, \bar{\mathbf{P}}_m) = D^2_{Mah}(\mathbf{P}, \bar{\mathbf{P}}_m) + \ln |C_m| \qquad (m = 1, 2, \ldots, M) \tag{7}$$

$$D^2_{Mah}(\mathbf{P}, \bar{\mathbf{P}}_m) = (\mathbf{P} - \bar{\mathbf{P}}_m)^{T} \, C_m^{-1} \, (\mathbf{P} - \bar{\mathbf{P}}_m) \tag{8}$$

where T symbolizes the transpose and $D^2_{Mah}(\mathbf{P}, \bar{\mathbf{P}}_m)$ is the squared Mahalanobis distance between the query protein P and $\bar{\mathbf{P}}_m$. $C_m$ is the covariance matrix of class $S_m$ and is given by:

$$C_m = \begin{bmatrix} c^m_{1,1} & c^m_{1,2} & \cdots & c^m_{1,\Lambda} \\ c^m_{2,1} & c^m_{2,2} & \cdots & c^m_{2,\Lambda} \\ \vdots & \vdots & \ddots & \vdots \\ c^m_{\Lambda,1} & c^m_{\Lambda,2} & \cdots & c^m_{\Lambda,\Lambda} \end{bmatrix} \tag{9}$$

$$c^m_{i,j} = \frac{1}{N_m - 1} \sum_{k=1}^{N_m} \left(p_{m,i}^k - \bar{p}_{m,i}\right)\left(p_{m,j}^k - \bar{p}_{m,j}\right) \qquad (i, j = 1, 2, \ldots, \Lambda) \tag{10}$$

where $C_m^{-1}$ is the inverse of the covariance matrix and $|C_m|$ is its determinant. All the calculated discriminant values are compared, and the class with the minimum value is the target class $S_\mu$ of the query protein P:

$$M(\mathbf{P}, \bar{\mathbf{P}}_\mu) = \mathrm{Min}\{M(\mathbf{P}, \bar{\mathbf{P}}_1), M(\mathbf{P}, \bar{\mathbf{P}}_2), \ldots, M(\mathbf{P}, \bar{\mathbf{P}}_M)\} \qquad (\mu = 1, 2, \ldots, M) \tag{11}$$

D. Ensemble Classifiers

We have used $Q = \lambda_{max} + 1$ classifiers for our study ($\lambda = 0, 1, 2, \ldots, 21$ and $\lambda_{max} = 21$). With the same training set and classifier, but with different dimensions of the training data, i.e. different values of $\Lambda$ ($\Lambda = 20 + 2\lambda$), different prediction results are obtained. When $\lambda = 0$, classifier 1 is trained on the training data set with dimension $\Lambda = 20$. When $\lambda = 1$, classifier 2 is trained on the training data set with dimension $\Lambda = 22$, and so forth.

Suppose

$$\{\Lambda\} = \{\Lambda_1, \Lambda_2, \ldots, \Lambda_Q\} \tag{12}$$

represents the possible dimensions of the training data that are given to an individual classifier. The corresponding set of individual classifiers is given in (13) as:

$$\{CD(\Lambda)\} = \{CD(\Lambda_1), CD(\Lambda_2), \ldots, CD(\Lambda_Q)\} \tag{13}$$

Each query protein P is predicted by each of the Q = 22 classifiers, after which we need a mechanism for fusing the outputs of all classifiers to yield a better prediction rate. This is expressed as:

$$\mathbb{C} = CD(\Lambda_1) \otimes CD(\Lambda_2) \otimes \cdots \otimes CD(\Lambda_Q) \tag{14}$$

where $\mathbb{C}$ is the fused classifier formed by the fusion of the individual classifiers $CD(\Lambda_1), CD(\Lambda_2), \ldots, CD(\Lambda_Q)$, and $\otimes$ is the fusion operator. The process of fusing these classifiers is illustrated in Fig. 3.

The ensemble classifier works as follows. Suppose $Q_1, Q_2, Q_3, \ldots, Q_Q$ are the predictions of the Q = 22 classifiers for the query protein P, and that these predictions belong to the classes $S_1, S_2, S_3, \ldots, S_M$ as given by:

$$\{Q_1, Q_2, \ldots, Q_Q\} \in \{S_1, S_2, \ldots, S_M\} \tag{15}$$

The voting score of the query protein P belonging to the jth class is given by:

$$Y_j = \sum_{i=1}^{Q} mcc_{i,j} \, \Delta(Q_i, S_j) \qquad (j = 1, 2, \ldots, M) \tag{16}$$

where the delta function in (16) is given by:

$$\Delta(Q_i, S_j) = \begin{cases} 1, & \text{if } Q_i \in S_j \\ 0, & \text{otherwise} \end{cases} \tag{17}$$

The delta function gives the prediction matrix of 1's and 0's of size Q x M (Q = total number of classifiers used and M = total number of classes used for classification). MCC is the weight matrix of size Q x M. After the training of the individual classifiers on all of the classes, we obtain the MCC [23] from the results obtained through each classifier for each class m ($m = 1, 2, \ldots, M$).
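The covariant discriminant function in (5)-(11) and the voting score in (16)-(17) can be sketched together in NumPy as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name `CovariantDiscriminant`, the helper `voting_scores`, the synthetic data and the small weight matrix are all hypothetical, each class covariance matrix is assumed to be invertible, and the covariance is normalised by $1/(N_m - 1)$ as done by `np.cov`.

```python
import numpy as np


class CovariantDiscriminant:
    """Sketch of a covariant discriminant classifier following (5)-(11)."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.means_, self.inv_covs_, self.log_dets_ = [], [], []
        for label in self.classes_:
            Xm = X[y == label]
            cov = np.cov(Xm, rowvar=False)          # class covariance, 1/(N_m - 1)
            sign, log_det = np.linalg.slogdet(cov)  # ln|C_m| computed stably
            if sign <= 0:
                raise ValueError("covariance matrix must be positive definite")
            self.means_.append(Xm.mean(axis=0))     # standard vector of the class
            self.inv_covs_.append(np.linalg.inv(cov))
            self.log_dets_.append(log_det)
        return self

    def discriminant(self, p):
        """Return the discriminant value of (7) for every class."""
        p = np.asarray(p, dtype=float)
        values = []
        for mean, inv_cov, log_det in zip(self.means_, self.inv_covs_, self.log_dets_):
            diff = p - mean
            values.append(diff @ inv_cov @ diff + log_det)  # squared distance + ln|C_m|
        return np.array(values)

    def predict(self, p):
        """Return the class with the minimum discriminant value, as in (11)."""
        return self.classes_[np.argmin(self.discriminant(p))]


def voting_scores(predictions, class_labels, weights):
    """Return the scores Y_j of (16) from Q predictions and a Q x M weight matrix."""
    predictions = np.asarray(predictions)
    class_labels = np.asarray(class_labels)
    # Q x M matrix of 1's and 0's, i.e. the delta function of (17).
    delta = (predictions[:, None] == class_labels[None, :]).astype(float)
    return (np.asarray(weights, dtype=float) * delta).sum(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic, well-separated classes in three dimensions (illustration only).
    X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(5.0, 1.0, (50, 3))])
    y = np.array([0] * 50 + [1] * 50)
    clf = CovariantDiscriminant().fit(X, y)
    print(clf.predict([5.0, 5.0, 5.0]))

    # Three hypothetical classifiers vote 0, 1, 1 with made-up per-class weights.
    weights = np.array([[0.9, 0.2], [0.1, 0.3], [0.2, 0.4]])
    print(voting_scores([0, 1, 1], [0, 1], weights))
```

In the toy vote, two of the three classifiers choose class 1, yet class 0 receives the larger score because the single classifier voting for it carries a much higher weight for that class. With all weights set to 1, the same function reduces to plain polling.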
$$\mathrm{MCC} = \begin{bmatrix} mcc_{1,1} & mcc_{1,2} & \cdots & mcc_{1,M} \\ mcc_{2,1} & mcc_{2,2} & \cdots & mcc_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ mcc_{Q,1} & mcc_{Q,2} & \cdots & mcc_{Q,M} \end{bmatrix} \tag{18}$$

$$mcc_{d,f} = \frac{p_{df}\, n_{df} - u_{df}\, o_{df}}{\sqrt{(p_{df} + u_{df})(p_{df} + o_{df})(n_{df} + u_{df})(n_{df} + o_{df})}} \tag{19}$$

In (19), $p_{df}$ is the number of correctly predicted sequences belonging to class f by classifier d, $n_{df}$ is the number of correctly predicted sequences not belonging to class f by classifier d, and $u_{df}$ and $o_{df}$ are the numbers of under-predicted and over-predicted sequences belonging to class f by classifier d, respectively.

In [16], the weight factor $w_i = 1$ is used for each classifier ($i = 1, 2, \ldots, Q$) during the fusion process. Equal weightage is thus given to the prediction of each classifier when the outputs of the individual classifiers are fused: even if a classifier's prediction rate for a particular class is poor, it still makes the same contribution to the score $Y_j$ in (16). This does not efficiently exploit the decision space of the individual classifiers. In contrast, in our approach we use the MCC to assign a weight to the prediction of each classifier for each of the classes. This gives more weightage to a classifier trained on a more discriminative feature space. Therefore, during polling each classifier participates according to its MCC, which yields better overall performance. The results show that good performance is achieved by adopting this weighted-polling approach.

As a final step, the query protein P is predicted to belong to the class for which its score in (16) is the highest; that is,

$$Y_\mu = \mathrm{Max}\{Y_1, Y_2, \ldots, Y_M\} \qquad (\mu = 1, 2, \ldots, M) \tag{20}$$

where the operator Max means finding the maximum quantity among the bracketed elements. If there is a tie, choose arbitrarily among the maximum ones.

Fig. 3. A schematic diagram showing the process of prediction of protein using ensemble of classifiers. The protein sequence is passed as input to classifiers 1 to N, and the N outputs are fused using the weighted combination to give the final output.

III. RESULTS AND DISCUSSIONS

In order to evaluate the proposed technique, three types of performance evaluation tests have been conducted: re-substitution, jackknife and independent dataset tests. The results show that the proposed weighted classifier approach performs better than many of the techniques used earlier [6, 10, 16, 18]. The results are compared in terms of the number of correct predictions and their accuracies.

A. Results by Re-Substitution Test

The re-substitution test is an examination of the self-consistency of a classifier. During this test the entire training set is evaluated by the classifier trained on the same training patterns. The results show that out of 3,799 protein sequences 3,316 are correctly predicted and only 483 protein sequences are mispredicted, with an overall accuracy of 87.29% for 14 sub-cellular locations. The results are compared in terms of the total number of correct predictions and accuracy with the techniques adopted in [6, 10, 16, 18]. The improvement in accuracy over the results in [6, 10] is 44.29% and 19.39% respectively. The accuracy has been improved by 1.89% and 0.89% as compared to that of the techniques used in [18] and [16] respectively. The results are shown in TABLE 2.

B. Results by Independent Dataset Test

The suggested approach is also validated on an independent data set. The testing data set containing 4,498 protein sequences has been predicted using the rule-parameters derived from the 3,799 protein sequences in the training dataset. The results show that out of 4,498 protein sequences 3,377 are correctly predicted and only 1,121 protein sequences are mispredicted, with an overall accuracy of 75.07% for 14
sub-cellular locations. The results show that the proposed scheme outperforms those of [6, 10] by 34.37% and 13.88% respectively. The accuracy has been improved by 2.88% and 0.98% in comparison to the techniques used in [18] and [16] respectively. The results are shown in TABLE 2.

TABLE 2. Overall Success Rates for the 14 Sub-cellular Locations (Fig. 1) of Proteins by different Classifiers and Test Methods

                                                                          Re-substitution        Jackknife              Independent dataset
Classifier                              Input form                          Correct   Accuracy   Correct   Accuracy   Correct   Accuracy
ProtLock [5]                            Amino acid composition              1,655     43.60 %    1,614     42.50 %    1,829     40.70 %
Covariant discriminant [9]              Amino acid composition              2,580     67.90 %    2,339     61.60 %    2,751     61.20 %
Augmented covariant discriminant [23]   Amino acid composition (a)          3,245     85.40 %    2,574     67.80 %    3,246     72.20 %
Ensemble classifier [21]                Pseudo amino acid composition (b)   3,280     86.40 %    2,666     70.20 %    3,331     74.10 %
Weighted ensemble classifier            Pseudo amino acid composition (b)   3,316     87.30 %    2,704     71.15 %    3,377     75.07 %

a: The series-mode [21] was used to calculate the pseudo amino acid composition with a dimension of 20 + 13 + 13 = 46.
b: The amphiphilic mode (Appendix A) was used to calculate the pseudo amino acid composition with $\{\Lambda\} = \{\Lambda_1, \Lambda_2, \ldots, \Lambda_{22}\} = \{20, 22, \ldots, 62\}$.

C. Results by Jackknife Test

The jackknife test is conducted for the cross-validation of protein data. In cross-validation by jackknifing, each of the proteins in the data set is in turn singled out as a test sample and all the remaining samples are used to train the classifier. Among the three testing schemes, the jackknife test is deemed the most rigorous and objective one.

The results show that out of 3,799 protein sequences 2,704 are correctly predicted and 1,095 protein sequences are mispredicted, with an overall accuracy of 71.15% for 14 sub-cellular locations. The results are compared with those of [6, 10, 16, 18], and improvements in accuracy of 28.6%, 9.55%, 3.35% and 0.95% respectively have been achieved. These results are shown in TABLE 2.

TABLE 3. The Jackknife Success Rate Obtained by Each of the 22 Individual Classifiers; see (13)

Classifier (a)        Dimension (b)   Correct Predictions (d)   Accuracy % (c)
CD($\Lambda_{1}$)     20              2268                      59.70
CD($\Lambda_{2}$)     22              2387                      62.83
CD($\Lambda_{3}$)     24              24545                     64.56
CD($\Lambda_{4}$)     26              2497                      65.73
CD($\Lambda_{5}$)     28              2552                      67.17
CD($\Lambda_{6}$)     30              2575                      67.78
CD($\Lambda_{7}$)     32              2596                      68.33
CD($\Lambda_{8}$)     34              2585                      68.04
CD($\Lambda_{9}$)     36              2590                      68.18
CD($\Lambda_{10}$)    38              2602                      68.50
CD($\Lambda_{11}$)    40              2603                      68.51
CD($\Lambda_{12}$)    42              2613                      68.78
CD($\Lambda_{13}$)    44              2624                      69.07
CD($\Lambda_{14}$)    46              2643                      69.57
CD($\Lambda_{15}$)    48              2648                      69.70
CD($\Lambda_{16}$)    50              2632                      69.28
CD($\Lambda_{17}$)    52              2623                      69.04
CD($\Lambda_{18}$)    54              2601                      68.46
CD($\Lambda_{19}$)    56              2586                      68.07
CD($\Lambda_{20}$)    58              2553                      67.20
CD($\Lambda_{21}$)    60              2544                      66.96
CD($\Lambda_{22}$)    62              2562                      67.44

a) Individual basic classifier CD($\Lambda_i$) ($i = 1, 2, \ldots, Q$); see (13).
b) The dimension of the amphiphilic pseudo amino acid composition considered here is given by $\Lambda_i = 20 + 2(i - 1)$ ($i = 1, 2, \ldots, 22$), on which each individual classifier CD($\Lambda_i$) was operated.
c) The overall jackknife success rate was derived from the training dataset of 3,799 proteins taken from [21].
d) The number of correctly predicted sequences after the classification.

TABLE 3 contains the jackknife classification results of the individual classifiers. It is observed that none of the individual
classifiers has an accuracy greater than that of the weighted ensemble classifier.

IV. CONCLUSION

Amphiphilic pseudo amino acid composition yields a better prediction rate than normal discrete compositions, as it exploits the correlation between contiguous amino acids inside a protein. Fusing classifiers on the basis of the weighted MCC matrix helps in achieving high accuracy. This is because it gives each classifier a weightage in the polling process based on its accuracy for the respective class. The fusion of individual classifiers leads to a better prediction rate for the protein sequences, and the accuracy of the ensemble of classifiers is greater than the accuracy of any individual classifier. In future, we intend to use Genetic Programming for combining individual classifiers [24].

ACKNOWLEDGEMENTS

We are very grateful to Ghulam Ishaq Khan Institute of Engineering Sciences and Technologies (GIKI) for providing a healthy and rich research environment as well as moral support for carrying out this research. This work has been supported by the National Engineering & Scientific Commission of Pakistan (NESCOM).

REFERENCES

1. W. Bechtel, Discovering Cell Mechanism: The Creation Of Modern Cell Biology, Cambridge University Press, 2005.
2. B. Alberts, Cell Movements and the Shaping of the Vertebrate Body, Molecular Biology of the Cell, 4th edition, Garland Science, 2002.
3. Radford T, "Metaphors and dreams," The Scientist, vol. 17, pp. 24-26, 2003.
4. Bairoch A, Apweiler R, "The SWISS-PROT protein sequence data bank and its supplement TrEMBL," Nucleic Acids, vol. 25, pp. 31-36, 2000.
5. Nakashima H, Nishikawa K, "Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies," J Mol Biol, vol. 238, pp. 54-61, 1994.
6. Cedano J, Aloy P, Pérez-Pons JA, Querol E, "Relation between amino acid composition and cellular location of proteins," J Mol Biol, vol. 266, pp. 594-600, 1997.
7. Yuan Z, "Prediction of protein subcellular locations using Markov chain models," FEBS Letters, vol. 451, pp. 23-26, 1999.
8. Chou KC, Elrod DW, "Prediction of membrane protein types and subcellular locations," Proteins: Struct Funct Genet, vol. 34, pp. 137-153, 1999a.
9. Nakai K, "Protein sorting signals and prediction of subcellular localization," Adv Protein Chem, vol. 54, pp. 277-344, 2000.
10. Chou KC, Elrod DW, "Protein subcellular location prediction," Protein Eng, vol. 12, pp. 107-118, 1999b.
11. Park KJ, Kanehisa M, "Prediction of protein subcellular locations by support vector machines using compositions of amino acid and amino acid pairs," Bioinformatics, vol. 19, pp. 1656-1663, 2003.
12. Zhou GP, Doctor K, "Subcellular location prediction of apoptosis proteins," Proteins: Struct Funct Genet, vol. 50, pp. 44-48, 2003.
13. Gao Y, Shao SH, Xiao X, Ding YS, Huang YS, Huang ZD, Chou KC, "Using pseudo amino acid composition to predict protein subcellular location: Approached with Lyapunov index, Bessel function, and Chebyshev filter," Amino Acids, vol. 28, pp. 373-376, 2005.
14. Garg A, Bhasin M, Raghava GP, "Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search," J Biol Chem, vol. 280, pp. 14427-14432, 2005.
15. Shen HB, Chou KC, "Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition," Biochem Biophys Res Comm, vol. 337, pp. 752-756, 2005b.
16. Chou KC, Shen HB, "Predicting Protein Subcellular Location by Fusing Multiple Classifiers," Journal of Cellular Biochemistry, vol. 99, pp. 517-527, 2006.
17. Chou KC, "Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes," Bioinformatics, vol. 21, no. 1, pp. 10-19, 2004.
18. Chou KC, "Prediction of protein cellular attributes using pseudo amino acid composition," Proteins: Struct Funct Genet, vol. 43, pp. 246-255, 2001a.
19. Chou KC, Zhang CT, Maggiora GM, "Disposition of amphiphilic helices in heteropolar environments," Proteins: Struct Funct Genet, vol. 28, pp. 99-108, 1997.
20. Mahalanobis PC, "On the generalized distance in statistics," Proc Natl Inst Sci India, vol. 2, pp. 49-55, 1936.
21. Pillai KCS, Encyclopedia of Statistical Sciences, John Wiley & Sons, New York, pp. 176-181, 1985.
22. Chou KC, Zhang CT, "Predicting protein folding types by distance functions that make allowances for amino acid interactions," J Biol Chem, vol. 269, pp. 22014-22020, 1994.
23. Matthews BW, "Comparison of predicted and observed secondary structure of T4 phage lysozyme," Biochim Biophys Acta, vol. 405, pp. 442-451, 1975.
24. Abdul Majid, A. Khan, A. M. Mirza, "Combination of SVM using GP," International Journal of Hybrid Intelligent Systems, vol. 3, no. 2, pp. 109-125, 2006.
25. http://www.interscience.wiley.com/jpages/0730-2312/suppmat.
