
Performance of Different Approaches for Predicting the Subcellular Locations of Proteins: A Review

Muhammad Taskeen Raza
Department of Electrical Engineering, VET Department of Electrical Engineering, LCWU Lahore, Pakistan

Noor M. Sheikh
Department of Electrical Engineering, University of Engineering & Technology, Lahore, Pakistan

Muhammad Abuzar Fahiem
Department of Computer Science, Lahore College for Women University, Lahore, Pakistan

Ahmed M. Mehdi
Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia

Abstract-Subcellular location of a protein is closely related to its function. Knowing the subcellular localization of proteins is important in molecular cell biology, proteomics, systems biology and drug discovery. Different predictors have been developed that predict the presence, location and interaction of molecules using various computational techniques, including probabilistic models, machine learning and artificial intelligence algorithms. These predictors partially cover the different aspects of exploration of subcellular locations. Some of them are applicable equally well to many types of organisms (human, yeast, mouse, bacteria), while some are specific and focus on better accuracy of the predicted results. Similarly, some techniques cover "few" proteins but more accurately, while other algorithms predict the subcellular locations of "many" proteins at the expense of prediction accuracy. This research is a review of the most common and efficient techniques, grouped into four categories: 1- amino acid composition and order-based predictors, 2- sorting signal predictors, 3- homology-based predictors and 4- hybrid methods that use several sources of information to predict localization. The work elucidates the performance and coverage comparisons among the subcellular location predictors.

Keywords- Predictor; Subcellular Location; Organisms; Machine Learning; Eukaryotic Cells.

I. INTRODUCTION

Predicting the subcellular localization of a protein is crucial in determining the protein's role in the cell in general and for the drug discovery process in particular. If we know the subcellular localization of a protein in the cell, the formulation of a drug and its targets can be suggested. With reliable information about the subcellular location of a protein, drug molecules can be affixed to the protein of interest to reach their target. However, due to mutations in genes and proteins, there are high chances of unusual subcellular localizations of proteins, as in the case of diseases like cancer. In this scenario, information about the subcellular location of particular proteins seems even more valuable.

A wide range of genomic projects has generated protein sequence, interaction and expression data for almost every organism in large quantity [1]. However, the functions of many proteins are still unannotated. Different research efforts are being carried out to interpret the role of these proteins in the cell in particular and in the organism in general. There are a number of ways through which genome function annotation can be interpreted, such as protein composition, protein structure and shape, protein interaction, subcellular location, etc. Subcellular location information of the proteins is the key to this knowledge discovery in proteomics.

Subcellular locations are the compartments in the cells of living bodies, with defined walls and boundaries in eukaryotic cells. These compartments vary in number and in their functions in cells. Cells can be classified into two main types, prokaryotic cells and eukaryotic cells. Prokaryotic cells are simpler, lacking a nucleus and defined boundaries; the best-characterized example is bacterial cells. Eukaryotic cells are greater in size and volume and have membrane-bound, well defined boundaries of the cell compartments, the subcellular locations; an example is the mammalian cell. A typical mammalian eukaryotic cell has the following subcellular locations: nucleolus, nucleus, ribosome, vesicle, rough endoplasmic reticulum, Golgi apparatus, cytoskeleton, smooth endoplasmic reticulum, mitochondria, vacuole, cytoplasm, lysosome and centriole. These are the possible accommodations of the proteins.

Subcellular location is not only unannotated for many proteins, but there is also the issue of multi-compartmental proteins. Some proteins keep changing their localization in the cell depending on their role and function as defined by nature. Experimental approaches and methods are being employed to obtain knowledge of the subcellular locations. This work is very laborious and time consuming, especially when there are huge


numbers of proteins to be examined in this fashion. Fortunately, the issue can be resolved with satisfactory efficiency when we change the domain of analysis of such problems. Machine Learning is the area of statistical pattern recognition, which offers a variety of methods and techniques for predicting the subcellular locations of proteins. These probabilistic models, including Neural Networks, Support Vector Machines, Bayesian Networks and Markov Random Fields, can be used for the inference of the subcellular localizations of such proteins.

As far as accuracy of the results is concerned, experimental methods are definitely always the better option, but when we compare their throughput, probabilistic models are much ahead, because we have to annotate the billions of proteins which have been discovered in their composition only, not in their functions. Importantly, machine learning models such as Bayesian Networks (BN), which are also called "Belief Networks", are not based on the frequency concept of occurring events but on the past record and experience of the events, called "prior probabilities". BN also takes into account the posterior probabilities and improves the prediction accuracy of probabilistic inference. In this way, when we perform the statistical analysis of the results of these models and compare them with experimental results, we achieve very close results in some cases, and even better results in situations where the experimental methods are difficult to perform due to unavailability of resources and/or inaccessibility of the samples. Although these techniques offer relatively high throughput, there is always a trade-off between throughput and accuracy of probabilistic models.

If we have more and more accurate knowledge about the basic building block of life, the cell, and its constituents, significantly the proteins, then we will be in a much better position to cure and control the diseases and disorders in the cells of human beings especially and of other living organisms generally.

In this paper, Section II is about the techniques employed, protein features utilized and data sets used by researchers. Section III discusses the results of those research works, including the comparative study of the selected models and the methods of statistical analysis commonly in use.

II. MATERIALS & METHODS UTILISED

Statistical approaches used for the inference and prediction of subcellular locations of proteins differ in their objectives and results and also in their techniques. Some of them infer more accurate results while some cover a larger number of proteins. They utilize different features and characteristics of the input query proteins, like amino acid composition and order [3, 4], protein motifs, protein interaction and signal peptides [9, 10]. They also use different datasets from reliable resources like Swissprot, NucProt, bioGRID etc. for getting biological data like protein sequence data, protein interaction data, protein expression data and signal peptide codes. Similarly, some authors [2, 3] employed Support Vector Machine techniques and some [4] applied Neural Networks. Later on, researchers used other probabilistic models; some of them [5, 6] use Bayesian Networks and some use other bioinformatics models. Interestingly, in [6] the researchers earlier utilized only InterPro motifs as the main protein features. Later on, in [12], they refined their approach by using more than one protein feature in a hybrid design of predictors to achieve better accuracy in prediction.

A. Datasets

Genomic projects and proteomics research are a live and dynamic area for researchers throughout the world; even some developing countries are coming forward in this cause, ultimately for the service of humanity. A few of the data resources in use by the analysts are the following.

• Uniprot/Swissprot: for protein sequence data and for localization data [3, 4].
• BioGRID: for protein interaction data [7, 8].

Other commonly used datasets are the Hera human dataset, the Lifedb GFP dataset, the Yeast data set, interProMotifs [6] etc.

B. Features

Selecting biological features determines the unique and important characteristics of the input data on the basis of which input information can be categorized into the desired output groups. In the case under consideration, the inputs are the proteins and the desired output is the inference of the subcellular locations of those proteins. So there should be information about the features and properties of proteins that may lead to proper inference. Some of the protein characteristics are described below.

• Protein composition: provides information about the constituents of the proteins and their order.
• Protein interaction: provides information about the corresponding interacting proteins in any compartment.
• Protein expression: provides information about the protein density/quantity in the cell at a particular location.

Table I provides concise information about these features as used by various researchers and also their relation with the accuracy of the results.

Table I: Protein features utilized by different researchers.

Researchers    Features utilized                                            Accuracy
Z. Lu [1]      Text annotation of homologs                                  81%
Sujun [3]      Protein composition                                          79%
Scott [6]      InterPro motifs                                              78%
Olof [9]       Signal peptides                                              85%
Scott [12]     Protein motifs, signal peptides, protein interaction data    94%
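As an illustration of the protein composition feature listed above, the following minimal Python sketch computes the fraction of each of the 20 standard amino acids in a sequence. It is only a sketch under stated assumptions: the function name `composition_vector`, the residue ordering in `AMINO_ACIDS` and the toy sequence are hypothetical and are not taken from any of the reviewed predictors or datasets.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Return the 20-dimensional amino acid composition of a protein sequence.

    Each entry is the fraction of residues of one amino acid type.
    Characters outside the 20 standard letters are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty or fully non-standard sequence: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, not taken from any dataset mentioned in the paper.
vec = composition_vector("MKKLLA")
```

A vector of this kind carries composition only; the residue order, which the feature description also mentions, would need additional descriptors.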

C. Existing Approaches

There are many machine learning approaches which are being applied for the inference of subcellular locations of proteins. All these techniques can be categorized technique wise, like neural networks, support vector machines, Bayesian networks and others, and feature wise, like protein sequence, protein interaction, protein expression and others. From the input protein characteristic point of view, they fall into the following groups.

1. Amino acid composition and order based predictors.
2. Sorting signal predictors.
3. Homology-based predictors.
4. Hybrid predictors.

A detailed description of the above mentioned approaches is given in the following section.

C-1: Amino acid composition and Order based Predictors

Amino acid composition based predictors use the composition information of the proteins. Chemically, a protein is composed of amino acids. These are bonded with each other in a specific pattern and order for a given protein. It has been found experimentally that proteins which are located in any specific compartment of the cell have a special composition and order which is nearly common to all of the proteins of one compartment [4]. Several attempts have been made to use this protein feature for the prediction of subcellular locations.

Hua S. & Sun Z. [3] developed a prediction system for subcellular localization, named SubLoc. They applied Support Vector Machine (SVM) techniques for this predictor and utilized the amino acid composition feature of the proteins effectively. The SVM technique itself was proposed by Vapnik (1995, 1998) for applications of pattern recognition; we refer to that work for the complete approach of the technique. SVM is now a very common tool used for different machine learning applications in various areas. Through their prediction system they predicted 3 subcellular locations (Cytoplasmic, Periplasmic, Extracellular) for prokaryotic cells and 4 subcellular locations (Cytoplasmic, Extracellular, Mitochondrial, Nuclear) for eukaryotic cells. Some of the key points about their strategy are the following.

• They developed SubLoc with two classifiers mainly, one 3-class classifier for prokaryotic cells and one 4-class classifier for eukaryotic cells. So it was a multiclass classification issue.
• They simplified the classifier design by decomposing the multi-class classification into a series of binary classifications, because a binary classifier is easy to implement.
• They trained the SVM model by the 1-v-r (one-versus-rest) approach. In this training procedure the ith SVM is trained with all the samples in the ith class having positive labels and all other samples having negative labels.
• The input vector used for the SVM was of dimension 20, because of the 20 unique amino acid compositions.
• To develop the system, the open source software SVMlight was used.
• The data set used was from SWISS-PROT.
• To examine the prediction accuracy, the jackknife test was used.
• Accuracy of prediction was achieved up to 91.4% for prokaryotic cells and up to 79.4% for eukaryotic cells. These results were more accurate than the previous works of Reinhardt & Hubbard [4] and Chou & Elrod [16].

C-2: Sorting Signal Predictors

Correct positioning of a protein in the cell is crucial and is linked with the dynamic organization of a cell. Incorrect sorting can cause several diseases in the cell, such as cancers [15]. Protein sorting is the process by which the cell accurately transports a protein to the desired subcellular location in the cell. This process is carried out on the basis of information which is contained in the protein itself, called the protein sorting signal or signal peptide. A signal peptide is usually a chain of 3-60 amino acids.

Emanuelsson O. and co-workers [9] proposed a tool for the prediction of subcellular locations of proteins, named TargetP. This tool is based on the Neural Network technique. It predicts the mitochondrion, the chloroplast, secretory pathway locations and "others" in the cell. Other features of this model are the following.

• TargetP was developed using two-layer neural networks. The first layer consists of a separate sub-network for each of the input signal peptides, which are actually presequences, i.e. mitochondrial targeting peptides (mTP), chloroplast transit peptides (cTP) and secretory pathway (SP) signal peptides. In the non-plant version this layer contains two presequences, mTP and SP, because it predicts these two sites only.
• The output of the first layer is applied to the 2nd, integrating network layer, which ultimately outputs a score generated by the model for each input query protein, thus predicting the subcellular location having the highest score.
• All the networks used in the model were of feed-forward type, having zero or one layer of hidden neurons, and all were trained using the error back-propagation method.
• Prediction accuracy for plants is 85% for 4 subcellular locations, while for non-plants it is 90% for three subcellular locations.

C-3: Homology-based Predictors

It is usually believed that proteins with similar sequences perform the same function [3]. Therefore, homology based predictors are based on the principle of doing a similarity search on the sequence data of proteins with known subcellular locations, for feature texts [1]. Lu Z. and colleagues [1] have employed this approach as part of the ongoing efforts for the annotation of protein subcellular locations. They used the database text annotation of homologs: the text is extracted from the homologs and then a classifier approach is used. This has shown considerable performance over the previous approaches of amino acid composition and signal peptides. Another prominent aspect of their work is that the coverage, in terms of location coverage, taxonomic coverage and sequence coverage, was much better than in the previous approaches. They modeled five classifiers for predicting subcellular locations for five organism groups: plants, animals, fungi, Gram-positive bacteria and Gram-negative bacteria. The accuracy of these classifiers is 81% for fungi and 93% for the other four classifiers.
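The homology principle stated above (similar sequences tend to share a location) can be illustrated with a toy nearest-neighbour lookup in Python. This is only a sketch under loud assumptions and is not the Proteome Analyst method: the k-mer Jaccard similarity stands in for a real sequence similarity search, and the helper names (`kmers`, `jaccard`, `transfer_location`), the sequences and the location labels in `ANNOTATED` are all made up for illustration.

```python
def kmers(sequence, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transfer_location(query, annotated, k=3):
    """Return (location, similarity) of the most similar annotated sequence.

    `annotated` maps sequences with a known location to that location.
    """
    q = kmers(query, k)
    best_seq = max(annotated, key=lambda s: jaccard(q, kmers(s, k)))
    return annotated[best_seq], jaccard(q, kmers(best_seq, k))


# Hypothetical annotated sequences; real systems search a curated database.
ANNOTATED = {
    "MKTLLVAGAVLS": "extracellular",
    "MSDEEKRKRPAK": "nuclear",
}

location, score = transfer_location("MKTLLVAGAILS", ANNOTATED)
```

In this toy case the query differs from the first annotated sequence by a single residue, so it shares 7 of 13 distinct 3-mers with it and none with the second sequence, and the first sequence's label is transferred.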

The specific working approach of their work is that the query sequence is compared with SWISS-PROT database entries (having known subcellular locations), thus obtaining the homologs of the query sequence. After this, using the most similar homologs, the classifier predicts the subcellular location of the query protein on the basis of the Boolean-valued results for the query protein. The classifier must also be trained before it can be used for inference. This training is done using labeled training data. The tool developed by them, named Proteome Analyst (PA), also has the unique feature of providing open access to users on the web and allowing them to make their own customized classifiers for desired results easily. This tool is available at http: www.cs.ualberta.

C-4: Hybrid Predictors

Biological data available for the annotation of subcellular locations of proteins is incomplete, and in some cases inaccurate as well, because some of the data is generated on the basis of results from probabilistic models only, without any sort of experimental verification. Missing data for some of the proteins in datasets is also a problem while predicting the locations of proteins in cells. So all three earlier discussed categories suffer from these problems of the input data used for inference. The logical solution in such problems is to integrate all these data, which are either incomplete or inaccurate or both, particularly when they are obtained from diverse resources. By integrating the data, compensation can be made while predicting the results. So hybrid predictors are the ones which use different types of data as input, like protein interaction data, protein composition data, protein signal peptides and protein expression data etc. at a time; these predictors may also use each of these data types from more than one resource to compensate for deficiencies in data availability. Thus the prediction accuracy of subcellular locations of the proteins was improved.

A Bayesian network is one of the best possible methods for making hybrid models and for integrating huge amounts of data. Bayesian networks are Directed Acyclic Graphs (DAG), also called Belief Networks. The nodes of a Bayesian network are random variables, which may be Boolean or continuous, hidden or known. The arcs joining the nodes show the conditional dependencies of nodes on their parent nodes. The Bayesian network model has the beauty of data integration as well as the ability to compensate for missing data.

Earlier, Drawid A. & Gerstein M. [11] proposed a hybrid predictor based on Bayesian networks. They utilized this method by integrating 30 diverse features of proteins. A key characteristic of the model was integrating expression data with protein sequence data. Testing and training also revealed that out of those 30 features nearly half were redundant. The accuracy of their predictor was 75%.

Later, Scott M. S. and co-workers [6] utilized this approach in refining protein subcellular locations. Their approach was also based on the Bayesian network technique. Other key points of their approach are the following. In this model BN integrates different protein data. The model was implemented at two levels. On the first level there are three independent modules, namely the motif module, the targeting module and the interaction module. The motif module is based on InterPro motifs in the proteins. The targeting module utilizes the protein signal peptides feature, while the interaction module uses the protein-protein interaction information from the CORE dataset. Each of these modules can independently predict the subcellular location of the input proteins by employing its corresponding protein features. At the second level a naive Bayes network is used which combines the predictions of all three basic modules, thus refining the prediction accuracy. One important characteristic of this hybrid model is that each basic module has a complementary role in the overall performance of the predictor. For example, the interaction module is the most accurate of the three, even up to 100%, but has the least coverage. The targeting module, in reverse, has the highest coverage, but its results have low accuracy. This prediction system, named PSL2, is able to predict the localization of all the yeast proteins into 9 compartments of the cell. Thus the beauty of integration is revealed by ultimately providing the highest coverage and accuracy for yeast proteins out of all the previous works in this organism. It was also applied to related species and found an accuracy of more than 80%. Verification of results was done using the 10-fold cross validation method.

More recently, Scott M. S. with his co-workers [12] improved their previous PSL2 predictor to PSLT, which is also based on the Bayesian network model. Salient features of their latest predictor are that it is applicable to human proteins and that it can predict all the subcellular localizations of proteins in the cell, so the location coverage is 100%. It also annotates multi-compartmental proteins. For nine subcellular locations (Golgi apparatus, peroxisome, lysosome, endoplasmic reticulum, plasma membrane, nucleus, cytosol, mitochondrion and extracellular space) it achieved an accuracy of 78% when analyzed by the 10-fold cross validation test, covering 74% of the Homo sapiens proteins.

III. COMPARATIVE STUDY

A. Statistical Analysis

To verify the results of the proposed predictors, various statistical methods have been employed to obtain the accuracy and also to compare with other predictors. Some of them are the following.

• 10-Fold Cross Validation Test: In this method the input data is randomly partitioned into 10 non-overlapping subsets. Firstly, nine of the subsets are used to train the predictor and the remaining 10th subset is used for inference. Then this procedure is repeated with all 10 subsets [2, 3]. It is used by [6, 12].
• Self-Consistency Test: In this analysis the same dataset is used for training the predictor first and then for the inference [12].
• Jackknife Test: The basic idea behind this variance estimator is that, after leaving out one or more observations from the sample, the statistic estimate is recomputed. By doing so a variance estimate is calculated [3, 4].
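The partitioning behind the 10-fold cross validation test and the leave-one-out recomputation behind the jackknife test, as described above, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper names `k_fold_splits` and `jackknife_variance`, the fixed random seed and the toy numbers are hypothetical, and the jackknife here estimates the variance of a sample mean rather than reproducing any reviewed predictor's evaluation code.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k non-overlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def jackknife_variance(values):
    """Jackknife (leave-one-out) variance estimate of the sample mean."""
    n = len(values)
    total = sum(values)
    # Recompute the mean with each observation left out in turn.
    loo_means = [(total - v) / (n - 1) for v in values]
    mean_of_loo = sum(loo_means) / n
    return (n - 1) / n * sum((m - mean_of_loo) ** 2 for m in loo_means)


# Toy usage: 25 samples split into 10 folds, and a 4-value jackknife.
splits = list(k_fold_splits(25, k=10))
var_hat = jackknife_variance([1.0, 2.0, 3.0, 4.0])
```

Each sample lands in exactly one test fold, so every protein is predicted once by a model that did not see it during training. For the sample mean, the jackknife estimate coincides with the usual sample variance divided by the number of observations.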

B. Comparisons

Genome annotation, to obtain protein function knowledge, is a live research topic, so many researchers and groups are contributing their services in this cause. Some of the models are presented in this paper to give an overall view and to show the different aspects of the various approaches. A comparison of these models is neither simple nor entirely fair, because some of the authors [4, 7] are the pioneers in this area while some have proposed innovative approaches [1, 6] later. Some of them [6, 10] focus on the coverage of the proteins while others focus on the prediction accuracy [1, 9]. Still, we can have a brief and comprehensive look at the performance of their proposed models. Table II shows such a comparison between different categories of predictors and the different techniques used. In this evaluation the parameters considered are the technique used, coverage, accuracy and the statistical analysis method employed. The comparison shows that, due to the nature of the available data, the most efficient approach is that of hybrid predictors, especially when we want the inference for human proteins, which are relatively lacking in data availability as compared to other species. Hybrid predictors not only get reasonable accuracy, but their location and taxonomic coverage is also significant.

TABLE II: Comparison of Different Subcellular Location Predictors.

Author                        Technique                          Accuracy  Location coverage  Taxonomic coverage                                     Statistical test

Amino acid composition predictors
Hua S and Sun Z [3]           Support Vector Machine             91.4%     03                 Prokaryotic                                            Jackknife test
                                                                 79.4%     04                 Eukaryotic                                             Jackknife test
Reinhardt and Hubbard T [4]   Neural Network                     81.0%     03                 Prokaryotic                                            Jackknife test
                                                                 66.0%     04                 Eukaryotic                                             Jackknife test

Sorting signal predictors
Emanuelsson O et al. [9]      Neural Network                     85.0%     04                 Plants                                                 Redundancy-reduced test
                                                                 90.0%     03                 Non-plants                                             Redundancy-reduced test
Bendtsen JD et al. [10]       Neural network and hidden Markov   75%       03                 Eukaryotes, Gram-negative and Gram-positive bacteria   10-fold cross validation

Homology-based predictors
Marcotte E M et al. [13]      BLAST search                       50%       02                 Yeast and worm                                         Self-consistency test
Lu Z et al. [1]               NB classifier                      81.0%     09                 Fungi                                                  10-fold cross validation
                                                                 93.0%     09                 Animals, plants, Gram-negative and Gram-positive bacteria  10-fold cross validation

Hybrid methods
Drawid A & Gerstein M [11]    Bayesian Network                   75.0%     05                 Yeast                                                  10-fold cross validation, self-consistency test
Scott M S et al. [6]          Bayesian Network                   78%       09                 Human                                                  10-fold cross validation

IV. SUMMARY

In this review paper we discussed a very active and important research area in bioinformatics, the inference of subcellular localization. At the start, the motivation and the background behind the topic were discussed. Then different aspects of the research work on modeling subcellular location predictors were covered, including the protein features commonly utilized, the data set resources used and different machine learning techniques like neural networks, support vector machines and Bayesian networks, which were reviewed in the context of this inference. At the end, a few of the results were presented in the form of a comparative table. This shows that the performance of the predictors varies significantly in coverage and accuracy, because they all focused on different aspects of performance for subcellular inference in their prediction models.

REFERENCES

[1] Z. Lu, D. Szafron, R. Greiner, P. Lu, D. S. Wishart, B. Poulin, J. Anvik, C. Macdonell and R. Eisner, "Predicting subcellular localization of proteins using machine-learned classifiers," Bioinformatics, Vol. 20, No. 4, pp. 547-556, 2004.
[2] Kuo-Chen Chou and Yu-Dong Cai, "Using Functional Domain Composition and Support Vector Machines for Prediction of Protein Subcellular Location," The Journal of Biological Chemistry, Vol. 277, No. 48, pp. 45765-45769, 2002.
[3] Sujun Hua and Zhirong Sun, "Support vector machine approach for protein subcellular localization prediction," Bioinformatics, Vol. 17, No. 8, pp. 721-728, 2001.
[4] A. Reinhardt and T. Hubbard, "Using Neural Networks for prediction of the subcellular location of proteins," Nucleic Acids Research, Vol. 26, No. 9, 1998.
[5] Ahmed M. Mehdi, Muhammad Shoaib B. Sehgal, Bostjan Kobe, Timothy L. Bailey and Mikael Boden, "A probabilistic model of nuclear import of proteins," Bioinformatics, Vol. 27, No. 9, pp. 1239-1246, 2011.
[6] Michelle S. Scott, Sara J. Calafell, David Y. Thomas and Michael T. Hallett, "Refining Protein Subcellular Localization," PLoS Computational Biology, Vol. 1, No. 6, 2005.
[7] Zheng Yuan, "Prediction of protein subcellular locations using Markov chain models," FEBS Letters, Vol. 451, No. 1, pp. 23-26, 1999.
[8] Jennifer L. Gardy, Cory Spencer, Ke Wang, Martin Ester, Gabor E. Tusnady, Istvan Simon, Sujun Hua, Katalin deFays, Christophe Lambert, Kenta Nakai and Fiona S. L. Brinkman, "PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria," Nucleic Acids Research, Vol. 31, No. 13, pp. 3613-3617, 2003.
[9] Olof Emanuelsson, Henrik Nielsen, Søren Brunak and Gunnar von Heijne, "Predicting Subcellular Localization of Proteins Based on their N-Terminal Amino Acid Sequence," Journal of Molecular Biology, Vol. 300, No. 4, pp. 1005-1016, 2000.
[10] Jannick Dyrløv Bendtsen, Henrik Nielsen, Gunnar von Heijne and Søren Brunak, "Improved Prediction of Signal Peptides: SignalP 3.0," Journal of Molecular Biology, Vol. 340, No. 4, pp. 783-795, 2004.
[11] Amar Drawid and Mark Gerstein, "A Bayesian System Integrating Expression Data with Sequence Patterns for Localizing Proteins: Comprehensive Application to the Yeast Genome," Journal of Molecular Biology, Vol. 301, No. 4, pp. 1059-1075, 2000.
[12] Michelle S. Scott, David Y. Thomas and Michael T. Hallett, "Predicting Subcellular Localization via Protein Motif Co-Occurrence," Genome Research, Vol. 14, pp. 1957-1966, 2004.
[13] Edward M. Marcotte, Ioannis Xenarios, Alexander M. van der Bliek and David Eisenberg, "Localizing proteins in the cell from their phylogenetic profiles," PNAS, Vol. 97, No. 22, pp. 12115-12120, 2000.
[14] Nancy Y. Yu, James R. Wagner, Matthew R. Laird, Gabor Melli, Sebastien Rey, Raymond Lo, Phuong Dao, S. Cenk Sahinalp, Martin Ester, Leonard J. Foster and Fiona S. L. Brinkman, "PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes," Bioinformatics, Vol. 26, No. 13, pp. 1608-1615, 2010.
[15] http://en.wikipedia.org/wiki/Protein_sorting
[16] Kuo-Chen Chou and David W. Elrod, "Protein subcellular location prediction," PEDS, Vol. 12, No. 2, pp. 107-118, 1999.