
Decision Tree Classifier for Human Protein Function Prediction

Manpreet Singh
Lecturer, I.T., Guru Nanak Dev Engineering College, Ludhiana
mpreet78@yahoo.com

Parvinder Singh
Assistant Professor, CSE, Guru Nanak Dev Engineering College, Ludhiana
parvinder.sandhu@gmail.com

Dr. Hardeep Singh
Professor and Head, CSE, Guru Nanak Dev University, Amritsar
hardeep_gndu@rediffmail.com

Abstract - Drug discoverers need to predict the functions of proteins which are responsible for various diseases in the human body. The proposed method uses priority-based packages of SDFs (Sequence Derived Features) so that the decision tree may be created by their depth exploration rather than their exclusion. This research work develops a new decision tree induction technique in which an uncertainty measure is used for best attribute selection. The model creates a better decision tree in terms of depth than the existing C4.5 technique. A tree with greater depth ensures a greater number of tests before functional class assignment and thus results in more accurate predictions than the existing prediction technique. For the same test data, the percentage accuracy of the new HPF (Human Protein Function) predictor is 72% and that of the existing prediction technique is 44%.

Index Terms - Sequence Derived Features, attribute, Human Protein Function, predictor.

I INTRODUCTION

The process of drug discovery involves the prediction of protein function based upon existing facts, and sophisticated data mining models are needed for this prediction. Bioinformatics promises to reduce the labor, time and cost associated with the process. There are various experimental techniques for the identification of a protein and the determination of its 3-D structure [9].

1.1 Importance of Human Protein Function Prediction

The importance of human protein function prediction lies in the process of drug development. Drug development has two major components [9]: discovery and testing. The testing process involves preclinical and clinical trials; computational methods are not generally expected to produce significant enhancement in the testing of drugs. In the discovery process, however, computational methods are very helpful. Drug discovery is labor intensive and expensive, and has provided a fertile ground for bioinformatics research. Bioinformatics promises to reduce the labor associated with this process, allowing drugs to be developed faster and at a lower cost.

1.2 Decision Tree Induction

A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The top-most node in a tree is the root node [2]. In order to classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node that holds the class prediction for that sample.

1.3 Techniques for Protein Function Prediction

Experimental Techniques: A) For protein identification - 2-D electrophoresis, mass spectrometry and protein microarrays [9]. B) For 3-D structure determination - X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy [9].

Computational Techniques: The computational techniques for identifying unknown proteins and for predicting their structure and functions are: 1. The QM/MM scheme, i.e. the Quantum Mechanical/Molecular Mechanical scheme, is used by the software GAMESS (General Atomic and Molecular Electronic Structure System) to predict an unknown protein; it requires a large amount of computer memory for its mathematical calculations and runs on the Linux operating system. 2. The software SWISS-Model is used for automated building of theoretical structural models of a given protein (amino-acid sequence) based on the structures of known proteins. 3. Classifiers, for example neural networks and decision trees, learn classification rules from the given training data and are used to predict protein function, as the sketch below illustrates.
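To make item 3 concrete, the following minimal Python sketch (ours, not the paper's; the tree, attribute names and class labels are invented for illustration) shows how a learned decision tree assigns a class by tracing a path from the root to a leaf, as described in Section 1.2:

    # Minimal sketch of root-to-leaf classification (Section 1.2).
    # The tree below is a hypothetical two-level tree over discretized SDF values.

    class Node:
        def __init__(self, attribute=None, branches=None, label=None):
            self.attribute = attribute      # attribute tested at an internal node
            self.branches = branches or {}  # attribute value -> child node
            self.label = label              # functional class at a leaf node

    def classify(node, sample):
        """Trace a path from the root to a leaf; return the class at the leaf."""
        while node.label is None:           # internal node: test its attribute
            value = sample[node.attribute]  # outcome of the test on this sample
            node = node.branches[value]     # follow the branch for that outcome
        return node.label

    root = Node(attribute="Instability Index",
                branches={"low": Node(label="Defensin (Def)"),
                          "high": Node(label="Heat Shock Protein (HSP)")})
    print(classify(root, {"Instability Index": "low"}))  # -> Defensin (Def)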


II PROBLEM STATEMENT AND SOLUTION APPROACH

2.1 Information Theory and Entropy

According to Shannon, if the total information to be transmitted is divided into certain and uncertain parts, fewer bits are assigned to a sequence of certain information than to uncertain information; on average, then, fewer bits need to be transmitted over the communication channel [1]. Let the variable x range over the values to be encoded, and let P(x) denote the probability of that value occurring. According to information theory, the expected number of bits required to encode one value is the weighted average of the number of bits required to encode each possible value, where the weight is the probability of that value [1].

2.2 Attribute Selection in DTC

DTC (Decision Tree Classifier) methodology involves entropy calculation. Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m distinct classes C_i (for i = 1, ..., m). Let s_i be the number of samples of S in class C_i. The expected information needed to classify a given sample is given by [2]:

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i)    (1)

where p_i is the probability that an arbitrary sample belongs to class C_i and is estimated by s_i/s.

Let attribute A have v distinct values, {a_1, a_2, ..., a_v}. Attribute A can be used to partition S into v subsets, {S_1, S_2, ..., S_v}, where S_j contains those samples in S that have value a_j of A. If A were selected as the test attribute (i.e., the best attribute for splitting), then these subsets would correspond to the branches grown from the node containing the set S. Let s_ij be the number of samples of class C_i in a subset S_j. The entropy, or expected information based on the partitioning into subsets by A, is given by [2]:

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})    (2)

The term (s_{1j} + ... + s_{mj})/s acts as the weight of the jth subset and is the number of samples in the subset (i.e., having value a_j of A) divided by the total number of samples in S. The smaller the entropy value, the greater the purity of the subset partitions. The encoding information that would be gained by branching on A is [2]:

Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)    (3)

Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A. The information gain measure is used to select the test attribute at each node in the tree; such a measure is referred to as an attribute selection measure or a measure of the goodness of split [2][19]. The algorithm computes the information gain of each attribute, and the attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the given set S. This attribute minimizes the information needed to classify the samples in the resulting partitions and reflects the least randomness or "impurity" in these partitions. Such an information-theoretic approach minimizes the expected number of tests needed to classify an object and guarantees that a simple (but not necessarily the simplest) tree is found [19]. A node is created and labeled with the attribute, branches are created for each value of the attribute, and the samples are partitioned accordingly [12].
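The calculation in equations (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes the training data is a list of (feature-dictionary, class-label) pairs, and the helper names and sample values are ours:

    import math
    from collections import Counter

    def info(labels):
        """I(s1, ..., sm) of eq. (1): expected information to classify a sample."""
        s = len(labels)
        return -sum((si / s) * math.log2(si / s) for si in Counter(labels).values())

    def entropy(samples, attribute):
        """E(A) of eq. (2): weighted information of the subsets induced by A."""
        s = len(samples)
        subsets = {}
        for features, label in samples:
            subsets.setdefault(features[attribute], []).append(label)
        return sum((len(lbls) / s) * info(lbls) for lbls in subsets.values())

    def gain(samples, attribute):
        """Gain(A) of eq. (3): expected reduction in entropy from knowing A."""
        return info([label for _, label in samples]) - entropy(samples, attribute)

    # C4.5-style selection: pick the attribute with the highest information gain.
    samples = [({"Hydrophobicity": "low",  "ExpAA": "high"}, "CSR"),
               ({"Hydrophobicity": "low",  "ExpAA": "low"},  "Def"),
               ({"Hydrophobicity": "high", "ExpAA": "low"},  "Def")]
    best = max(["Hydrophobicity", "ExpAA"], key=lambda a: gain(samples, a))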
2.3 HPF Prediction using SDFs

L. Jensen et al. (2002) used sequence derived features to predict HPF. The idea was to integrate all protein features in order to predict protein function. The author developed a data mining model for protein function prediction using neural networks as the classifier. The method includes the extraction of SDFs from a given set of amino-acid (protein) sequences using various web-based bioinformatics tools. The ExPASy ProtParam tool is used to obtain the sequence-derived feature called the Extinction Coefficient, a protein parameter commonly used in the laboratory for determining the protein concentration in a solution by spectrophotometry. It describes to what extent light is absorbed by the protein and depends upon the protein's size and composition as well as the wavelength of the light. For a wavelength of 280 nm, the extinction coefficient of a protein can be calculated from the number of tryptophans (n_Trp), tyrosines (n_Tyr) and cystines (n_Cys) in the protein [8]:

\varepsilon_{protein} = n_{Trp}\,\varepsilon_{Trp} + n_{Tyr}\,\varepsilon_{Tyr} + n_{Cys}\,\varepsilon_{Cys}    (4)

where \varepsilon_{Trp}, \varepsilon_{Tyr} and \varepsilon_{Cys} are the extinction coefficients of the individual amino-acid residues. This calculation is performed by the ExPASy ProtParam tool. Similarly, other sequence-derived features are obtained from the ExPASy ProtParam tool and other tools, as shown in Table 2.1.
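As a worked example of equation (4) in Python: the residue coefficients at 280 nm used below (5500, 1490 and 125 M^-1 cm^-1 for Trp, Tyr and cystine respectively) are the values conventionally used by the ExPASy ProtParam tool; the paper itself does not list them, so they are stated here as an assumption:

    # Eq. (4) at a wavelength of 280 nm.
    EPS_TRP, EPS_TYR, EPS_CYS = 5500, 1490, 125   # residue coefficients, M^-1 cm^-1

    def extinction_coefficient(n_trp, n_tyr, n_cys):
        """Molar extinction coefficient of a protein from its residue counts."""
        return n_trp * EPS_TRP + n_tyr * EPS_TYR + n_cys * EPS_CYS

    # A hypothetical protein with 2 tryptophans, 5 tyrosines and 1 cystine:
    print(extinction_coefficient(2, 5, 1))        # -> 18575 M^-1 cm^-1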

III SOLUTION METHODOLOGY

3.1 Data Collection and Preprocessing

The actual data related to human proteins is accessed from the Human Protein Reference Database (HPRD). The HPRD represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome. All the information in HPRD has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data. The database provides information about protein function under the heading 'molecular class', covering all the major protein function categories; it includes approximately 162 classes of protein functions. From HPRD, the sequences related to five molecular classes are obtained. These are: Defensin (Def), Cell Surface Receptor (CSR), DNA Repair Protein (DRP), Heat Shock Protein (HSP) and Voltage Gated Channel (VGC). Various web-based tools are then used to derive SDFs from these sequences, as shown in Table 2.1. The SDFs are preprocessed by placing their values in particular value ranges to make them suitable for input to the classifier (a short binning sketch is given at the end of Section 3.2).

TABLE 2.1 SDFS OBTAINED FROM VARIOUS WEB TOOLS

Tool used          SDF obtained
ExPASy ProtParam   Extinction Coefficient; Hydrophobicity; No. of negatively charged residues; No. of positively charged residues
NetNglyc           N-glycosylation sites
NetOglyc           O-glycosylation sites
NetPhos            Ser, Thr and Tyr phosphorylation sites
PSI-Pred           Secondary Structure
PSORT              Subcellular Location
SignalP            Signal Peptide
TMHMM              Transmembrane Helices

3.2 Packages of SDFs

For creating packages of SDFs, the frequencies of the values of the SDFs are studied for each functional class. If a particular value of an SDF repeats very highly for a particular molecular class, then it is considered dominant for that class. On the basis of this dominancy, packages of SDFs are obtained. The packages of SDFs obtained are shown in Table 3.1.

TABLE 3.1 PACKAGES OF SDFS OBTAINED
[Check-mark matrix; the individual marks are not recoverable from the source. The rows are SDFs (ExPASy ProtParam: Nneg, Npos, Exc1, Exc2, Instability Index, Aliphatic Index, GRAVY; NetOGlyc: T, S; NetPhos: Ser, Thr, Tyr; SignalP: mean S, D; TMHMM: ExpAA, PredHel) and the columns Pack 1 to Pack 4 mark which SDFs belong to each package.]

The implementation of the data mining model that creates decision trees on the basis of the chosen packages of SDFs demonstrates that the use of more dominant SDFs in decision tree creation affects the depth of the tree. A decision tree with a maximum depth of eight is obtained by using this technique. This technique, however, involves the overhead of creating packages of SDFs by studying their dominancy for a particular molecular class.
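The preprocessing step of Section 3.1, placing SDF values into particular value ranges, amounts to simple binning. The sketch below is ours, not the paper's: the bin edges and range labels are illustrative assumptions (ProtParam does conventionally treat an instability index below 40 as stable), not the ranges used in the paper:

    # Sketch of Section 3.1 preprocessing: continuous SDF values are placed
    # into value ranges (bins) before being fed to the classifier.

    def discretize(value, edges, labels):
        """Map a numeric SDF value to the label of the range it falls in."""
        for edge, label in zip(edges, labels):
            if value <= edge:
                return label
        return labels[-1]

    # Illustrative cut-offs; the paper does not specify its ranges.
    bins = {"Instability Index": ([40.0], ["stable", "unstable"]),
            "GRAVY":             ([0.0],  ["hydrophilic", "hydrophobic"])}

    raw = {"Instability Index": 33.2, "GRAVY": 0.45}
    processed = {sdf: discretize(v, *bins[sdf]) for sdf, v in raw.items()}
    # processed -> {'Instability Index': 'stable', 'GRAVY': 'hydrophobic'}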

3.3 New Prediction Technique

The new prediction technique incorporates the effect of choosing dominant SDFs for decision tree creation during the entropy (or uncertainty) calculation itself. It overcomes the limitation of the model involving packages of SDFs, as it does not involve the overhead of creating the packages. The technique does not encode the information in terms of bits, as that is not required in this application. The technique considers the following factors for measuring uncertainty:

A) Uncertainty due to subset creation (S_u): the ratio of the number of samples of all classes having value j of attribute A to the total number of samples in S. Mathematically:

S_u = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s}    (5)

where j is a particular value of attribute A, v is the total number of values of attribute A and m is the total number of classes. This factor indicates the entropy (or uncertainty) caused by the creation of a subset for a value of an attribute [2]. If S_u is high, uncertainty is high. Thus:

U \propto S_u    (6)

B) Specificity (or certainty) of a value of an attribute for a particular class (S_p): the ratio of the number of samples of class i having value j of attribute A to the total number of samples of that particular class. Mathematically:

S_p = \sum_{j=1}^{v} \sum_{i=1}^{m} \frac{s_{ij}}{s_i}    (7)

where i is a particular molecular class, j is a particular value of attribute A, v is the total number of values of attribute A and m is the total number of classes. This factor indicates the certainty of a particular functional class for an attribute value; only values with S_p greater than zero (i.e., S_p > 0) contribute to the calculation of the uncertainty measure. If S_p is high, certainty is high and thus uncertainty is low:

U \propto 1/S_p    (8)

Combining equations (6) and (8), we get:

U \propto S_u / S_p    (9)

or

U = \frac{S_u}{k + S_p}, where k is a constant    (10)

For k = 1:

U = \frac{S_u}{1 + S_p}    (11)

Thus:

U = \sum_{j=1}^{v} \frac{(s_{1j} + \cdots + s_{mj})/s}{1 + \sum_{i=1}^{m} (s_{ij}/s_i)}    (12)

where U indicates the uncertainty measure, S_u is the uncertainty due to subset creation, and S_p is the specificity (or certainty) of an attribute-value for a particular class. The attribute with the least uncertainty measure is chosen as the best attribute by the new prediction technique during decision tree creation. Figure 3.1 shows the flowchart for best attribute selection on the basis of the least uncertainty measure by the new prediction technique.

Figure 3.1 Flowchart for Best Attribute Selection by New Prediction Technique
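A minimal Python sketch of the uncertainty measure follows. Because the layout of equation (12) is garbled in the source, one reading is assumed here: for each value j of the attribute, the per-value ratio S_u,j = (s_1j + ... + s_mj)/s is divided by 1 + S_p,j with S_p,j = sum_i s_ij/s_i, and these fractions are summed over j (with k = 1). The data layout matches the earlier information-gain sketch:

    from collections import Counter, defaultdict

    def uncertainty(samples, attribute):
        """U of eq. (12), under the per-value reading described above (k = 1)."""
        s = len(samples)
        class_totals = Counter(label for _, label in samples)      # s_i
        per_value = defaultdict(Counter)                           # s_ij
        for features, label in samples:
            per_value[features[attribute]][label] += 1
        u = 0.0
        for counts in per_value.values():                          # each value j
            su = sum(counts.values()) / s                          # Su_j
            sp = sum(counts[c] / class_totals[c] for c in counts)  # Sp_j > 0
            u += su / (1.0 + sp)
        return u

    # The attribute with the least uncertainty is chosen as the best attribute.
    samples = [({"Hydrophobicity": "low",  "ExpAA": "high"}, "CSR"),
               ({"Hydrophobicity": "low",  "ExpAA": "low"},  "Def"),
               ({"Hydrophobicity": "high", "ExpAA": "low"},  "Def")]
    best = min(["Hydrophobicity", "ExpAA"], key=lambda a: uncertainty(samples, a))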

Training Data

The training data consists of labeled feature vectors related to the five molecular classes of HPRD. Five sequences of each molecular class are accessed from HPRD. Through web-based bioinformatics tools, SDFs are obtained from these sequences and are processed into a form suitable for input to the classifier by placing them in particular value ranges. The processed SDFs and the molecular classes are then combined to form the labeled feature vectors (the training data).

IV RESULTS AND DISCUSSION

The implementation of the model demonstrates the influence of S_p (i.e., the specificity or certainty of an attribute-value for a particular class) on tree creation. A high value of S_p lowers the value of the uncertainty measure (as the uncertainty measure is proportional to 1/S_p) and thus contributes to the selection of the best attribute for tree creation. The depth of the decision tree is directly proportional to S_p:

Depth of Decision Tree \propto S_p    (13)

The greater the value of S_p for a particular value of an attribute, the greater the chance of developing a branch from that value to a functional class. The new decision tree induction technique provides a tree with greater depth than the existing methodology due to the consideration of S_p. The greater depth of the decision tree provides a greater number of tests before functional class assignment and hence provides more accurate prediction results. The implementation of the model shows that the computation of the uncertainty measure by the new prediction technique is superior to the entropy calculation in the existing technique.

V CONCLUSION

The data mining model for HPF prediction provides better classification rules for the same training data than the existing technique. The model creates a better-quality decision tree (in terms of depth) and hence ensures more accurate predictions than the existing methodology. The steps required for the use of the new prediction technique by the drug discoverer are clearly demonstrated; drug discoverers can easily use the model for predicting the functions of proteins that are responsible for various diseases in the human body. There is large scope for application of the new prediction technique in the drug discovery process due to its better quality and its clear representation of the learned classification rules.

VI REFERENCES

[1] D. Touretzky, "Basics of Information Theory", Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, 2004.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[3] H. Almuallim, Y. Akiba and S. Kaneda, "Development and Applications of Decision Trees".
[4] B. Boeckmann, A. Bairoch, et al., "The SWISS-PROT protein sequence database and its supplement TrEMBL", Nucleic Acids Res., 31(1), 2003, pp. 365-370.
[5] T. Elomaa, "In Defense of C4.5: Notes on Learning One-Level Decision Trees", in Proceedings of the 11th Intl. Conf. on Machine Learning, 1994, pp. 62-69.
[6] R. Kohavi and R. Quinlan, "Decision Tree Discovery", Data Mining, 2000, pp. 10-18.
[7] L. Jensen, "Prediction of Protein Function from Sequence Derived Protein Features", Ph.D. thesis, Technical University of Denmark, 2002.
[8] L. Jensen, et al., "Prediction of Human Protein Function from Post-Translational Modifications and Localization Features", Journal of Molecular Biology, 319(5), 2002, pp. 1257-1265.
[9] D. Krane and M. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002.
[10] E. Kretschmann, W. Fleischmann and R. Apweiler, "Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT", Bioinformatics, 17, 2001, pp. 920-926.
[11] R. King, A. Karwath, A. Clare and L. Dehaspe, "Accurate prediction of protein functional class in the M. tuberculosis and E. coli genomes using data mining", Comparative and Functional Genomics, 2000, pp. 283-293.
[12] J. R. Quinlan, "Induction of decision trees", Machine Learning, 1, pp. 81-106.
[13] P. Adriaans and D. Zantinge, Data Mining, Pearson Education, 2002.
[14] D. Devos and A. Valencia, "Practical Limits of Function Prediction", Protein Design Group, National Centre for Biotechnology, CNB-CSIC, E-28049 Madrid, Spain, 2000.
[15] J. Tamames, et al., "EUCLID: Automatic classification of proteins in functional classes by their database annotations", Bioinformatics, 1998, pp. 542-543.
[16] H. Almuallim, et al., "On handling tree-structured attributes in decision tree learning", in Proceedings of the 12th International Conference on Machine Learning (ICML95), Morgan Kaufmann.
[17] J. Friedman, R. Kohavi and Y. Yun, "Lazy decision trees", in Proceedings of the Thirteenth National Conference on Artificial Intelligence, AAAI Press and the MIT Press, pp. 717-724.
[18] L. Jensen, et al., "Prediction of human protein function according to Gene Ontology Categories", Bioinformatics, 2003, pp. 635-642.
[19] S. Safavian and D. Landgrebe, "A Survey of Decision Tree Classifier Methodology", IEEE Trans. Systems, Man and Cybernetics, 21(3), 1991, pp. 660-674.