
MS(CS) Final Thesis

Muhammad Rehan
1358-MSCS-08
Department of Computer Science
GC University Lahore


• Background
• Literature Review
• Problem Statement
• Hypothesis
• Methodology
• Result
• Future Work
• References

• Bioinformatics
  • Any application of computation in biology, including data management, algorithm development and data mining.
  • Bioinformatics is the field of science in which computer science, information technology, statistics and various branches of biology merge to form a single discipline.

What are Proteins

Diagram: roles of proteins
• Enzymes
• Antibodies
• Hormones
• Structural support and transportation

What are Proteins made of

Figure: General formula of an amino acid. The alpha carbon carries an amino group (H2N), a carboxyl group (COOH), a hydrogen atom (H) and a side chain R (for example H or CH3).

Sr. No | Amino Acid | Single Letter Code | Three Letter Code
1 | Alanine | A | Ala
2 | Arginine | R | Arg
3 | Serine | S | Ser
4 | Threonine | T | Thr
5 | Tyrosine | Y | Tyr
6 | Glycine | G | Gly
7 | Histidine | H | His
. | . | . | .
. | . | . | .
20 | Leucine | L | Leu

• Experimental Approach
  • In vivo
  • In vitro
  • In silico

• Post-Translational Modifications (PTMs)
  • Protein modification is essential for a protein's biological activity and for performing its intended task.
  • The modification is made by adding phosphate, glycosyl or other groups to certain amino acids.

PTM | Target Amino Acid | Description
Phosphorylation | S, T, Y, H | Addition of a phosphate group to S, T, Y or H
Glycosylation | S, T, N | Addition of a glycosyl group to S, T or N
Sulfation | Y | Addition of a sulfate group to Y
Acetylation | | Addition of an acetyl group, usually at the N-terminus of the protein
Methylation | K, R | Addition of a methyl group, usually at K or R residues

List of PTM Types
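
As a rough illustration of how the target amino acids in the table can be used, the sketch below lists the positions in a sequence that could carry a given PTM. The function name `candidate_sites` and the dictionary `PTM_TARGETS` are hypothetical names chosen for this example, and only the three PTMs with unambiguous targets in the table are included.

```python
# Target residues per PTM type, copied from the table of PTM types.
# Hypothetical helper names; this is a sketch, not part of BINS itself.
PTM_TARGETS = {
    "Phosphorylation": {"S", "T", "Y", "H"},
    "Glycosylation": {"S", "T", "N"},
    "Sulfation": {"Y"},
}


def candidate_sites(sequence, ptm):
    """Return the 1-based positions whose residue can carry the given PTM."""
    targets = PTM_TARGETS[ptm]
    return [i for i, residue in enumerate(sequence, start=1) if residue in targets]


if __name__ == "__main__":
    print(candidate_sites("SASNSTSYTS", "Phosphorylation"))
    print(candidate_sites("SASNSTSYTS", "Sulfation"))
```

A candidate site is only a residue of the right type; whether it is actually modified is what a predictor has to decide.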

Database Name | Description | Reference
PROSITE | Database of consensus patterns for various PTMs | Sigrist et al. (2002)
HPRD | Human protein reference database of disease-related proteins and their PTMs | Peri et al. (2003)
RESID | Database with a collection of annotations and structures for PTMs | Garavelli (2003)
PhosphoBase ELM | Database of validated phosphorylation sites | Diella et al. (2004)

List of PTM Databases

Sr. No | Statement | References
1 | Proteins often perform diverse and multiple functions. | (Jeffery, 1999)
2 | The proteome is more diverse and complex than the genome: the human genome has 22,000 to 25,000 genes, whereas the number of proteins exceeds 100,000. | (Nicolle H et al., 2007)
3 | Identifying protein functions and events relies mainly on the particular 3-D structure of the protein as well as on the occurrence of targeted amino acid modifications. | (Attwood, 2000)
4 | PTMs regulate various functions of proteins by effecting conformational changes, such as enzyme activation. | (Konstantinopoulos et al., 2007)
5 | Identifying phosphorylated serine, threonine and tyrosine residues using MS is not easy in vivo. | (Mann et al., 2002)

Sr. No | Statement | References
6 | Many methods have been developed within the field of proteomics, but these methods are still in their early stages. | (Blom, 2004)
7 | Applications of machine learning and statistics in bioinformatics have always played a core role in understanding proteomics and in the analysis of PTMs. ANN is one such approach that has been used extensively in biological sequence analysis. | (Qazi et al., 2006), (Wu, C.H., 1997)
8 | Most cellular proteins are regulated by reversible phosphorylation, and at least 30% of proteins carry such an alteration. | (Ficarro et al., 2002)


• DISPHOS
• PredPhospho
• GPS
• PPSP
• KinasePhos 1.0, KinasePhos 2.0
• NetPhos, NetPhosK
• Neural-genetic

Tools | Method | Serine | Threonine | Tyrosine
DISPHOS | Logistic regression | 76% | 81% | 83%
NetPhos | ANN | 69% | 72% | 61%
Neural-genetic | ANN | 75% | 82% | 79%
BPNN | ANN | 72% | 77% | 74%

Tools | Method | Kinase PKA | Kinase PKC
KinasePhos 2.0 | SVM | Sn=92% Sp=89% Acc=90% | Sn=84% Sp=86% Acc=85%
KinasePhos 1.0 | HMM | Sn=91% Sp=86% Acc=85% | Sn=80% Sp=87% Acc=83%
PredPhospho | SVM | Sn=88% Sp=91% Acc=90% | Sn=79% Sp=86% Acc=83%
GPS | GPS | Sn=91% Sp=89% Acc=90% | Sn=82% Sp=83% Acc=82%
PPSP | BDT | Sn=90% Sp=92% Acc=91% | Sn=82% Sp=86% Acc=84%

• Develop a new method, BINS, that evolves a new classification model by learning from amino acid sequence data with a machine learning method, the artificial neural network. BINS is intended to improve the prediction specificity, efficiency and accuracy of the machine learning simulator GEARS (Genetic Evaluation of Classifier by Learning Residue Rules and Sequences).

• The BINS classification method will reduce false negative and false positive predictions.
• The BINS method gives highly accurate predictions of PTMs, covering the specific sites affected and the kinases that act at each site, and discloses important biological information from noisy data.
• The BINS method can give better results than the existing PTM prediction methods.

• An empirical research methodology with the Exploratory Development Life Cycle will be used to develop the BINS model.

• BINS consists of three main parts:
  • BINS Data Preparation Module
  • BINS Bootstrapping Module
  • BINS ANN Module

Figure: BINS architecture. Labels recoverable from the diagram:
• Modules: BINS Data Preparation Module, BINS Bootstrapping Module, BINS ANN Module
• PTMs database; removal of duplicate instances
• Protein database grouped by target classes; protein database grouped by non-modified target classes
• Peptide generator
• Peptide datasets grouped by class
• Sparse encoding; merging of the sparse-encoded datasets grouped by modified and non-modified target classes
• Training and validation dataset generator (training dataset generator, validation dataset generator)
• Topology and network configuration
• Training and validation, each reporting [SN] [SP] [Acc] [MCC]

• BINS Data Preparation Module
  • BINS Database Inconsistency Analyzing Utility
  • BINS Balance Inverted Site Application
  • BINS Peptide Extraction Application

• BINS Data Preparation Module


PID | Sequences | Position | Amino Acid | Modification
O08539 | ASTSMNSYTLKSYA. | 4 | S | S
O14543 | MVTHSKFPAAGS. | 3 | T | T
O14746 | MPRAPRCRAVSTA | 11 | S | S
O14920 | MSWYPSLTQTC. | 4 | Y | Y
O15117 | ELSFKQGEQIYTA. | 3 | S | S

• BINS Data Preparation Module


Target site | No. of proteins | No. of positive peptides | No. of negative peptides | No. of balanced negative peptides | No. of merged positive and balanced negative peptides
S | 5431 | 14467 | 326396 | 14837 | 29304
T | 1940 | 2907 | 35795 | 2983 | 5890
Y | 1156 | 2208 | 16273 | 2325 | 4533
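
The counts above show far more negative than positive peptides, and the merged set is the sum of the positives and the balanced negatives. The slides do not say how the balanced negatives are chosen; the sketch below assumes plain random undersampling down to the number of positives, which is only one possible strategy (the reported balanced counts are in fact slightly larger than the positive counts). The function name `balance_negatives` is hypothetical.

```python
import random


def balance_negatives(positives, negatives, seed=0):
    """Randomly keep as many negative peptides as there are positives.

    Assumption: simple undersampling; the actual BINS balancing rule is not
    described in the slides.
    """
    rng = random.Random(seed)
    keep = min(len(positives), len(negatives))
    return rng.sample(negatives, keep)


# Counts reported in the table: (positives, balanced negatives, merged).
REPORTED = {
    "S": (14467, 14837, 29304),
    "T": (2907, 2983, 5890),
    "Y": (2208, 2325, 4533),
}

if __name__ == "__main__":
    for site, (pos, balanced_neg, merged) in REPORTED.items():
        # The merged count should equal positives plus balanced negatives.
        print(site, pos + balanced_neg == merged)
```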

• BINS Data Preparation Module
  • BINS Database Inconsistency Analyzing Utility
PID | Sequences | Position | Amino Acid | Modification | Length
O08539 | ASTSMNSYTLKSYA. | 4 | S | S | 350
O14543 | MVTHSKFPAAGS. | 3 | T | T | 1030
O14746 | MPRAPRCRAVSTA | 11 | S | S | 1250
O14920 | MSWYPSLTQTC. | 4 | Y | Y | 735
O15117 | ELSFKQGEQIYTA. | 3 | S | S | 952
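
The slides do not list which inconsistencies the utility looks for. A minimal sketch, assuming two plausible checks (the annotated position lies inside the protein, and the residue at that position matches the annotated amino acid), could look like this. `SiteRecord` and `find_inconsistencies` are hypothetical names, and the sequences are the truncated ones shown in the table.

```python
from dataclasses import dataclass


@dataclass
class SiteRecord:
    pid: str
    sequence: str    # may be truncated, as in the table
    position: int    # 1-based position of the annotated site
    amino_acid: str  # annotated target residue
    length: int      # full protein length


def find_inconsistencies(record):
    """Return a list of problems found in one annotated site (assumed checks)."""
    problems = []
    if not 1 <= record.position <= record.length:
        problems.append("position outside protein length")
    elif record.position <= len(record.sequence):
        # Only compare residues when the (possibly truncated) sequence covers the site.
        if record.sequence[record.position - 1] != record.amino_acid:
            problems.append("residue at position does not match annotated amino acid")
    return problems


if __name__ == "__main__":
    row = SiteRecord("O08539", "ASTSMNSYTLKSYA", 4, "S", 350)
    print(find_inconsistencies(row))
```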

• BINS Data Preparation Module
  • BINS Invert Application
PID | Sequences | Position | Amino Acid | Modification | Length
O08539 | ASTSMNSYTLKSYA. | 2 | S | S | 350
O08539 | ASTSMNSYTLKSYA. | 7 | S | S | 350
O08539 | ASTSMNSYTLKSYA. | 12 | S | S | 350
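
The rows above are consistent with taking every S in the sequence that is not the annotated modified site (position 4 for O08539). A minimal sketch of that inversion, with the hypothetical function name `inverted_sites`:

```python
def inverted_sites(sequence, target, modified_positions):
    """Return 1-based positions of `target` residues not annotated as modified."""
    return [
        i
        for i, residue in enumerate(sequence, start=1)
        if residue == target and i not in modified_positions
    ]


if __name__ == "__main__":
    # O08539, truncated sequence from the slide; position 4 is the modified serine.
    print(inverted_sites("ASTSMNSYTLKSYA", "S", {4}))  # [2, 7, 12]
```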

• BINS Data Preparation Module
  • BINS Peptide Extraction Application
Peptide ID | Extended Sequences | Class
O08539-2 | -,-,A,S,T,S,M,N,S,Y,T,L,K,S | 0.1
O08539-7 | -,-,A,S,T,S,M,N,S,Y,T,L,K,S,Y,A,-,- | 0.1

Each extended sequence is also split into per-position columns P-10, P-9, P-8, ..., P0, ..., P9, P10.
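
The column names P-10 to P10 suggest a window of 21 residues centred on the site, padded with "-" where the window runs past the end of the protein. A minimal sketch under that assumption follows; `extract_peptide` is a hypothetical name, and the example uses the truncated sequence from the slide, so the padding on the right would not occur with the full 350-residue protein.

```python
def extract_peptide(sequence, position, flank=10, pad="-"):
    """Return the window of 2*flank+1 residues centred on a 1-based position.

    Positions outside the sequence are filled with the pad character.
    """
    centre = position - 1
    window = []
    for i in range(centre - flank, centre + flank + 1):
        window.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(window)


if __name__ == "__main__":
    print(extract_peptide("ASTSMNSYTLKSYA", 2))
    print(extract_peptide("ASTSMNSYTLKSYA", 7))
```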

• BINS Bootstrapping Module
  • BINS Training Dataset Encoding Manager
  • BINS Data Table Merging Utility
  • BINS Boot Strapping Application

• BINS Bootstrapping Module
  • Sparse Encoding Scheme
Amino Acid | Coding Scheme
A | 10000000000000000000
C | 01000000000000000000
D | 00100000000000000000
E | 00010000000000000000
F | 00001000000000000000
G | 00000100000000000000
H | 00000010000000000000
I | 00000001000000000000
. | .
. | .
- | 00000000000000000000
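
The scheme above is a 20-bit one-hot code per residue, with the padding symbol "-" encoded as all zeros. A minimal sketch is shown below; the slide only lists the codes for A to I, so the alphabetical order of the remaining residues in `ALPHABET` is an assumption, and the function names are hypothetical.

```python
# Assumption: residues follow alphabetical one-letter order beyond the rows shown.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def encode_residue(residue):
    """Return the 20-bit sparse code of one residue; unknown symbols such as '-' give all zeros."""
    bits = [0] * len(ALPHABET)
    index = ALPHABET.find(residue)
    if index >= 0:
        bits[index] = 1
    return bits


def encode_peptide(peptide):
    """Concatenate the sparse codes of every position in the peptide."""
    encoded = []
    for residue in peptide:
        encoded.extend(encode_residue(residue))
    return encoded


if __name__ == "__main__":
    print("".join(str(bit) for bit in encode_residue("C")))
    print(len(encode_peptide("---------ASTSMNSYTLKS")))
```

With a 21-residue window this gives 21 x 20 = 420 inputs per peptide.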

• BINS Bootstrapping Module
  • BINS Training Dataset Encoding Manager
Peptide ID | Extended Sequences | Class
O08539-2 | -,-,A,S,T,S,M,N,S,Y,T,L,K,S | 0.1
O08539-7 | -,-,A,S,T,S,M,N,S,Y,T,L,K,S,Y,A,-,- | 0.1

After encoding, each position is split into per-bit columns P-10-1, P-10-2, P-10-3, ..., P10-8, P10-9, P10-10.

• BINS Bootstrapping Module
  • BINS DataTable Merging Utility
Peptide ID | Extended Sequences | Class
O08539-2 | -,-,A,S,T,S,M,N,S,Y,T,L,K,S | 0.1
O08539-7 | -,-,A,S,T,S,M,N,S,Y,T,L,K,S,Y,A,-,- | 0.9

The merged table keeps the per-bit columns P-10-1, P-10-2, P-10-3, ..., P10-8, P10-9, P10-10.
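
The Class column carries the values 0.1 and 0.9. The slides do not state which value stands for which class; the sketch below assumes 0.9 for modified and 0.1 for non-modified peptides, and shows one simple way to merge the two encoded tables into a single shuffled dataset. `merge_datasets` is a hypothetical name.

```python
import random

# Assumed class targets: 0.9 = modified, 0.1 = non-modified.
MODIFIED_TARGET = 0.9
NON_MODIFIED_TARGET = 0.1


def merge_datasets(modified, non_modified, seed=0):
    """Label both encoded tables, concatenate them and shuffle the rows."""
    rows = [(vector, MODIFIED_TARGET) for vector in modified]
    rows += [(vector, NON_MODIFIED_TARGET) for vector in non_modified]
    random.Random(seed).shuffle(rows)
    return rows


if __name__ == "__main__":
    merged = merge_datasets([[1, 0, 0]], [[0, 1, 0], [0, 0, 1]])
    for vector, target in merged:
        print(vector, target)
```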

• BINS Bootstrapping Module
  • BINS Boot Strapping Application

• BINS ANN Module

• Evaluation Strategy
  • Sn = TP/(TP+FN)
  • Sp = TN/(TN+FP)
  • Acc = (Sn+Sp)/2
  • MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))

• Evaluation Strategy
PID | Sequence | Position | Target | Classify
O3265 | SASNSTSYTS | 3 | Mod | TP
O3265 | SASNSTSYTS | 10 | Mod | FN
O3265 | SASNSTSYTS | 1 | Non-mod | TN
O3265 | SASNSTSYTS | 5 | Non-mod | FP
O3265 | SASNSTSYTS | 7 | Non-mod | TN
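
The evaluation measures can be computed directly from the four outcome counts. The sketch below uses Sn, Sp and Acc as defined in the slides; the MCC expression is the standard Matthews correlation coefficient, which is an assumption here because the slide's own MCC formula did not survive extraction. Undefined values are returned as `None`. The function names are hypothetical.

```python
import math


def outcome(is_modified, predicted_modified):
    """Classify one prediction as TP, FN, FP or TN."""
    if is_modified:
        return "TP" if predicted_modified else "FN"
    return "FP" if predicted_modified else "TN"


def metrics(tp, tn, fp, fn):
    """Return Sn, Sp, Acc = (Sn+Sp)/2 and MCC; undefined values are None."""
    sn = tp / (tp + fn) if tp + fn else None
    sp = tn / (tn + fp) if tn + fp else None
    acc = (sn + sp) / 2 if sn is not None and sp is not None else None
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else None
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # The five example sites above give 1 TP, 1 FN, 2 TN and 1 FP.
    print(metrics(tp=1, tn=2, fp=1, fn=1))
```

When no site is predicted as modified (TP = FP = 0), Sn is 0, Sp is 1 and the MCC is undefined.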

BINS Serine Result

Sr. No | Training Ac | Training Sn | Training Sp | Training MCC | Validation Ac | Validation Sn | Validation Sp | Validation MCC
1 | 0.965 | 1 | 0.931 | 0.932 | 0.497 | 0 | 1 | None
2 | 0.984 | 0.982 | 0.987 | 0.969 | 0.805 | 0.612 | 0.999 | 0.662
3 | 0.996 | 0.996 | 0.996 | 0.992 | 0.807 | 0.619 | 0.995 | 0.663
4 | 0.995 | 0.995 | 0.996 | 0.991 | 0.807 | 0.622 | 0.995 | 0.664
5 | 0.998 | 0.998 | 0.998 | 0.996 | 0.807 | 0.616 | 0.999 | 0.665
6 | 0.996 | 0.996 | 0.996 | 0.992 | 0.809 | 0.628 | 0.991 | 0.663

BINS Threonine Result

Sr. No | Training Ac | Training Sn | Training Sp | Training MCC | Validation Ac | Validation Sn | Validation Sp | Validation MCC
1 | 0.972 | 0.963 | 0.981 | 0.946 | 0.826 | 0.688 | 0.965 | 0.680
2 | 0.987 | 0.986 | 0.989 | 0.975 | 0.834 | 0.737 | 0.932 | 0.683
3 | 0.987 | 0.986 | 0.988 | 0.974 | 0.825 | 0.750 | 0.901 | 0.658
4 | 0.987 | 0.986 | 0.987 | 0.974 | 0.827 | 0.771 | 0.884 | 0.659
5 | 0.987 | 0.986 | 0.987 | 0.974 | 0.824 | 0.774 | 0.875 | 0.653
6 | 0.986 | 0.986 | 0.986 | 0.972 | 0.822 | 0.772 | 0.872 | 0.648
7 | 0.988 | 0.990 | 0.987 | 0.977 | 0.825 | 0.761 | 0.890 | 0.657
8 | 0.989 | 0.989 | 0.990 | 0.979 | 0.823 | 0.768 | 0.880 | 0.652
9 | 0.990 | 0.990 | 0.990 | 0.980 | 0.822 | 0.770 | 0.874 | 0.647
10 | 0.990 | 0.989 | 0.991 | 0.980 | 0.821 | 0.770 | 0.871 | 0.645

BINS Tyrosine Result

Sr. No | Training Ac | Training Sn | Training Sp | Training MCC | Validation Ac | Validation Sn | Validation Sp | Validation MCC
1 | 0.966 | 0.952 | 0.979 | 0.933 | 0.846 | 0.735 | 0.951 | 0.705
2 | 0.972 | 0.961 | 0.983 | 0.945 | 0.843 | 0.741 | 0.939 | 0.697
3 | 0.977 | 0.975 | 0.979 | 0.955 | 0.836 | 0.778 | 0.891 | 0.675
4 | 0.976 | 0.975 | 0.977 | 0.953 | 0.837 | 0.780 | 0.890 | 0.676
5 | 0.975 | 0.973 | 0.976 | 0.950 | 0.831 | 0.779 | 0.881 | 0.665
6 | 0.973 | 0.970 | 0.976 | 0.947 | 0.829 | 0.779 | 0.877 | 0.661
7 | 0.974 | 0.971 | 0.977 | 0.948 | 0.828 | 0.778 | 0.876 | 0.659
8 | 0.974 | 0.973 | 0.975 | 0.948 | 0.828 | 0.779 | 0.875 | 0.659
9 | 0.974 | 0.974 | 0.975 | 0.949 | 0.826 | 0.778 | 0.872 | 0.654
10 | 0.977 | 0.974 | 0.980 | 0.955 | 0.825 | 0.768 | 0.879 | 0.652

BINS Comparison with Other Methods

Algorithm | S Sn | S Sp | S Acc | T Sn | T Sp | T Acc | Y Sn | Y Sp | Y Acc
BINS | 63% | 99% | 81% | 74% | 93% | 83% | 74% | 95% | 85%
NetPhos | 81% | 57% | 69% | 66% | 77% | 72% | 70% | 68% | 69%
DISPHOS | NA | NA | 76% | NA | NA | 81% | NA | NA | 83%
BPNN | 72% | 72% | 72% | 78% | 77% | 78% | 75% | 75% | 75%
Neural-genetic | 76% | 74% | 75% | 81% | 84% | 83% | 81% | 78% | 79%

BINS has been developed as a desktop application, and the current version has no online (WWW) support. Growing opportunities on the internet nevertheless call for an online version, which would widen the scope of the application and make it available to multiple clients in different regions of the world. This effort would not only help to enhance the embedded capability of BINS for efficient PTM prediction but could also become a major resource for multi-national research collaborations. BINS is a sub-module of GEARS, so the next version will learn and optimize the parameters and weights of the ANN with a genetic algorithm. BINS will then be integrated with other GEARS modules such as MAPRes and HMM, to classify protein data as well as possible by weighing the pros and cons of each technique.

• Jeffery C.J. Moonlighting proteins, Trends Biochem. Sci., 24:8-11, 1999.
• Bork P., Dandekar T., Diaz-Lazcoz Y., Eisenhaber F., Huynen M. and Yuan Y. Predicting function: from genes to genome and back, J. Mol. Biol., 283:707-725, 1998.
• Attwood T. The quest to deduce protein function from sequence: the role of pattern databases, Int. J. Biochem. Cell Biol., 32:139-155, 2000.
• Mann M., Ong S., Gronborg M., Steen H. et al., Trends Biotechnol., 20:261-268, 2002.
• Wu C.H., Comput. Chem., 21:237-256, 1997.
• Blom N., Sicheritz-Ponten T., Gupta R., Gammeltoft S. and Brunak S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence, Proteomics, 4:1633-1649, 2004.