Protein sequence
- large numbers of sequences, including whole genomes
Knowledge of protein function is important in:
- rational drug design and treatment of disease
- protein and genetic engineering
- building networks to model cellular pathways
- studying organismal function and evolution
What is the function of a protein?
[Flowchart: Full Genomes (Genome Databases) → Gene Prediction → Putative Proteins (Unknown). Putative Proteins and Protein Databases (Known) → Search Method → Function Prediction → Annotated Proteins]
Putative protein: a protein whose existence has been predicted, typically by
bioinformatics tools and servers, but for which there is no experimental evidence.
Classification of Protein Databases
Structure
  Database: PDB & CATH etc.
  Algorithms: Threading
Pattern
  Database: PROSITE & Pfam etc.
  Algorithms: Motifs & InterPro
Sequence
  Homology Search
(The whole game rests on this.)
Algorithms for Prediction of Protein Function
Link Functionally Related Proteins by Phylogenetic Profiles
Genomes: E. coli (EC), S. cerevisiae (SC), B. subtilis (BS), H. influenzae (HI)

Phylogenetic profiles (1 = protein present in the genome, 0 = absent):
      EC  SC  BS  HI
P1     1   1   0   1
P2     1   1   1   0
P3     1   0   1   1
P4     1   1   0   0
P5     1   1   1   1
P6     1   0   1   1
P7     1   1   1   0
Pellegrini et al, Proc. Natl. Acad. Sci. USA (1999) 96, 4285
Profile Clusters
P4        1 1 0 0
P2, P7    1 1 1 0
P1        1 1 0 1
P5        1 1 1 1
P3, P6    1 0 1 1
Proteins with identical profiles (P2 and P7; P3 and P6) fall into the same cluster.
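The grouping of proteins by profile can be sketched in a few lines of Python. This is a minimal illustration using the seven example profiles from the slide, not the published implementation; the function names `cluster_by_profile` and `hamming` are made up for this sketch.

```python
from collections import defaultdict

# Genome order of each profile: EC, SC, BS, HI (values from the slide).
PROFILES = {
    "P1": (1, 1, 0, 1),
    "P2": (1, 1, 1, 0),
    "P3": (1, 0, 1, 1),
    "P4": (1, 1, 0, 0),
    "P5": (1, 1, 1, 1),
    "P6": (1, 0, 1, 1),
    "P7": (1, 1, 1, 0),
}


def cluster_by_profile(profiles):
    """Group protein names that share exactly the same presence/absence profile."""
    clusters = defaultdict(list)
    for protein, profile in profiles.items():
        clusters[profile].append(protein)
    return {profile: sorted(names) for profile, names in clusters.items()}


def hamming(a, b):
    """Number of genomes in which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))


clusters = cluster_by_profile(PROFILES)
# Clusters with more than one member suggest a functional link.
linked = sorted(names for names in clusters.values() if len(names) > 1)
print(linked)
```

Exact matching is the simplest possible criterion. The `hamming` helper shows how near-identical profiles could also be compared, for example `hamming(PROFILES["P1"], PROFILES["P5"])` counts the genomes in which P1 and P5 differ.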
Overbeek et al, Proc. Natl. Acad. Sci. USA (1999) 96, 2896
Functional clusters in the glycolysis pathway
Using Protein Fusions to Predict Protein-Protein
Interactions
[Figure: fused A-B domains]
Two objectives:
(1) to identify distant relatives of the MJ0577 family and
(2) to gain insight into the function of this family of proteins.
Database: ‘nr’
Expect: 1
Low complexity filter: On
Matrix: BLOSUM62
Descriptions: 250
Alignments: 100
Inclusion Threshold: 0.001
Examine Descriptions
In a PSI-BLAST search, hits are divided into two categories:
• Those with E values better than the inclusion threshold (listed first)
• Those with E values worse than the threshold but still better than 1
(the Expect value selected on the query page), listed further down
the page
Hits with E values better than the threshold are used in forming the
profile that will be used in subsequent PSI-BLAST iterations
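The two-way split of hits can be sketched as follows. This is an illustration only: the hit names and E values are hypothetical, and treating the cutoffs as strict inequalities is an assumption of this sketch rather than documented PSI-BLAST behaviour. The thresholds 0.001 and 1 are the ones listed in the search settings.

```python
INCLUSION_THRESHOLD = 0.001  # "Inclusion Threshold" from the search settings
EXPECT = 1.0                 # "Expect" from the search settings

# Hypothetical (name, E value) pairs, for illustration only.
hits = [
    ("hit_a", 2e-14),
    ("hit_b", 5e-4),
    ("hit_c", 0.02),
    ("hit_d", 0.8),
    ("hit_e", 3.0),
]


def partition_hits(hits, inclusion=INCLUSION_THRESHOLD, expect=EXPECT):
    """Split hits into those used for the profile and those only reported."""
    used_in_profile = [name for name, e in hits if e < inclusion]
    listed_only = [name for name, e in hits if inclusion <= e < expect]
    return used_in_profile, listed_only


used_in_profile, listed_only = partition_hits(hits)
print("used to build the profile:", used_in_profile)
print("listed further down the page:", listed_only)
```

Hits at or above the Expect value (here `hit_e`) are not reported at all in this sketch.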
For example:
MJ0577 shows similarity to the 780 aa cationic amino acid
transporter. On reverse BLAST:
• Similarity to amino acid transporters resides, for the most part,
between aa 45 and aa 440.
• Modest similarity to MJ0577 is localized to a separate
region of the protein, aa 550 - aa 720.
MJ0577 is probably a member of the Universal Stress Protein
family or a putative transcriptional regulator, because:
• Similarity is reasonable
• Alignments are respectable
The fact that MJ0577 and uspA each bring up the other as a high
scoring hit in a PSI-BLAST search lends confidence to the
hypothesis that MJ0577 may be a distant relative of the uspA
family.
Second, Third & subsequent iterations
In the subsequent PSI-BLAST iterations the E-value of the UspA hit
keeps decreasing (the hit becomes more significant) and finally converges:
BLAST hit 2e-14
1st 6e-14
2nd 8e-38
3rd 3e-41
4th 1e-40
5th 4e-41
6th 3e-41
The nearly constant E-values in the 3rd, 4th, 5th and 6th iterations
indicate that the E-value for UspA has converged.
Therefore, MJ0577 may be a distant relative of the uspA
family.
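One way to make "converged" concrete is to look at the change in log10(E) between successive iterations. The sketch below uses the E-values listed on the slide; the tolerance of one order of magnitude and the function name `first_converged` are assumptions chosen for illustration, not a PSI-BLAST criterion.

```python
import math

# E-values of the UspA hit per iteration, as listed on the slide.
E_VALUES = [
    ("BLAST hit", 2e-14),
    ("1st", 6e-14),
    ("2nd", 8e-38),
    ("3rd", 3e-41),
    ("4th", 1e-40),
    ("5th", 4e-41),
    ("6th", 3e-41),
]


def first_converged(e_values, tolerance=1.0):
    """Return the label of the first iteration from which every later step
    changes log10(E) by less than `tolerance`, or None if there is none."""
    logs = [math.log10(e) for _, e in e_values]
    for i in range(len(logs) - 1):
        steps = (abs(logs[j + 1] - logs[j]) for j in range(i, len(logs) - 1))
        if all(step < tolerance for step in steps):
            return e_values[i][0]
    return None


print(first_converged(E_VALUES))
```

With this tolerance the values are stable from the 3rd iteration onward, which matches the reading given on the slide.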
Other Evidence
Domains, on the other hand, are regions of a protein that have a specific function and can (usually)
function independently of the rest of the protein.
A protein domain is a conserved part of a given protein sequence and tertiary structure that can
evolve, function, and exist independently of the rest of the protein chain. Each domain forms a
compact three-dimensional structure and often can be independently stable and folded.
Function Prediction:pattern matching
Protein Sequence Motifs
PROSITE
Biologically-significant protein patterns and profiles
http://www.expasy.ch/prosite/
Pfam
Multiple sequence alignments and hidden Markov models of common
protein domains
http://www.sanger.ac.uk/Software/Pfam/
PRINTS
Protein sequence motifs and signatures
http://bioinf.man.ac.uk/dbbrowser/PRINTS/
COGs
Phylogenetic pattern based
http://www.ncbi.nlm.nih.gov/COG/
BLOCKS
Protein sequence motifs and alignments
http://www.blocks.fhcrc.org/
Function Prediction:pattern matching
PROSITE database
• In protein sequence families, some regions have been found better conserved
than others during evolution.
• These regions are generally important for the function of a protein and/or for
the maintenance of its 3-dimensional structure.
• Try to find a short conserved sequence (not more than four or five
residues long) in this region; this is called the ‘core’ pattern
• Scan a recent version of the SWISS-PROT knowledgebase with
the core pattern(s)
• If a ‘core’ pattern detects all the proteins under consideration and
none (or very few) of the other proteins, we can stop at this stage
and use the core pattern as a bona fide signature
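The acceptance test for a core pattern can be sketched as a count of family members found and unrelated proteins wrongly matched. Everything in this example is hypothetical: the sequences, the pattern `GC.H` (written as a Python regular expression) and the function names are invented for illustration and do not come from PROSITE or SWISS-PROT.

```python
import re

# Toy data, for illustration only.
family = {"fam1": "MKTAGCGHSLLE", "fam2": "AAGCAHPQ", "fam3": "TTGCNHWW"}
others = {"bg1": "MKLVAAGGTT", "bg2": "PPGCHAAL"}


def evaluate_core_pattern(regex, family, others):
    """Return (true positives, false positives, missed family members)."""
    pattern = re.compile(regex)
    true_pos = [name for name, seq in family.items() if pattern.search(seq)]
    false_pos = [name for name, seq in others.items() if pattern.search(seq)]
    missed = [name for name in family if name not in true_pos]
    return true_pos, false_pos, missed


def is_acceptable_signature(regex, family, others, max_false_positives=0):
    """Accept the pattern if it finds every family member and at most
    `max_false_positives` of the other proteins."""
    _, false_pos, missed = evaluate_core_pattern(regex, family, others)
    return not missed and len(false_pos) <= max_false_positives


print(evaluate_core_pattern("GC.H", family, others))
print(is_acceptable_signature("GC.H", family, others))
print(is_acceptable_signature("GC", family, others))  # too short, hits bg2
```

The `max_false_positives` parameter stands in for the "none (or very few)" allowance in the last bullet.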
Function Prediction:pattern matching
Methodology for the development
of profile entries
Function Prediction:pattern matching
Input example:
>gi|21264520 Tyrosine-protein kinase SRC-1 (p60-SRC-1)
MGATKSKPREGGPRSRSLDIVEGSHQPFTSLSASQTPNKSLDSHRPPAQPFGGNC
DLTPFGGINFSDTITSPQRTGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVN
NTEGDWWLARSLSSGQTGYIPSNYVAPSDSIQAEEWYLGKITRREAERLLLSLE
NPRGTFLVRESETTKGAYCLSVSDYDANRGLNVKHYKIRKLDSGGFYITSRTQFI
SLQQLVAYYSKHADGLCHRLTTVCPTAKPQTQGLSRDAWEIPRDSLRLELKLGQ
GCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYA
VVSEEPIYIVTEYISKGSLLDFLKGEMGRYLRLPQLVDMAAQIASGMAYVERMN
YVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAAL
YGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPDCPE
SLHDLMFQCWRKDPEERPTFEYLQAFLEDYFTATEPQYQPGDNL
Function Prediction:pattern matching
Output
gi-21264520.pep Check: 6666 Length: 532 ! gi|21264520 Tyrosine-protein kinase SRC-1
Protein_Kinase_Atp (L,I,V)G~(P)G~(P)(F,Y,W,M,G,S,T,N,H)(S,G,A)~(P,W)(L,I,V,C,A,T)
~(P,D)x(G,S,T,A,C,L,I,V,M,F,Y)x{5,18}(L,I,V,M,F,Y,W,C,S,T,A,R)(A,I,V,P)(L,I,V,M,F,
A,G,C,K,R)K(L)G~PG~P(F)(G)~(P,W)(V)~(P,D)x(G)x{7}(V)(A)(I)K
PROSITE allows the following pattern elements. For example:
[FILV]Qxxx[RK]Gxxx[RK]xK{ST}x[FILVWY]{P}
where
x signifies any amino acid.
Square brackets indicate alternatives: any one of the enclosed amino acids is
accepted at that position.
A string of characters drawn from the alphabet and enclosed in braces (curly
brackets) denotes any amino acid except for those in the string. For
example, {ST} denotes any amino acid other than S or T.
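These three elements map directly onto regular-expression syntax. The sketch below handles only the compact notation used in the example pattern above (plain letters, `x`, square brackets, braces); it is not a parser for the full PROSITE syntax, and the function name and test sequences are invented for illustration.

```python
import re


def pattern_to_regex(pattern):
    """Translate the compact pattern notation into a Python regex string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "x":                      # any amino acid
            out.append(".")
            i += 1
        elif ch == "[":                    # one of the listed residues
            end = pattern.index("]", i)
            out.append("[" + pattern[i + 1:end] + "]")
            i = end + 1
        elif ch == "{":                    # any residue except those listed
            end = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:end] + "]")
            i = end + 1
        elif ch.isalpha() and ch.isupper():  # a single fixed residue
            out.append(ch)
            i += 1
        else:
            raise ValueError(f"unsupported pattern element: {ch!r}")
    return "".join(out)


regex = pattern_to_regex("[FILV]Qxxx[RK]Gxxx[RK]xK{ST}x[FILVWY]{P}")
print(regex)

# Hypothetical sequences: the second has S at the position excluded by {ST}.
print(bool(re.search(regex, "LQAAAKGAAARAKAAFA")))
print(bool(re.search(regex, "LQAAAKGAAARAKSAFA")))
```

Note that `.` in a regex matches any character, so the sketch assumes the input sequence contains only amino acid letters.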
Input example:
Output
Starting search. Estimated time: 14 seconds (assuming all Wulfpack nodes are running).
Please wait...
Pfam HMM search results, glocal+local alignments merged (Pfam_ls+Pfam_fs)
[Go here for an explanation of the format of the results]
Description:
Function Prediction :Pattern Matching
• PROSITE
• PRINTS
• PDB
• BLOCKS
• SMART
• PRODOM
• InterPro &
• CDD
Function Prediction :Pattern Matching
• Overall, the PRINTS database is still relatively small, largely because the
detailed annotation of fingerprints is so time-consuming. Nevertheless, the
additional, unique information contained in PRINTS makes this database
a useful adjunct to PROSITE, Pfam, Blocks, etc.
Function Prediction :Pattern Matching
• TMAP www.mbb.ki.se/tmap/
• Structural Databases
Relationship between sequence and structure
Example:
Glycine occupies a special position: it has the
smallest side chain, a single hydrogen atom,
and can therefore increase the local flexibility
of the protein structure.
• Comparative modeling
– Clear sequence relationship with an experimentally
determined structure
• Fold recognition
– No easily detectable sequence relationship
– But the fold is known
• Ab initio prediction
– No direct use of known structures
Comparative Modeling
Tertiary Structure Prediction Methods
Structure selection → Structure refinement → Final Structure
Comparative Modeling
• Loops
– Database methods and local ab initio approaches
• Refinement
– Main chain conformation produced by copying the
template is approximate (rmsd ~1.5 Å)
– Need some version of energy minimisation
– Still very poor at prediction
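The rmsd quoted above is the root-mean-square deviation between matched atom positions. A minimal sketch of the calculation is shown below. It assumes the two coordinate sets are already superimposed and listed in the same atom order, and the coordinates are hypothetical values chosen so that the result is easy to check.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) coordinates, without any superposition step."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical main-chain positions (in Å); the model is the template
# shifted by 1.5 Å along z, so the rmsd is 1.5 Å.
template = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(x, y, z + 1.5) for x, y, z in template]
print(round(rmsd(template, model), 3))
```

In practice the structures are optimally superimposed before the deviation is computed; that step is left out here to keep the example short.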
Rotamers are usually defined as low-energy side-chain conformations. The use of a
library of rotamers allows anyone determining or modeling a structure to try the
most likely side-chain conformations, saving time and producing a structure that is more
likely to be correct.
Looking at Structures: Resolution
• Resolution is a measure of the quality of the data that has been collected on the
crystal containing the protein or nucleic acid.
• If all of the proteins in the crystal are aligned in an identical way, forming a very
perfect crystal, then all of the proteins will scatter X-rays the same way, and the
diffraction pattern will show the fine details of the crystal. On the other hand, if the
proteins in the crystal are all slightly different, due to local flexibility or motion,
the diffraction pattern will not contain as much fine information.
• So resolution is a measure of the level of detail present in the diffraction pattern
and the level of detail that will be seen when the electron density map is
calculated.
• High-resolution structures, with resolution values of 1 Å or so, are highly ordered
and it is easy to see every atom in the electron density map.
• Lower-resolution structures, with resolution of 3 Å or higher, show only the basic
contours of the protein chain, and the atomic structure must be inferred. Most
crystallographically defined structures of proteins fall in between these two extremes.
• As a general rule of thumb, we have more confidence in the location of atoms in
structures with resolution values that are small, called "high-resolution structures".
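The rule of thumb can be written as a small helper. The cutoffs of 1 Å and 3 Å are taken from the bullets above, but treating them as sharp boundaries is an assumption of this sketch (the text says "1 Å or so"), and the function name is hypothetical.

```python
def describe_resolution(resolution_angstrom, high_cutoff=1.0, low_cutoff=3.0):
    """Classify a crystallographic resolution value (in Å).

    Smaller values mean more detail, so values at or below `high_cutoff`
    are "high" resolution and values at or above `low_cutoff` are "low".
    """
    if resolution_angstrom <= 0:
        raise ValueError("resolution must be a positive number of Å")
    if resolution_angstrom <= high_cutoff:
        return "high"
    if resolution_angstrom >= low_cutoff:
        return "low"
    return "intermediate"


for value in (0.9, 2.0, 3.5):
    print(value, describe_resolution(value))
```

The cutoffs are keyword arguments so that a looser reading of "1 Å or so" can be tried without changing the function.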
Target-Multiple Template Alignment