
Structural and Functional Bioinformatics

Lecture 3
For 3D structure prediction there are two basic approaches:
(1) compare the protein with proteins of known structure, or
(2) predict the structure from the sequence alone, using physical laws and empirical knowledge.

Item (1) can be subdivided into comparative modeling by (1.1) sequence-sequence comparison (alignment) and comparative modeling by (1.2) sequence-structure comparison (threading).
If, after a global sequence alignment, the identity between two proteins is 25-45%, the two structures are likely to be similar.
When the identity is above about 45%, the structures are expected to be nearly identical, i.e. their backbones superimpose closely.
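
As a rough illustration of the identity measure used above, the following sketch computes percent identity from two already-aligned sequences. The function name and the toy sequences are hypothetical, and the choice to divide by the number of gap-free columns is only one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip columns that contain a gap
        aligned += 1
        identical += x == y
    return 100.0 * identical / aligned if aligned else 0.0


# Toy alignment (hypothetical sequences): 6 identical residues in 7 gap-free columns.
print(round(percent_identity("ACDEFG-H", "ACDQFGKH"), 1))  # 85.7
```

The result can then be compared with the 25% and 45% thresholds quoted above.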

For (1.1), alignment methods such as BLAST are used; for (1.2), different threading methods have been introduced.

Threading methods put a new sequence on a known structure and compute how well the new sequence fits the known structure, e.g. how many hydrophobic amino acids are buried.
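
The burial criterion can be sketched in a few lines. This is a toy illustration, not a real threading program: the hydrophobic residue set, the burial strings ("B" for buried, "E" for exposed) and the template names are all assumptions made up for the example.

```python
# Simplified hydrophobic set (an assumption; classifications vary between scales).
HYDROPHOBIC = set("AVILMFWC")


def buried_hydrophobics(sequence: str, burial: str) -> int:
    """Count hydrophobic residues that land on buried ('B') template positions."""
    if len(sequence) != len(burial):
        raise ValueError("sequence and burial pattern must have the same length")
    return sum(
        1 for aa, env in zip(sequence, burial) if env == "B" and aa in HYDROPHOBIC
    )


# Hypothetical templates, each reduced to a buried/exposed pattern.
templates = {"fold_1": "BEBEBEBE", "fold_2": "EEBBBBEE"}
query = "LKVEIDFS"

scores = {name: buried_hydrophobics(query, env) for name, env in templates.items()}
best = max(scores, key=scores.get)
print(scores)  # {'fold_1': 4, 'fold_2': 2}
print(best)    # fold_1
```

Real threading methods also allow gaps between sequence and template and use richer scoring terms; the sketch only shows the idea of scoring one sequence against several known structures.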
Item (2) includes “ab initio prediction” and molecular
modeling or quantum mechanical modeling.
Tertiary structure

• Tertiary structure refers to the three-dimensional structure of a single protein molecule. The alpha-helices and beta-sheets are folded into a compact globule.

• The folding is driven by the non-specific hydrophobic interactions (the burial of hydrophobic residues from water), but the structure is stable only when the parts of a protein domain are locked into place by specific tertiary interactions, such as salt bridges, hydrogen bonds, and the tight packing of side chains and disulfide bonds.
PROTEIN MODELING

• Predicting the 3D structure of a protein from its amino acid sequence
Computational methods for Protein Modeling

• Homology or Comparative Modeling
• Fold Recognition or Threading Methods
• Ab initio methods, with or without the aid of knowledge-based information
Why do we need computational approaches?
The goal of research in structural genomics is to provide the means to characterize and identify the large number of protein sequences that are being discovered.

Knowledge of the three-dimensional structure
• helps in the rational design of site-directed mutations
• can be of great importance for the design of drugs
• greatly enhances our understanding of how proteins function and how they interact with each other; for example, it can explain antigenic behaviour, DNA binding specificity, etc.

Structural information from X-ray crystallography or NMR
• is obtained much more slowly
• requires elaborate technical procedures
• is unavailable for many proteins, which fail to crystallize at all and/or cannot be obtained or dissolved in large enough quantities for NMR measurements
• is limited by the size of the protein in the case of NMR

With a good computational method, a structural model can be obtained much faster.



Protein Homology modeling

• Homology modeling is an extrapolation of protein structure for a target sequence, using the known 3D structure of a similar sequence as a template.
• Basis: proteins with similar sequences are likely to adopt the same fold.
• Certain proteins with as little as 25% similarity have been observed to adopt the same 3D structure.

Bioinformatics how-to: use publicly available free tools to predict protein structure by comparative modeling
Proteins are 3D objects with complex shapes

• Over 60,000 protein structures have been determined, mostly by X-ray crystallography (PDB)
• The 3D structure of ~70% of bacterial and 50% of human proteins can be predicted (comparative modeling)
A predicted model simply illustrates our assumptions

No assumptions, this is nature telling us how it is

GNAAAAKKGSEQESVKEFLAKAKEDFLKKWENPA
QNTAHLDQFERIKTLGTGSFGRVMLVKHKETGNH
FAMKILDKQKVVKLKQIEHTLNEKRILQAVNFPF
LVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIG
RFSEPHARFYAAQIVLTFEYLHSLDLIYRDLKPE
NLLIDQQGYIQVTDFGFAKRVKGRTWTLCGTPEY
LAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPF
FADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNL
LQVDLTKRFGNLKDGVNDIKNHKWFATTDWIAIY
QRKVEAPFIPKFKGPGDTSNFDDYEEEEIRVSIN
EKCGKEFSEF

Sequence → Assumption (protein A is similar to protein B) → Result (protein A is similar to protein B)
How do we know that these proteins are similar?
• Well studied protein:
SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA
• Unknown protein:
GLLTTKFVSLLQEAKDGVLDLKLAADTLAVRQKRRIYDITNVLEGIGLIEKKSKNSIQW
Similarity → prediction
How can we make such assumptions?
• Statistical reliability of the prediction
• E-value - the number of hits one can "expect" to see just by chance when searching a database of a particular size (the closer to zero the better)
• Z-score - score expressed as a distance from the mean, calculated in standard deviations (the bigger the better)
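
The Z-score definition above can be written out directly. In this minimal sketch the background scores are made-up numbers; in practice they might come from aligning the query against shuffled or unrelated sequences, which is an assumption here rather than something the slides specify.

```python
from statistics import mean, pstdev


def z_score(score: float, background: list) -> float:
    """Distance of a score from the background mean, in standard deviations."""
    sigma = pstdev(background)  # population standard deviation of the background
    if sigma == 0:
        raise ValueError("background scores have zero spread")
    return (score - mean(background)) / sigma


# Hypothetical background scores and a hypothetical hit score of 17.
background_scores = [10, 12, 8, 10]
z = z_score(17, background_scores)
print(round(z, 2))  # 4.95
```

A larger Z-score means the hit stands further above what is seen by chance.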
Similar, but not homologous

• Phosphoribosyltransferase and viral coat protein, identity: 42%, different folds, different functions

99 IRLKSYCNDQSTGDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQY.NPKMVKVASLLVKRTPRSVGY 173
: ||. ||| || |. || | : | | | | || | || |:| | ||.| |
214 VPLKTDANDQ.IGDSLY....SAMTVDDFGVLAVRVVNDHNPTKVT..SKVRIYMKPKHVRV...WCPRPPRAVPY 279

Different, but homologous
• Histone H5 and transcription factor E2F4, identity 7%, similar fold, similar function (DNA binding)

PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL
| | | | |
GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW
COMPARATIVE or HOMOLOGY MODELING
The aim is to build a 3-D model for a protein of unknown structure
(target) on the basis of sequence similarity to proteins of known
structure (templates).

• Accuracy ranges from getting the correct fold to high resolution
• The most accurate structure prediction method. Why?
-- 3D structures of proteins in a given family are more conserved than their sequences
-- ~1/3 of all sequences are recognizably related to at least one known structure
-- the number of unique protein folds is limited
Structure prediction flowchart:

Rob Russell http://speedy.embl-heidelberg.de/gtsp/


Next Class on Monday September
30
Steps

• Identify template(s)
• Initial alignment
• Improve alignment
• Backbone generation
• Loop modelling
• Side chains
• Refinement
• Validation
Steps in Homology Modeling

Figure 5.1.1 from M.A. Marti-Renom and A. Sali, "Modeling Protein Structure from Its Sequence", Current Protocols in Bioinformatics (2003), 5.1.1-5.1.32
Step 1: Template Selection
Explore the following web links for the structure prediction exercise
Databases
GenBank www.ncbi.nlm.nih.gov/GenBank
GeneCensus bioinfo.mbb.yale.edu/genome
MODBASE http://modbase.compbio.ucsf.edu
PDB www.rcsb.org/pdb/
eMOTIF http://brutlag.stanford.edu/projects.html
UniProt http://www.uniprot.org/
Template Search
BLAST http://www.ncbi.nlm.nih.gov/BLAST/
FastA http://www.ebi.ac.uk/fasta33/
SSM http://www.ebi.ac.uk/msd-srv/ssm/
PredictProtein http://www.predictprotein.org/
123D; SARF2; PDP http://123d.ncifcrf.gov/
GenTHREADER http://bioinf.cs.ucl.ac.uk/psipred/
UCLA-DOE http://fold.doe-mbi.ucla.edu/

→ BLAST

The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.

→ FastA (Fasta Protein Database Query)

Provides sequence similarity searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. You can also conduct sequence similarity searching against complete proteome or genome databases using the Fasta programs.

→ 123D+ (Thread 1- to 3-D with 123D+)

123D+ is a program which combines sequence profiles, secondary structure prediction, and contact capacity potentials to thread a protein sequence through the set of structures.

→ PSIPRED server

• PSIPRED - a highly accurate method for protein secondary structure prediction
• MEMSAT2 - a widely used transmembrane topology prediction method
• GenTHREADER - a sequence profile based fold recognition method

→ UCLA-DOE server tools

• FAD or NADP binding Rossmann fold detector
• Fold Recognition
• Internal Repeat Finder
• MOMENT Transmembrane Helix Prediction
• Motif-Based Fold Assignment
• DPANN: Sequence to Structure Alignment
• Profile Search Software: Bowie et al. 1991
Step 2: Sequence Alignment

This is the most crucial step in the process.
Homology modeling cannot recover from a bad alignment.
Sequence Alignment
EMBOSS http://www.ebi.ac.uk/emboss/align/
Tcoffee http://www.igs.cnrs-mrs.fr/Tcoffee
ClustalW http://www.ebi.ac.uk/clustalw/
BCM http://searchlauncher.bcm.tmc.edu/multi-align/
POA http://www.bioinformatics.ucla.edu/poa/
STAMP http://www.ks.uiuc.edu/Research/vmd/
SwissModel http://www.expasy.org/spdbv/

→ EMBOSS (Pairwise Alignment Algorithms)

This tool is used to compare 2 sequences. When you want an alignment that covers the whole length of both sequences, use needle. When you are trying to find the best region of similarity between two sequences, use water.
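
The difference between the two modes can be sketched with a small dynamic-programming routine: needle implements global alignment (Needleman-Wunsch) and water implements local alignment (Smith-Waterman). The code below is a simplified illustration and not the EMBOSS implementation: it returns only the best score, and the match, mismatch and linear gap values are arbitrary assumptions instead of a substitution matrix with affine gaps.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -1, local: bool = False) -> int:
    """Best pairwise alignment score.

    local=False: global alignment over the full length of both sequences.
    local=True:  best-scoring local region (scores are never allowed below 0).
    """
    # First row: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


# Toy sequences (hypothetical): a shared "SWL" core inside unrelated flanks.
print(alignment_score("KKKSWLKKKK", "GGSWLGG", local=True))  # 3
print(alignment_score("KKKSWLKKKK", "GGSWLGG"))              # lower: the flanks are penalized
```

The local score picks out the shared core, while the global score is dragged down by the unrelated flanking residues, which is why the choice between the two modes matters.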

→ Tcoffee

This program is more accurate than ClustalW for sequences with less than 30% identity, but it is slower.

→ ClustalW

ClustalW is a general purpose multiple sequence alignment program for DNA or proteins. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be seen by viewing cladograms or phylograms.

→ POA

Multiple Sequence Alignment Using Partial Order Graphs
Christopher Lee, Catherine Grasso, & Mark Sharlow, Bioinformatics 2002 18:452-464

→ STAMP

STAMP is a suite of programs for the comparison and alignment of protein three dimensional structures.

→ SwissModel

Deep View (Swiss-PdbViewer) by Nicolas Guex, Alexandre Diemand, Manuel C., & Torsten Schwede

Threading
A.A. Substitution Matrix
A C D E F G H I K L M N P Q R S T V W Y
A 5 -2 0 1 -2 0 0 -1 0 -1 0 0 1 0 -1 1 0 0 -2 -2
C -2 8 -2 -3 -3 -2 0 -2 -3 -3 0 -2 -3 -3 -2 -1 -1 -2 -1 -2
D 0 -2 5 2 -2 0 1 -3 0 -2 -1 2 0 1 -2 0 0 -2 -3 -2
E 1 -3 2 5 -3 0 -1 -2 1 -2 -2 1 1 2 0 1 1 -1 -2 -1
F -2 -3 -2 -3 6 -3 1 0 -3 2 2 -3 -2 -3 -2 -1 -2 0 3 3
G 0 -2 0 0 -3 5 -1 -2 0 -2 -2 0 0 -1 0 0 -1 -1 -2 -3
H 0 0 1 -1 1 -1 5 -1 1 -1 0 1 0 1 2 0 1 -1 0 1
I -1 -2 -3 -2 0 -2 -1 5 -2 2 2 -2 -2 -3 -2 -1 0 2 9 0
K 0 -3 0 1 -3 0 1 -2 5 -1 -2 1 0 1 2 0 0 -1 -2 -2
L -1 -3 -2 -2 2 -2 -1 2 -1 5 3 -2 -2 0 -1 -1 0 2 0 0
M 0 0 -1 -2 2 -2 0 2 -2 3 5 -1 -2 0 -2 -1 0 1 -2 -1
N 0 -2 2 1 -3 0 1 -2 1 -2 -1 5 -2 1 0 2 0 -2 -3 -1
P 1 -3 0 1 -2 0 0 -2 0 -2 -2 -2 8 0 0 0 0 -1 -3 -3
Q 0 -3 1 2 -3 -1 1 -3 1 0 0 1 0 5 2 1 0 -1 -1 -2
R -1 -2 -2 0 -2 0 2 -2 2 -1 -2 0 0 2 5 1 0 -1 0 -1
S 1 -1 0 1 -1 0 0 -1 0 -1 -1 2 0 1 1 5 2 -1 0 0
T 0 -1 0 1 -2 -1 1 0 0 0 0 0 0 0 0 2 5 0 -1 -2
V 0 -2 -2 -1 0 -1 -1 2 -1 2 1 -2 -1 -1 -1 -1 0 5 -1 0
W -2 -1 -3 -2 3 -2 0 9 -2 0 -2 -3 -3 -1 0 0 -1 -1 6 3
Y -2 -2 -2 -1 3 -3 1 0 -2 0 -1 -1 -3 -2 -1 0 -2 0 3 6

Examples: F↔F = 6, F↔Y = 3, F↔K = -3
Alignment Matrix

Sequence A: VATTPDKSWLTV
Sequence B: ASTPERASWLGTA

     V  A  T  T  P  D  K  S  W  L  T  V
A    0  5  0  0  1  0  0  1 -2 -1  0  0
S   -1  1  2  2  0  0  0  5  0 -1  2 -1
T    0  0  5  5  0  0  0  2 -1  0  5  0
P   -1  1  0  0  8  0  0  0 -3 -2  0 -1
E   -2  1  1  1  1  2  1  1 -2 -2  1 -2
R   -1 -1  0  0  0 -2  2  1  0 -1  0 -1
A    0  5  0  0  1  0  0  1 -2 -1  0  0
S   -1  1  2  2  0  0  0  5  0 -1  2 -1
W   -1 -2 -1 -1 -3 -3 -2  0  6  0 -1 -1
L    2 -1  0  0 -2 -2 -1 -1  0  5  0  2
G   -1  0 -1 -1  0  0  0  0 -2 -2 -1 -1
T    0  0  5  5  0  0  0  2 -1  0  5  0
A    0  5  0  0  1  0  0  1 -2 -1  0  0

VATTPDK-SWLTV-      VATTPDK-SWL-TV
 |*||** |||          |*||** ||| |*
-ASTPERASWLGTA      -ASTPERASWLGTA
score 39            score 45
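
The two scores can be reproduced by summing substitution values column by column. The sketch below copies the substitution matrix exactly as printed in these slides and assumes that gap columns contribute 0, which is the setting under which the totals come out as 39 and 45; the function and variable names are illustrative only.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

# Substitution values copied as printed in the slide (row label, then 20 scores).
MATRIX_TEXT = """\
A 5 -2 0 1 -2 0 0 -1 0 -1 0 0 1 0 -1 1 0 0 -2 -2
C -2 8 -2 -3 -3 -2 0 -2 -3 -3 0 -2 -3 -3 -2 -1 -1 -2 -1 -2
D 0 -2 5 2 -2 0 1 -3 0 -2 -1 2 0 1 -2 0 0 -2 -3 -2
E 1 -3 2 5 -3 0 -1 -2 1 -2 -2 1 1 2 0 1 1 -1 -2 -1
F -2 -3 -2 -3 6 -3 1 0 -3 2 2 -3 -2 -3 -2 -1 -2 0 3 3
G 0 -2 0 0 -3 5 -1 -2 0 -2 -2 0 0 -1 0 0 -1 -1 -2 -3
H 0 0 1 -1 1 -1 5 -1 1 -1 0 1 0 1 2 0 1 -1 0 1
I -1 -2 -3 -2 0 -2 -1 5 -2 2 2 -2 -2 -3 -2 -1 0 2 9 0
K 0 -3 0 1 -3 0 1 -2 5 -1 -2 1 0 1 2 0 0 -1 -2 -2
L -1 -3 -2 -2 2 -2 -1 2 -1 5 3 -2 -2 0 -1 -1 0 2 0 0
M 0 0 -1 -2 2 -2 0 2 -2 3 5 -1 -2 0 -2 -1 0 1 -2 -1
N 0 -2 2 1 -3 0 1 -2 1 -2 -1 5 -2 1 0 2 0 -2 -3 -1
P 1 -3 0 1 -2 0 0 -2 0 -2 -2 -2 8 0 0 0 0 -1 -3 -3
Q 0 -3 1 2 -3 -1 1 -3 1 0 0 1 0 5 2 1 0 -1 -1 -2
R -1 -2 -2 0 -2 0 2 -2 2 -1 -2 0 0 2 5 1 0 -1 0 -1
S 1 -1 0 1 -1 0 0 -1 0 -1 -1 2 0 1 1 5 2 -1 0 0
T 0 -1 0 1 -2 -1 1 0 0 0 0 0 0 0 0 2 5 0 -1 -2
V 0 -2 -2 -1 0 -1 -1 2 -1 2 1 -2 -1 -1 -1 -1 0 5 -1 0
W -2 -1 -3 -2 3 -2 0 9 -2 0 -2 -3 -3 -1 0 0 -1 -1 6 3
Y -2 -2 -2 -1 3 -3 1 0 -2 0 -1 -1 -3 -2 -1 0 -2 0 3 6
"""

# Build a lookup table: SCORES[(row_residue, column_residue)] -> value.
SCORES = {}
for line in MATRIX_TEXT.strip().splitlines():
    row, *values = line.split()
    for col, value in zip(ALPHABET, values):
        SCORES[row, col] = int(value)


def score_alignment(top: str, bottom: str, gap: int = 0) -> int:
    """Sum substitution scores over all columns; gap columns add `gap` (assumed 0)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return sum(gap if "-" in (a, b) else SCORES[a, b] for a, b in zip(top, bottom))


print(score_alignment("VATTPDK-SWLTV-", "-ASTPERASWLGTA"))  # 39
print(score_alignment("VATTPDK-SWL-TV", "-ASTPERASWLGTA"))  # 45
```

Moving the gap so that the final T pairs with T instead of G is what lifts the score from 39 to 45; with a non-zero gap penalty both totals would drop by the same amount, since each alignment contains three gaps.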
Multiple Sequence Alignment

Sequence A: LTLTLTLT
Sequence B: HAHAHAHAH
Sequence C: THTHTHTHT

A and B alone:
LTLTLTLT-      -LTLTLTLT
HAHAHAHAH      HAHAHAHAH
score -4       score 0

With C:
-LTLTLTLT-
  | | | |
THTHTHTHT-
 | | | |
-HAHAHAHAH

The third sequence from a homologous protein allows alignment.


It’s a very good idea to have more than one template.
