prediction
Comparative/homology modeling
What is the reason for protein
structure prediction?
1. Solving protein structures experimentally is hard
(sometimes impossible)
2. Many predicted structures can be close to solved
structures in accuracy
3. Protein structure can provide important clues to protein
function
– Functional sites (e.g., enzyme active sites and specificity
determinants)
– Protein-protein interaction
– Docking studies (e.g., for drug interaction/design studies)
– See Baker and Sali paper for other uses
2
Assigned Reading for this section
3
Sources for this lecture
Park et al, “Sequence comparisons using multiple sequences detect three times
as many remote homologues as pairwise methods.” JMB 1998
David Baker and Andrej Sali, “Protein Structure Prediction and Structural
Genomics” Science 2001
Chothia and Lesk, “The relation between the divergence of sequence and
structure in proteins”, EMBO Journal 1986
Andrej Sali and Andras Fiser – selected slides from their seminars
Bioinformatics (Baxevanis and Ouellette, previous course text)
Chapter 8: Predictive methods using protein sequences (Ofran and Rost) 198-219
Chapter 9: Protein structure prediction and analysis (Wishart) 224-247
Chapter 12: Creation and analysis of protein multiple sequence alignments (Barton)
Topics Covered
• Folding pathways
• Primary, secondary, tertiary and quaternary protein structure
• Secondary (2D) structure prediction
• 3D fold prediction
– Ab initio protein structure prediction (briefly)
– Fold recognition (classification of an unknown protein to a fold
potentially without constructing a comparative model)
– Comparative model construction (aka homology model construction)
• Community evaluation of protein structure prediction
– Critical Assessment of protein Structure Prediction (CASP)
http://predictioncenter.org/
– EVA (real-time continuous evaluation of protein fold prediction
methods) http://cubic.bioc.columbia.edu/eva/
– Astral datasets
5
• Structural Genomics Initiative
The telescope: Protein structure prediction and
comparison
15% identity between VirB4 &
TrwB
6
Biological background
7
Primary, Secondary, Tertiary and Quaternary Structure
8
Hierarchical descriptions of proteins
(follows the folding process)
• Secondary structure: “regular local structure of linear segments of polypeptide chains” (Creighton)
– Helix (~35% of residues): subtypes: α, π and 3₁₀
– Beta sheet (~25% of residues)
– Both types predicted by Linus Pauling (Corey and Pauling, 1953;
helix first described by Pauling in 1951)
– Other less common structures:
• Beta turns
• 3/10 helices
• Ω loops
– Remaining unclassifiable regions sometimes termed “random coil” or “unstructured regions”
9
Baxevanis & Ouellette (Ch. 9, p.224, Wishart)
Information required for folding is (mostly)
contained in the primary sequence
• Early on, proteins were shown to fold into their native
structures in isolation
• This led to the belief that structure is determined by
sequence alone (Anfinsen, 1973)
• Over the last decade, a significant number of proteins
have been shown to not fold properly in the test tube
(e.g., requiring the assistance of chaperonins)
• Nevertheless, the native 3D structure is assumed to be
in some energetic minimum
• This led to the development of ab initio folding methods
10
Baxevanis & Ouellette (Ch. 9, Wishart)
Folding pathways
• Evidence that local structure segments form first, and
then pack against each other to form 3D fold
– Exploited in protein fold prediction, Rosetta method
• Simons, Bonneau, Ruczinski & Baker (1999). Ab initio Protein
Structure Prediction of CASP III Targets Using ROSETTA. Proteins
11
Baxevanis & Ouellette (Ch. 9, Wishart)
Proteins can diverge structurally and
functionally from a common ancestor
SCOP Scorpion-toxin-related superfamily:
• 1AGT: Agitoxin 2, Egyptian scorpion (K+ channel inhibitor)
• 1MYN: Drosomycin, antifungal protein, fruit fly
• 1CN2: Toxin 2, Mexican scorpion (Na+ channel inhibitor)
• 1BK8: Antimicrobial Protein 1 (Ah-Amp1), common horse chestnut
• 1AYJ: Antifungal protein 1 (RS-AFP1), radish
12
Sequence and structural divergence
are related
Identity 9.8%
Equivalent Residues 40%
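The Chothia and Lesk (1986) paper listed in the sources relates core structural divergence to sequence divergence. A minimal sketch, assuming the exponential fit usually quoted from that paper, Δ = 0.40·exp(1.87·H), where Δ is the core backbone RMSD in Å and H is the fraction of non-identical residues. The function name is hypothetical and the coefficients should be checked against the paper before use:

```python
import math

def expected_core_rmsd(percent_identity: float) -> float:
    """Approximate core backbone RMSD (in Angstroms) at a given % sequence identity.

    Assumes the fit quoted from Chothia & Lesk (1986):
    delta = 0.40 * exp(1.87 * H), with H the fraction of mutated residues.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must be between 0 and 100")
    h = 1.0 - percent_identity / 100.0  # fraction of non-identical residues
    return 0.40 * math.exp(1.87 * h)

for pid in (100, 50, 20):
    print(f"{pid:3d}% identity -> about {expected_core_rmsd(pid):.2f} Angstroms")
```

The fit applies only to the structurally conserved core; the fraction of residues that belong to that core also shrinks as identity drops, which this sketch does not model.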
14
SCOP comparison of 1e9y and 1j79**
1j79 (c.1.9.4) and one domain of 1e9y (d1e9yb2: c.1.9.2)
are placed in the same SCOP superfamily (c.1.9.*)
15
Sequence and structural divergence
are correlated**
Accuracy of sequence alignment relative to structural alignment
Right three columns give Cline Shift scores for pairwise sequence alignments relative to the structural
alignment. The best CS score possible is 1; negative scores indicate incorrect over-alignment with very
few (or no) correctly aligned residue pairs.
16
Assessing sequence alignment
with respect to structural alignment**
Xia Jiang, Duncan Brown, Nandini Krishnamurthy, Kimmen Sjolander
17
Protein “domains” can be defined in many ways
(structural, evolutionary, functional)
[Figure labels: Leucine-Rich Repeat (LRR); Toll-Interleukin Receptor (TIR) domain; Kinase domain]
SCOP: 1196 unique folds (https://scop.berkeley.edu/statistics/ver=2.06)
CATH: 1373 unique folds (http://www.cathdb.info)
This only counts the number of folds found in current solved structures – it
does not count the folds that exist in nature (which may be hard to solve or
which crystallographers haven’t yet tried to solve)!
20
Major protein structure resources
21
SCOP and CATH structure hierarchies**
22
3D protein structure superposition**
• Example tools: J-FATCAT, CE, VAST.
• Used to evaluate protein 3D structure prediction
– Compare homology models against solved structure (e.g., CASP)
• To evaluate assertions of (distant) homology
– Can be used to rule out homology (if structurally dissimilar)
– Structural similarity does not automatically support homology
• see convergent evolution
• Used to organize protein structures into hierarchies
– E.g., SCOP and CATH
• Used to evaluate sequence alignment accuracy
– Some (not all) MSA benchmarks use pairwise structural alignments (multiple structural alignment is more
complicated) – but some benchmarks include proteins that do not have solved structures
– Even among homologous proteins, some regions may superpose poorly. Structural aligners can disagree on how
to align these regions. Benchmarking approaches may use consensus approaches across multiple structural
aligners
– See discussion of these benchmarking issues in Pevsner.
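Evaluating a model against a solved structure usually comes down to an RMSD after optimal superposition. A minimal sketch of the Kabsch superposition, assuming the equivalent residue pairs are already known and given as two (N, 3) arrays of Cα coordinates. Tools such as FATCAT, CE and VAST also have to find those equivalences, which this sketch does not attempt; the function name is hypothetical:

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")
    # Remove the translation by centering both sets on their centroids
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its singular value decomposition
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Flip one axis if needed so the result is a rotation, not a reflection
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate P onto Q and measure the residual deviation
    P_rot = Pc @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))

# Usage: a rotated and shifted copy of a structure superposes with RMSD near zero
coords = np.random.default_rng(0).normal(size=(10, 3))
angle = np.pi / 6
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle), np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
moved = coords @ rot_z.T + np.array([1.0, -2.0, 3.0])
print(kabsch_rmsd(coords, moved))
```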
23
Protein Data Bank (PDB) resources
• Repository of solved structures
– Browse data online or download coordinates for viewing/interacting
locally (using tools such as Pymol and Jmol that you can download
onto your computer)
24
Structure Superposition
@ RCSB/PDB
25
http://www.rcsb.org/pdb/secondary.do?p=v2/secondary/analyze.jsp#Sequence
FATCAT structural alignment of
Scorpion toxin and Drosomycin
26
FATCAT structural alignment
Drosomycin & radish antifungal protein
Pairwise alignment shows few insertions & deletions
27
VAST Structural Alignment
at NCBI
http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
Then select PDB structures for which you want to see a structural alignment
29
VAST alignment**
30
Structural Genomics Initiative
31
Structural Genomics**
34
Principles of Protein Structure Prediction
[Figure: a target sequence (GFCHIKAYTRLIMVG…) approached through folding (ab initio prediction) or through evolution (fold recognition, comparative modeling); example species: Desulfovibrio vulgaris, Anacystis nidulans, Chondrus crispus, Anabaena 7120]
Andras Fiser, Albert Einstein College of Medicine
Comparative Protein Structure Modeling
[Figure: comparative modeling of a target sequence (KIGIFFSTSTGNTTEVA…) in the flavodoxin family (Anacystis nidulans, Anabaena 7120, Chondrus crispus, Desulfovibrio vulgaris, Clostridium mp.). Axes: % sequence identity (20, 50, 100) against Cα RMSD in Å with % equivalent positions in parentheses: 2 (50), 1 (80), 0 (100)]
Andras Fiser, Albert Einstein College of Medicine
The utility of a comparative model depends on its accuracy**
Errors related to sequence and structural divergence (recall Chothia and Lesk
analyses of structural conservation, and alignment errors as a function of
sequence divergence)
Accuracy and applicability are limited by our understanding of the protein folding problem
Accuracy and applicability are rather limited by the number of known folds
Andras Fiser, Albert Einstein College of Medicine
Overview of structure prediction methods and
pros and cons**
• Ab initio methods (simulate folding process)
– Limited to short sequences (<100aa)
– Generally not as accurate as comparative models
• Comparative modeling
– High accuracy for closely related target (protein of unknown structure) and template PDB
structure (if >50%ID)
– Model accuracy degrades with evolutionary distance between the target and template
– Major errors due to:
• Alignment errors
• Non-superposable positions
• Low resolution templates
• Hybrid approach: Best methods combine the best of both to piece together the
structural puzzle
– comparative modeling for homologous segments of structure (predicted by sequence similarity)
– ab initio techniques for apparently divergent sections
– libraries of structural fragments (loops, etc.)
39
Steps in Comparative Protein Structure Modeling
[Flowchart: START → Template Search → Target-Template Alignment → Model Building → Model Evaluation → OK? (No: return to an earlier step; Yes: END). Example TARGET and TEMPLATE sequence fragments shown: ASILPKRLFGNCEQTSDEGLK, IERTPLVPHISAQNVCLKIDD, VPERLIPERASFQWMNDK]
Andras Fiser, Albert Einstein College of Medicine
Typical Errors in Comparative Models**
[Figure: two example errors, “Incorrect template” and “Misalignment”; superposed structures are labelled MODEL, X-RAY and TEMPLATE, shown beside the modeling flowchart]
Andras Fiser, Albert Einstein College of Medicine
Comparative model evaluation
• Stereochemistry (PROCHECK, WHATCHECK)
• Environment (Profiles3D, Verify3D)
• Statistical potentials based methods (PROSAII)
[Flowchart: START → Template Search → Target-Template Alignment → Model Building]
48
Baxevanis & Ouellette (Ch. 9, Wishart)
Basic types of secondary
structure**
• Helices (α and others)
– α is most common; 3.6 residues/turn
– Side chains project outward
– Structure is stabilized by hydrogen bonds between the
carbonyl (CO) group of one amino acid and the amino (NH) group
of the amino acid that is 4 positions C-terminal to it
• β-Strands (two or more strands interact to form a β-sheet)
• Other (sometimes called loop, coil, or non-regular)
• Most secondary structure prediction methods classify
residues to one of three states
49
Baxevanis & Ouellette (Ch. 9, Wishart)
Early methods
50
Early schemes used observed preferences **
51
http://www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm
Amino acid propensities for
different structural environments
• Propensities are weak but contribute to prediction accuracy
– E.g., Glu (E) occurs in alpha helices only 59% more frequently than
random
• Helical propensities
– Partial charge of helix dipole favors
• Acidic Asp (D) and Glu (E) residues at N-terminus of helices
• Basic Lys (K), Arg (R) and His (H) residues at C-terminus
– Pro (P) residues are more common at the N-terminal first turn of helix
– Asn (N), Asp (D), Ser (S) and Thr (T) residues often occur at first turn
of helix (side chain hydrogen bonding to backbone of third residue)
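A propensity is the ratio of how often a residue occurs in a given state to how often it occurs overall; the Glu figure above (59% more frequent than random) corresponds to a helix propensity of about 1.59. A minimal sketch of that calculation, assuming label strings with "H" for helix; the toy sequences and labels are made up for illustration and the function name is hypothetical:

```python
from collections import Counter

def helix_propensities(sequences, labels):
    """Propensity = P(residue | helix) / P(residue), over all labelled positions."""
    all_counts, helix_counts = Counter(), Counter()
    for seq, lab in zip(sequences, labels):
        if len(seq) != len(lab):
            raise ValueError("sequence and label strings must have equal length")
        for aa, state in zip(seq, lab):
            all_counts[aa] += 1
            if state == "H":
                helix_counts[aa] += 1
    n_all = sum(all_counts.values())
    n_helix = sum(helix_counts.values())
    if n_helix == 0:
        raise ValueError("no helical positions in the labelled data")
    return {
        aa: (helix_counts[aa] / n_helix) / (all_counts[aa] / n_all)
        for aa in all_counts
    }

# Toy, made-up data: H = helix, C = other
toy_seqs = ["EEAAGG", "EAGGGG"]
toy_labels = ["HHHHCC", "HCCCCC"]
toy_propensities = helix_propensities(toy_seqs, toy_labels)
print(toy_propensities)
```

Values above 1 mean the residue is enriched in helices relative to its background frequency; values below 1 mean it is depleted.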
52
Creighton, Proteins
The next generation of 2ary
structure prediction**
Improved performance through:
•Use of homologs
•Peer pressure (window)
•Better training sets
• Learning secondary structure preferences from expanded data sets: More recent
prediction schemes take advantage of larger data sets to examine amino acid
preference for different regions in a helix or different positions in a tight turn.
• Improved accuracy: The accuracy of prediction has risen from about 55% using the
simple Chou-Fasman method, where the tendency is to overpredict, to almost 80%
using current methods.
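"Peer pressure" means that the state assigned to a residue is influenced by its neighbors in a window. A minimal sketch of one simple form of this, a majority vote over a centered window applied to a per-residue prediction string. The window size of 5 and the tie rule are assumptions for illustration, not the settings of any published method:

```python
from collections import Counter

def smooth_labels(labels: str, window: int = 5) -> str:
    """Replace each label with the most common label in a centered window."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    out = []
    for i, current in enumerate(labels):
        # The window is truncated at the sequence ends
        neighborhood = labels[max(0, i - half): i + half + 1]
        counts = Counter(neighborhood)
        best = max(counts.values())
        winners = [state for state, n in counts.items() if n == best]
        # On a tie, keep the current label so smoothing stays conservative
        out.append(current if current in winners else winners[0])
    return "".join(out)

print(smooth_labels("HHHCHHHCCCECCC"))  # isolated C and E calls are absorbed
```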
55
http://www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm
Amino acid patterns indicative of
β-strand structures
• Short runs of conserved hydrophobic residues
– Buried β-strand
• An i, i+2, i+4 pattern of conserved hydrophobic
residues suggests a surface β-strand.
– Hydrophobic residues will face the interior of the
molecule, not the surface
• Conserved residues sharing the same
physicochemical properties are likely to form
one face of a strand.
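The two patterns above can be scanned for directly. A minimal sketch, assuming a single sequence rather than conserved alignment columns and an assumed hydrophobic residue set (other groupings exist); both function names are hypothetical:

```python
HYDROPHOBIC = set("AVILMFWC")  # assumed grouping, for illustration only

def surface_strand_candidates(seq: str) -> list[int]:
    """0-based start indices where i, i+2, i+4 are hydrophobic and i+1, i+3 are not."""
    hits = []
    for i in range(len(seq) - 4):
        window = seq[i:i + 5]
        inward = all(window[k] in HYDROPHOBIC for k in (0, 2, 4))
        outward = all(window[k] not in HYDROPHOBIC for k in (1, 3))
        if inward and outward:
            hits.append(i)
    return hits

def buried_strand_candidates(seq: str, min_run: int = 4) -> list[tuple[int, int]]:
    """Half-open (start, end) spans of at least min_run consecutive hydrophobic residues."""
    spans, start = [], None
    for i, aa in enumerate(seq + "-"):  # sentinel closes a run at the sequence end
        if aa in HYDROPHOBIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                spans.append((start, i))
            start = None
    return spans

print(surface_strand_candidates("GVTVKVSG"))
print(buried_strand_candidates("GGVILVAGG"))
```

Applying the same scan to columns of a multiple alignment, scoring a column as hydrophobic only when it is conserved as such, would be closer to what the slide describes.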
56
Baxevanis & Ouellette (Ch. 12, Barton)
Amino acid patterns indicative of
α-helical structures
57
Baxevanis & Ouellette (Ch. 12, Barton)
Identifying loop/coil regions
58
Baxevanis & Ouellette (Ch. 12, Barton)
Amino acid preferences for different secondary structures
(and identifying loops/turns)
59
http://www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm
Machine learning methods of
secondary structure prediction
• Based on machine learning concepts
– Training set: learn implicit rules, principles and model parameters from labelled data
(sequences whose secondary structures are known for each position)
– Used machine learning method called artificial neural networks
(designed to simulate biological neural networks in the brain)
– PHDsec (Rost et al 1994, Rost et al 1996)
60
Baxevanis & Ouellette Ch 8 (Ofran and Rost)
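Neural network predictors of this kind take a window of residues around each position as input and emit a state for the central residue. A minimal sketch of that input encoding, assuming one-hot encoding of a single sequence and a window of 13 residues; both are illustrative assumptions, and methods such as PHDsec use alignment-derived profiles rather than a single sequence:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_windows(seq: str, window: int = 13) -> np.ndarray:
    """One-hot encode a centered window around each residue.

    Returns an array of shape (len(seq), window * 21). The 21st channel marks
    padding beyond the sequence ends and residues outside the 20 standard ones.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    n_channels = len(AMINO_ACIDS) + 1
    X = np.zeros((len(seq), window, n_channels), dtype=np.float32)
    for i in range(len(seq)):
        for w, j in enumerate(range(i - half, i + half + 1)):
            if 0 <= j < len(seq) and seq[j] in AA_INDEX:
                X[i, w, AA_INDEX[seq[j]]] = 1.0
            else:
                X[i, w, n_channels - 1] = 1.0
    return X.reshape(len(seq), window * n_channels)

features = encode_windows("KIGIFFSTSTGNTTEVA")
print(features.shape)  # one row per residue, ready for a classifier
```

Each row would be paired with the known state of its central residue to form the labelled training set.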
Neural Network for
Protein Structure Prediction**
61
Ofran and Rost commentary
(from Baxevanis text)
62
Key to success in machine learning
algorithms**
63
Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Assessing performance evaluations**
64
Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Other problems with comparing
different methods**
• Performance reported in literature can take different forms
– Accuracy and coverage
– Positive (or negative) predictive power
– Sensitivity and specificity
– Machine learning terms (e.g., Matthews coefficients)
– Wilcoxon paired score signed rank tests
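Several of these measures can be computed from the same pair of label strings, which makes it easier to compare numbers reported in different forms. A minimal sketch, assuming three-state strings (H, E, C) of equal length; the example strings are made up and the function names are hypothetical:

```python
import math

def q3(true: str, pred: str) -> float:
    """Fraction of residues assigned the correct state."""
    if len(true) != len(pred) or not true:
        raise ValueError("label strings must be non-empty and of equal length")
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def binary_metrics(true: str, pred: str, state: str = "H") -> dict:
    """Sensitivity, specificity, PPV and Matthews coefficient for one state."""
    if len(true) != len(pred):
        raise ValueError("label strings must have equal length")
    pairs = list(zip(true, pred))
    tp = sum(t == state and p == state for t, p in pairs)
    tn = sum(t != state and p != state for t, p in pairs)
    fp = sum(t != state and p == state for t, p in pairs)
    fn = sum(t == state and p != state for t, p in pairs)

    def ratio(a, b):
        # Undefined ratios are reported as NaN rather than raising
        return a / b if b else float("nan")

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, denom),
    }

observed = "HHHHCCEECC"   # toy reference labels
predicted = "HHHCCCEEEC"  # toy prediction
print(q3(observed, predicted))
print(binary_metrics(observed, predicted, state="H"))
```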
65
Baxevanis & Ouellette Ch 8 (Ofran and Rost)
How do the methods compare? **
• Best methods now reach 76% accuracy at 3-state
prediction (helix, strand, random coil)
– Rost 2001
– See EVA website for detailed comparisons
• Metaservers:
– Consensus approaches combining weighted predictions
from different servers
– These almost always outperform individual methods
– Shown in both CASP and EVA
66
Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Caveats**
• Even when an experimental structure is available, it is
sometimes unclear where one secondary structure element
ends and another begins
• Low-confidence predictions (and regions of disagreement
across servers) can correspond to structurally ambiguous
regions
• Real-life example: Prion protein (involved in bovine spongiform
encephalopathy, Creutzfeldt-Jakob disease, etc).
– Region assumed to be responsible for aggregation believed to flip from
experimentally determined helical structure to (predicted) strand in
diseased individuals
– All the best secondary structure prediction methods predict this region to
be beta (“incorrect”)
67
Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Secondary structure prediction
programs
• PSI-PRED (David Jones; makes use of distant
homologs detected using PSI-BLAST - most popular)
• JNET (Cuff & Barton)
• PHD (Rost & Sander)
68
Baxevanis & Ouellette Ch 8 (Ofran and Rost)
PSIPRED
69
Consensus (or jury) methods
70
Consensus and jury approaches
produce best results**
• Primary conclusion of CASP experiments:
– Structure prediction meta-servers (which combine results from several
independent prediction methods) have the highest accuracy
– improving both secondary and tertiary structure prediction
• This kind of consensus approach can be applied to both the template
selection and the pairwise alignment between the target and template
• Related to machine learning task of combining multiple weak learners into a
robust system
• Key trick: figuring out how to weight each prediction source (or “learner”),
and potentially to use a context-dependent weighting scheme
• HW2 (Fall 2016) asks you to use a consensus approach from multiple
webservers and databases to predict the multi-domain architecture of a
protein.
71
Example consensus approach
to secondary structure prediction
72
Overview and issues***
• Overview: multiple secondary structure predictors combined into one consensus prediction
– In general: applied to a single sequence
– May exploit information from homologs (e.g., PSI-PRED, using PSI-BLAST to retrieve
and align homologs to distinguished sequence)
– A consensus is derived across these predicted labels (typically, a weighted majority
rule, in which the weights are determined on a reserved training set)
• Benchmarking: What is the gold standard for evaluating secondary structure prediction?
– 3D structure examination
• Limitations of using a simple majority rule consensus
– Smoothing techniques such as peer pressure or windowing are still needed
– Possible correlation among predictors (uniform weighting problematic)
• Other issues:
– Including homologs improves performance accuracy, but keep in mind that secondary
structures can diverge over evolutionary distance, so sequence weighting techniques
should upweight the contributions of more closely related sequences.
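The weighted majority rule described above can be sketched as follows. The predictor names and weights are hypothetical; in practice the weights would be fitted on a reserved training set, and correlated predictors would need more care than a simple sum of weights:

```python
from collections import defaultdict

def weighted_consensus(predictions: dict[str, str], weights: dict[str, float]) -> str:
    """Per-position weighted vote across predictors' label strings."""
    if not predictions:
        raise ValueError("need at least one prediction")
    lengths = {len(p) for p in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for i in range(lengths.pop()):
        votes = defaultdict(float)
        for name, labels in predictions.items():
            # Predictors without an explicit weight count as 1.0
            votes[labels[i]] += weights.get(name, 1.0)
        # Sorting the states makes tie-breaking deterministic
        consensus.append(max(sorted(votes), key=votes.__getitem__))
    return "".join(consensus)

toy_predictions = {
    "predictor_a": "HHHHCCEE",
    "predictor_b": "HHCCCCEE",
    "predictor_c": "HHHCCEEE",
}
toy_weights = {"predictor_a": 0.6, "predictor_b": 0.2, "predictor_c": 0.3}
print(weighted_consensus(toy_predictions, toy_weights))
print(weighted_consensus(toy_predictions, {}))  # uniform weights
```

The two calls disagree at the fourth position, where one heavily weighted predictor outvotes the other two; this is the behavior that makes the choice of weights the key design decision.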
73
Figure 13.1, Pevsner
Secondary structure prediction for human beta globin
(HBB_HUMAN, NP_000539)
Evaluating secondary structure
prediction accuracy
• Compare Figure 13.1 (pg 592) of Pevsner showing secondary structure
prediction for human beta globin (hemoglobin subunit beta) with the
secondary structure labelling of the same sequence in SwissProt*
• Focus on the region from 21-35 (VDE…LVV)
• SwissProt: The region from 21-35 is labelled as helical (based on manual
examination of structural evidence; see next slide) –
– but not all positions in HBB_HUMAN have secondary structure labels!
75
SwissProt record for
HBB_HUMAN
77
Following the link to the supporting
evidence from structure
78
Note that this structure has several chains (corresponding to different subunits)
79
PDB structural annotations
80
3D-structure prediction
82
Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Threading**
• Limited to generating approximate models or suggesting
approximate folds
– >5 Angstroms for 3D threading
– >3 Angstroms for 2D threading
• Name based on “threading” a tube (called a snake) through
a plumbing system.
• Each unique threading of a sequence through the 3D
model can be evaluated using empirically derived energy
function or measure of packing efficiency
• Sequences can be scored based on how well they fit the
model (i.e., the best score achievable)
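Scoring a threading amounts to placing the sequence on the template positions and summing a pseudo-energy over the template's contacts. A minimal sketch, assuming ungapped placements only and a toy two-class contact potential whose values are invented for illustration; real methods use empirically derived potentials and allow gaps:

```python
HYDROPHOBIC = set("AVILMFWC")  # assumed grouping for the toy potential

def pair_energy(a: str, b: str) -> float:
    """Toy contact pseudo-energy (hypothetical values; lower is better)."""
    ha, hb = a in HYDROPHOBIC, b in HYDROPHOBIC
    if ha and hb:
        return -1.0  # hydrophobic pair in contact is rewarded
    if ha != hb:
        return 0.5   # mixed pair is penalized
    return 0.0

def threading_score(residues: str, contacts: list[tuple[int, int]]) -> float:
    """Sum pair energies over template contacts for residues placed on the template."""
    return sum(pair_energy(residues[i], residues[j]) for i, j in contacts)

def best_ungapped_threading(seq: str, template_len: int, contacts: list[tuple[int, int]]):
    """Slide seq along the template without gaps; return (best_score, offset)."""
    if len(seq) < template_len:
        raise ValueError("sequence is shorter than the template")
    scored = [
        (threading_score(seq[off:off + template_len], contacts), off)
        for off in range(len(seq) - template_len + 1)
    ]
    return min(scored)

# A 4-position template in which positions 0 and 3 are in contact
print(best_ungapped_threading("KVGGAK", 4, [(0, 3)]))
```

The best score over all placements is what the slide refers to as how well a sequence fits the model.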
83
Baxevanis & Ouellette Ch 9 (Wishart)
Three-dimensional threading***
• First described by Novotny et al (1984)
• Rediscovered in early 1990s
– Jones et al 1992; Sippl & Weitckus 1992; Bryant & Lawrence 1993
– Based largely on heuristic contact potentials (interactions between pairs of
residues)
– 3D coordinates of theoretical structure (based on threading of sequence
through PDB structure model) used to evaluate predicted contacts and derive a
fitness score based on a pseudoenergy function
• Powerful for predicting 3D structure of unknown proteins, and for
evaluating structure of known proteins
• Limitations found in this method:
– interactions are not always conserved between distant homologs
– Computational complexity (very slow)
– Modest accuracy (early methods ignored amino acid information; model
accuracy >5 Angstroms)
84
Baxevanis & Ouellette Ch 9 (Wishart)
Contact maps**
• 2D plots of distances between C-alpha atoms of all pairs of residues
– Observed interactions between amino acids used to form “contact potentials”
for 3D threading methods
Figure 6.14
85
Creighton, Proteins Ch. 6
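A contact map of this kind can be computed from Cα coordinates alone. A minimal sketch, assuming an (N, 3) NumPy array of Cα positions, an 8 Å distance cutoff and a minimum sequence separation of 3 residues; the cutoff and separation are common choices but are assumptions here, not values from the slide:

```python
import numpy as np

def ca_distance_matrix(ca_coords: np.ndarray) -> np.ndarray:
    """All-against-all Euclidean distances between C-alpha atoms."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def contact_map(ca_coords: np.ndarray, cutoff: float = 8.0,
                min_separation: int = 3) -> np.ndarray:
    """Boolean (N, N) matrix: True where two residues are within the cutoff."""
    dist = ca_distance_matrix(ca_coords)
    idx = np.arange(len(ca_coords))
    # Ignore residue pairs that are close only because they are sequence neighbors
    far_in_sequence = np.abs(idx[:, None] - idx[None, :]) >= min_separation
    return (dist <= cutoff) & far_in_sequence

# Toy example: five atoms on a straight line, 3.8 Angstroms apart
line = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
print(contact_map(line, cutoff=8.0, min_separation=2).astype(int))
```

Coordinates for a real structure would come from the CA atom records of a PDB entry; parsing is left out to keep the sketch self-contained.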
Two-dimensional threading**
• Sequence-profile methods; most often based exclusively on amino acid
similarity (e.g., PSI-BLAST, most HMMs) to score and align proteins
• Improved accuracy through combined use of 2ary structure prediction
(matching predicted secondary structure of target to predicted or known
secondary structure labels of template) and amino acid similarity;
• Predicted solvent accessibility is also occasionally included in these systems
(and compared against known structural environments)
• Advantage: Much faster than standard 3D threading
• Model accuracy good but not excellent (RMSD >3 Angstroms)
– However, for model construction for proteins with no close homologs with solved
structure, these methods are among the best
• Examples:
– UCSC SAM-T99 (two-track HMMs), PHYRE, FUGUE
86
Baxevanis & Ouellette Ch 9 (Wishart)
Assessing method performance
88
Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Critical Assessment of Protein
Structure Prediction (CASP)
Kryshtafovych et al, “Progress over the First
Decade of CASP Experiments” Proteins:
Structure, Function and Genetics 2005
90
Rosetta/Robetta
91
Red=first model
Yellow=models 2-5
Black=other groups
93
Red=first model
Yellow=models 2-5
Black=other groups
94
95
96
Selected protein structure
prediction servers
• Superfamily (Sequence-profile alignment; UCSD,
MRC/Cambridge, U. Bristol UK)
– http://supfam.org/SUPERFAMILY/index.html
• PHYRE (Profile-profile alignment; Imperial College of London)
– Recommended
– http://www.sbg.bio.ic.ac.uk/phyre/
• SwissModel (Swiss Institute of Bioinformatics)
– http://swissmodel.expasy.org//SWISS-MODEL.html
• MODBASE (precomputed models; Sali lab at UCSF)
– http://modbase.compbio.ucsf.edu/modbase-cgi/index.cgi
97
Summary (1)***
• Experimental determination of protein structure is
expensive and not always straightforward
98
Summary (2) ***
• Ab initio methods of protein fold prediction use physics-based energy minimization to simulate
the process of protein folding
– These methods are generally less successful than homology-based fold prediction (limited to short
peptides/small proteins)
– Exception: Rosetta/I-sites methods (Baker group) which employ both types of approach
– Threading approaches are sometimes attempted to predict structure for non-homologous molecules
(but this is rarely very successful)
99
Summary (3) ***
• Community assessment of 2D and 3D structure prediction uses various approaches
– EVA and LiveBench (continuous real-time assessment of methods)
– CASP (Critical Assessment of Protein Structure Prediction)
– Benchmark datasets (e.g., Astral PDB40 for fold recognition by Park et al)
• Fold prediction (ignoring the comparative model construction) is fairly accurate for the best servers,
provided that:
– A homologous structure has already been deposited in the PDB
– That structure can be detected with a significant E-value using sequence information alone (e.g., by
PSI-BLAST)
• The inclusion of 2ary structure prediction (e.g., in 2D profiles) can improve the alignment and give a
modest boost to fold recognition accuracy when %ID is very low, but must be integrated with
sequence similarity appropriately to avoid errors in prediction
100
Questions on the reading
• David Baker and Andrej Sali, “Protein Structure Prediction and Structural Genomics” Science 2001
1. What is the single most significant source of error in a comparative
model construction, if based on a template with <30% identity with the
target but that is detectable by BLAST (with a significant E-value)?
2. What is an additional probable source of error if the percent identity
drops below 20% (detectable by PSI-BLAST with a significant E-value)?
3. What is the reason cited by Baker and Sali for why errors in a
comparative model tend to not lie in functionally important sites such as
an enzyme active site?
4. What example do Baker and Sali give to demonstrate the utility of a low-accuracy
comparative model?
5. How would protein-protein interaction interfaces be predicted with a
comparative model?
6. What are the possible applications of a comparative model?
7. What fraction (approximately) of comparative models produced by
Rosetta for proteins <150 residues in length are considered accurate?
8. Does model refinement improve models or not? Under what conditions?
101