
Protein structure prediction
Comparative/homology modeling
What is the reason for protein
structure prediction?
1. Solving protein structures experimentally is hard
(sometimes impossible)
2. Many predicted structures can be close to solved
structures in accuracy
3. Protein structure can provide important clues to protein
function
– Functional sites (e.g., enzyme active sites and specificity
determinants)
– Protein-protein interaction
– Docking studies (e.g., for drug interaction/design studies)
– See Baker and Sali paper for other uses

Assigned Reading for this section

From Pevsner text:


•Chapter 6 Multiple Sequence Alignment 205-227, 234 (Read-by: 9/27)
•Chapter 13 Protein Structure 589-625 (Read-by: 9/29)

•“Protein Structure Prediction and Structural Genomics”, David Baker and Andrej Sali, Science 2001. (Read-by: 9/29)

Sources for this lecture
Park et al, “Sequence comparisons using multiple sequences detect three times
as many remote homologues as pairwise methods.” JMB 1998
David Baker and Andrej Sali, “Protein Structure Prediction and Structural
Genomics” Science 2001
Chothia and Lesk, “The relation between the divergence of sequence and
structure in proteins”, EMBO Journal 1986
Andrej Sali and Andras Fiser – selected slides from their seminars
Bioinformatics (Baxevanis and Ouellette, previous course text)
Chapter 8: Predictive methods using protein sequences (Ofran and Rost) 198-219
Chapter 9: Protein structure prediction and analysis (Wishart) 224-247
Chapter 12: Creation and analysis of protein multiple sequence alignments (Barton)
Topics Covered
• Folding pathways
• Primary, secondary, tertiary and quaternary protein structure
• Secondary (2D) structure prediction
• 3D fold prediction
– Ab initio protein structure prediction (briefly)
– Fold recognition (classification of an unknown protein to a fold
potentially without constructing a comparative model)
– Comparative model construction (aka homology model construction)
• Community evaluation of protein structure prediction
– Critical Assessment of protein Structure Prediction (CASP)
http://predictioncenter.org/
– EVA (real-time continuous evaluation of protein fold prediction
methods) http://cubic.bioc.columbia.edu/eva/
– Astral datasets
• Structural Genomics Initiative
The telescope: Protein structure prediction and comparison

Protein structure prediction and comparison provides a kind of Hubble telescope enabling distant homologies to be revealed

[Figure: VirB4 model compared with the TrwB PDB structure; 15% identity between VirB4 & TrwB]

Biological background

And major protein structure resources

Primary, Secondary, Tertiary and Quaternary Structure

Hierarchical descriptions of proteins
(follows the folding process)

• Primary structure: the amino acid sequence

• Secondary structure: “regular local structure of linear segments of polypeptide chains” (Creighton)
– Helix (~35% of residues): subtypes: α, π and 3₁₀
– Beta sheet (~25% of residues)
– Both types predicted by Linus Pauling (Corey and Pauling, 1953;
α helix first described by Pauling in 1951)
– Other less common structures:
• Beta turns
• 3₁₀ helices
• Ω loops
– Remaining unclassifiable regions sometimes termed “random coil” or “unstructured regions”

• Tertiary structure: “Overall topology of the folded polypeptide chain” (Creighton)


– Mediated by hydrophobic interactions between distant parts of protein

• Quaternary structure: “Aggregation of the separate polypeptide chains of a protein” (Creighton)

Baxevanis & Ouellette (Ch. 9, p.224, Wishart)
Information required for folding is (mostly)
contained in the primary sequence
• Early on, proteins were shown to fold into their native
structures in isolation
• This led to the belief that structure is determined by
sequence alone (Anfinsen, 1973)
• Over the last decade, a significant number of proteins
have been shown to not fold properly in the test tube
(e.g., requiring the assistance of chaperonins)
• Nevertheless, the native 3D structure is assumed to be
in some energetic minimum
• This led to the development of ab initio folding methods

Baxevanis & Ouellette (Ch. 9, Wishart)
Folding pathways
• Evidence that local structure segments form first, and
then pack against each other to form 3D fold
– Exploited in protein fold prediction, Rosetta method
• Simons, Bonneau, Ruczinski & Baker (1999). Ab initio Protein
Structure Prediction of CASP III Targets Using ROSETTA. Proteins

• Semi-stable structural intermediates on folding pathway to lowest-energy conformation
– Prof. Susan Marqusee, Berkeley

Baxevanis & Ouellette (Ch. 9, Wishart)
Proteins can diverge structurally and
functionally from a common ancestor
SCOP Scorpion-toxin-related superfamily:
• 1AGT: Agitoxin 2, Egyptian scorpion (K+ channel inhibitor)
• 1MYN: Drosomycin, antifungal protein, fruit fly
• 1CN2: Toxin 2, Mexican scorpion (Na+ channel inhibitor)
• 1BK8: Antimicrobial Protein 1 (Ah-Amp1), common horse chestnut
• 1AYJ: Antifungal protein 1 (RS-AFP1), radish
Sequence and structural divergence
are related

“The relation between the divergence of sequence and structure in proteins”, Chothia and Lesk. EMBO Journal 1986
Structural alignment example
ID EC Function
1E9Y 3.5.1.5 Urease
1J79 3.5.2.3 Dihydroorotase

Identity 9.8%
Equivalent Residues 40%
SCOP comparison of 1e9y and 1j79**
1j79 (c.1.9.4) and one domain of 1e9y (d1e9yb2: c.1.9.2)
are placed in the same SCOP superfamily (c.1.9.*)

Sequence and structural divergence
are correlated**
Accuracy of sequence alignment relative to structural alignment

Left three columns show results of structural alignment


%ID: Structure pairs have been placed into bins based on sequence identity given the structural alignment
#pair: number of pairs in each bin
%Superpos: percent of positions that are within ~3 Angstroms RMSD (between backbone C-alpha carbons)

Right three columns give Cline shift (CS) scores for pairwise sequence alignments relative to the structural
alignment. The best CS score possible is 1; negative scores indicate incorrect over-alignment with very
few (or no) correctly aligned residue pairs.
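
A minimal sketch of how a %Superpos-style number could be computed, assuming the two structures have already been superposed and that the structural alignment gives a one-to-one list of C-alpha coordinates. The function names and the 3 Angstrom default are illustrative choices, not the procedure used to build the table.

```python
import math

def percent_superposable(coords_a, coords_b, cutoff=3.0):
    """Percent of aligned C-alpha pairs that lie within `cutoff` Angstroms.

    Assumes the coordinate lists are already superposed and paired
    position-by-position by a structural alignment.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two equal-length, non-empty coordinate lists")
    close = sum(1 for a, b in zip(coords_a, coords_b) if math.dist(a, b) <= cutoff)
    return 100.0 * close / len(coords_a)

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation over the paired C-alpha positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two equal-length, non-empty coordinate lists")
    squared = [math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))

# Toy coordinates (hypothetical, not taken from any PDB entry).
toy_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
toy_b = [(0, 0, 1), (1, 0, 0), (2, 0, 5), (3, 2, 0)]
print(percent_superposable(toy_a, toy_b), rmsd(toy_a, toy_b))
```

Real structural aligners also have to find the superposition and the residue pairing; this sketch only covers the final scoring step.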
Assessing sequence alignment
with respect to structural alignment**
Xia Jiang Duncan Brown Nandini Krishnamurthy Kimmen Sjolander

Protein “domains” can be defined in many ways
(structural, evolutionary, functional)

Pfam “domains” sometimes (but not always) correspond to structural domains
Proteins are composed of modular structural domains which are found in
different domain architectures produced by gene fusion and fission
events

Leucine-Rich
Repeat (LRR)

Toll-Interleukin
Receptor (TIR)
domain

Kinase domain

Promiscuous domains complicate homolog detection and function prediction


How many unique folds are there?

SCOP: 1196
unique folds

https://scop.berkeley.edu/statistics/ver=2.06

CATH: 1373
unique folds

http://www.cathdb.info
This only counts the number of folds found in current solved structures – it
does not count the folds that exist in nature (which may be hard to solve or
which crystallographers haven’t yet tried to solve)!
Major protein structure resources

SCOP and CATH structure hierarchies**

• SCOP: class, fold, superfamily, family


• Classification of individual structural domains (independently folding
globular building blocks)
• Placement in the same SCOP fold:
– implies similar topology and overall “shape”;
– may or may not have a common ancestor
• Same superfamily:
– more restrictive; implies a common ancestor
– Inferred based on various analyses (including evidence from PSI-BLAST,
HMMs, functional similarity)
• Same family
– Even more restrictive; generally implies a similar function

3D protein structure superposition**
• Example tools: J-FATCAT, CE, VAST.
• Used to evaluate protein 3D structure prediction
– Compare homology models against solved structure (e.g., CASP)
• To evaluate assertions of (distant) homology
– Can be used to rule out homology (if structurally dissimilar)
– Structural similarity does not automatically support homology
• see convergent evolution
• Used to organize protein structures into hierarchies
– E.g., SCOP and CATH
• Used to evaluate sequence alignment accuracy
– Some (not all) MSA benchmarks use pairwise structural alignments (multiple structural alignment is more
complicated) – but some benchmarks include proteins that do not have solved structures
– Even among homologous proteins, some regions may superpose poorly. Structural aligners can disagree on how
to align these regions. Benchmarking approaches may use consensus approaches across multiple structural
aligners
– See discussion of these benchmarking issues in Pevsner.

Protein Data Bank (PDB) resources
• Repository of solved structures
– Browse data online or download coordinates for viewing/interacting
locally (using tools such as PyMOL and Jmol that you can download
onto your computer)

• Structure comparison/superposition tools


– View structure superposition and also pairwise sequence
alignment derived from structure superposition

• Structure visualization tools

Structure Superposition
@ RCSB/PDB

http://www.rcsb.org/pdb/secondary.do?p=v2/secondary/analyze.jsp#Sequence
FATCAT structural alignment of
Scorpion toxin and Drosomycin

FATCAT structural alignment
Drosomycin & radish antifungal protein

Pairwise alignment shows few insertions & deletions

VAST Structural Alignment
at NCBI

Type in the PDB structure ID of interest.

http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
Then select PDB structures for which you want to see a structural alignment

VAST alignment**

• How to read the VAST alignment


– Pile-up alignment (master-slave): additional structures are aligned to the query, but not to each
other
– Grey characters are not part of the consensus “core structure” defining all structures
– Upper-case characters indicate structural equivalence between query and other structure(s)
– Lower-case characters indicate non-equivalence
• Structural alignments are affected by the resolution of each structure
– these are (mostly) NMR structures, “blurrogram” (only 1SN4 is an X-ray crystal structure)
• Structural alignment can disagree with sequence-based alignment
– Note that the motif “TWSG” is found in both 1BK8 and 1AYJ, but they are not aligned by
VAST

Structural Genomics Initiative

Structural Genomics**

Characterize most protein sequences (red) based on related known structures (green).

The number of “folds” is much smaller than the number of proteins

Andras Fiser, Albert Einstein College of Medicine


Why Protein Structure Prediction?

18,420,180 UniRef50 clusters (50% identity and 80% overlap)


34,509 clusters in PDB (at 50%ID, 90% overlap)
(http://www.rcsb.org/pdb/statistics/clusterStatistics.do)

#PDB structures/#sequences: ~0.0019 (34,509 / 18,420,180)

It’s much harder to solve protein 3D structures than to sequence a new genome (and most “new”
structures are similar to existing structures)
Overview of protein 3D structure
prediction methods and principles

Including major sources of error

Principles of Protein Structure Prediction

[Figure: a target sequence (GFCHIKAYTRLIMVG…) is related to structure along two routes labelled “folding” (ab initio prediction) and “evolution” (fold recognition, comparative modeling); example structures from Desulfovibrio vulgaris, Anacystis nidulans, Chondrus crispus and Anabaena 7120]
Andras Fiser, Albert Einstein College of Medicine
Comparative Protein Structure Modeling

[Figure: flavodoxin family structures (Anacystis nidulans, Anabaena 7120, Chondrus crispus, Desulfovibrio vulgaris, Clostridium mp.) and a target sequence (KIGIFFSTSTGNTTEVA…) placed along a % sequence identity axis (20, 50, 100), with the corresponding Cα RMSD in Å (% EQV): 2 (50), 1 (80), 0 (100)]
Andras Fiser, Albert Einstein College of Medicine
The utility of a comparative model depends on its accuracy**
Errors related to sequence and structural divergence (recall Chothia and Lesk
analyses of structural conservation, and alignment errors as a function of
sequence divergence)

David Baker and Andrej Sali, Protein Structure Prediction and Structural Genomics, Science 2001
Protein structure modeling**

Ab initio prediction
• Applicable to any sequence
• Not very accurate (>4 Ang RMSD)
• Attempted for proteins of <100 residues
• Accuracy and applicability are limited by our understanding of the protein folding problem

Comparative Modeling
• Applicable only to those sequences that share recognizable similarity to a template structure (KS note: advanced techniques enable detection of very distant homologies)
• Fairly accurate (<3 Ang RMSD), typically comparable to a low resolution X-ray experiment. (KS note: model accuracy drops with sequence divergence, although active sites are often correctly modeled and the overall fold can be roughly correct)
• Not limited by size
• Accuracy and applicability are rather limited by the number of known folds
Andras Fiser, Albert Einstein College of Medicine
Overview of structure prediction methods and
pros and cons**
• Ab initio methods (simulate folding process)
– Limited to short sequences (<100aa)
– Generally not as accurate as comparative models
• Comparative modeling
– High accuracy for closely related target (protein of unknown structure) and template PDB
structure (if >50%ID)
– Model accuracy degrades with evolutionary distance between the target and template
– Major errors due to:
• Alignment errors
• Non-superposable positions
• Low resolution templates
• Hybrid approach: Best methods combine the best of both to piece together the
structural puzzle
– comparative modeling for homologous segments of structure (predicted by sequence similarity)
– ab initio techniques for apparently divergent sections
– libraries of structural fragments (loops, etc.)

Steps in Comparative Protein Structure Modeling
START → Template Search → Target–Template Alignment → Model Building → Model Evaluation → OK? (No: iterate; Yes: END)

TARGET sequence:
ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPERASFQWMNDK

Target–Template alignment (target above, template below):
ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPE
MSVIPKRLYGNCEQTSEEAIRIEDSPIV---TADLVCLKIDEIPERLVGE
Andras Fiser, Albert Einstein College of Medicine
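
As a rough illustration of the alignment step, the sketch below computes percent identity for a pair of already-aligned sequences, using the target and template rows from the slide. The function name is hypothetical, and dividing by the number of gap-free columns is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

# Target and template rows as shown on the slide.
target = "ASILPKRLFGNCEQTSDEGLKIERTPLVPHISAQNVCLKIDDVPERLIPE"
template = "MSVIPKRLYGNCEQTSEEAIRIEDSPIV---TADLVCLKIDEIPERLVGE"
print(round(percent_identity(target, template), 1))
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, so reported identities should always state the denominator.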
Typical Errors in Comparative Models**
• Incorrect template
• Misalignment
• Region without a template
• Distortion in correctly aligned regions
• Side chain packing

[Figure panels compare MODEL, X-RAY and TEMPLATE structures]
Andras Fiser, Albert Einstein College of Medicine



Template identification
• Fast but less sensitive: e.g. BLAST
• Better: Intermediate sequence search
• Even better: Profile/HMM and iterative search methods (e.g. PSI-BLAST)
– Searching against libraries of HMMs and profiles for solved structures
• Profile-profile alignment (e.g., HHalign, PHYRE)
– Including 2ary structure prediction
• Structure-based threading
Target-template alignment
• Note that the methods for identifying candidate templates normally produce an alignment
– but these alignments are unlikely to be optimal
• The alignment method used must be tuned to the level of evolutionary divergence between the target and template
• Manual refinement/editing of the alignment is often used to improve the comparative model
Constructing a comparative model
• Rigid Body Assembly (COMPOSER)
• Segment Matching (SEGMOD, 3DPSSM)
• Satisfaction of Spatial Restraints (MODELLER)
• Integrated (NEST)
• Loop modeling, side chain modeling
Andras Fiser, Albert Einstein College of Medicine
Comparative model evaluation
• Stereochemistry (PROCHECK, WHATCHECK)
• Environment (Profiles3D, Verify3D)
• Statistical potentials based methods (PROSAII)

Is the model reliable?
A model is reliable when it is based on a correct template and on an approximately correct alignment.
Andras Fiser, Albert Einstein College of Medicine
Secondary Structure Prediction
Why is secondary structure
prediction important?

• Secondary structure diverges less rapidly than primary sequence
– Knowledge or prediction of 2ary structure
improves detection and alignment of remote
homologs
• 3d-pssm, PHYRE, SAM T02 (fold prediction servers)

Baxevanis & Ouellette (Ch. 9, Wishart)
Basic types of secondary
structure**
• Helices (α and others)
– α is most common; 3.6 residues/turn
– Side chains project outward
– Structure is stabilized by hydrogen bonds between the
carbonyl (CO) group of one amino acid and the amino (NH) group
of the amino acid that is 4 positions C-terminal to it
• β-Strands (two or more strands interact to form a β-sheet)
• Other (sometimes called loop, coil, or non-regular)
• Most secondary structure prediction methods classify
residues to one of three states

Baxevanis & Ouellette (Ch. 9, Wishart)
Early methods

Limited accuracy (both due to methodological reasons and because of limited data)

Early schemes used observed preferences **

• Various schemes give the amino acids numerical weights or rankings for their preferences, and several computer programs
can predict the secondary structure from the given sequence.

• Preferences are weak, but provide some signal

• The simplest such scheme, that of Chou and Fasman, Ann. Rev. Biochem. (1978), examined the statistical distribution of amino
acids in alpha helix, beta sheet and turns or loops, using a set of
known protein structures from the Protein Data Bank.

• A novel sequence can then be scanned, and the tendency of each portion of the sequence to form secondary structure is assessed.
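
The idea of statistical preferences can be sketched as follows, assuming a small labelled training set of (sequence, secondary-structure string) pairs. The propensity here is P(state | residue) / P(state); the toy data, function names and window width are illustrative assumptions and do not reproduce the published Chou-Fasman parameters.

```python
from collections import Counter

def state_propensities(examples, state):
    """Propensity of each amino acid for `state`: P(state | aa) / P(state)."""
    total, in_state = Counter(), Counter()
    for seq, labels in examples:
        if len(seq) != len(labels):
            raise ValueError("sequence and labels must have the same length")
        for aa, label in zip(seq, labels):
            total[aa] += 1
            if label == state:
                in_state[aa] += 1
    n_state = sum(in_state.values())
    if n_state == 0:
        return {}
    background = n_state / sum(total.values())
    return {aa: (in_state[aa] / total[aa]) / background for aa in total}

def scan(sequence, propensity, width=6):
    """Average propensity in each window; unseen residues count as neutral (1.0)."""
    return [
        sum(propensity.get(aa, 1.0) for aa in sequence[i:i + width]) / width
        for i in range(len(sequence) - width + 1)
    ]

# Hypothetical labelled data: H = helix, C = coil.
toy = [("AEAG", "HHHC"), ("GAEG", "CHHC")]
helix = state_propensities(toy, "H")
print(helix, scan("AAGG", helix, width=2))
```

A propensity above 1 means the residue is seen in that state more often than expected by chance; values near 1 carry little signal, which is why these preferences are described as weak.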

http://www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm
Amino acid propensities for
different structural environments
• Propensities are weak but contribute to prediction accuracy
– E.g., Glu (E) occurs in alpha helices only 59% more frequently than
random
• Helical propensities
– Partial charge of helix dipole favors
• Acidic Asp (D) and Glu (E) residues at N-terminus of helices
• Basic Lys (K), Arg (R) and His (H) residues at C-terminus
– Pro (P) residues are more common at the N-terminal first turn of helix
– Asn (N), Asp (D), Ser (S) and Thr (T) residues often occur at first turn
of helix (side chain hydrogen bonding to backbone of third residue)

Creighton, Proteins
The next generation of 2ary
structure prediction**
Improved performance through:
•Use of homologs
•Peer pressure (window)
•Better training sets

Integration of conservation and residue patterns in prediction
•Exposed/surface vs buried (hydrophobicity, amphiphilic patterns)
•Periodicity of conservation patterns (differentiate between helical and strand)
Secondary structure prediction accuracy
is boosted by using homologs**
• Labeling residues in a sequence as α-helix, β-sheet or turn/coil (3-state prediction).
• Accuracy of prediction enhanced by ~6% when multiple sequence
alignments are used vs the use of a single sequence (Cuff &
Barton, 1999)
• Best methods for 2ary structure prediction -- PSIPRED (Jones
1999) and JNET (Cuff & Barton, unpublished)
– Make use of homologs obtained using PSI-BLAST
– Have roughly 76% or better accuracy for 3-state prediction
– Provide confidence values for each position

How is this result related to the findings of Park et al?


Baxevanis & Ouellette (Ch. 12, Barton)
Improving secondary structure prediction using
evolutionary conservation and “peer pressure” **
• Peer pressure (pressure from the neighbors): A minimum of 4 amino acids out of
6 should show alpha preference, or 3 out of 5 beta preference, or clusters of 2-3
breakers in a sequence of 4 are needed to set the secondary structure in any region,
and individual misfits adopt the secondary structure of their neighbours.

• Learning secondary structure preferences from expanded data sets: More recent
prediction schemes take advantage of larger data sets to examine amino acid
preference for different regions in a helix or different positions in a tight turn.

• Up-weighting conserved residues: In addition, sequences of homologous proteins may be compared. The rationale is that highly conserved amino acids contribute
more to the three dimensional structure than unconserved, and different weightings
can be introduced to the statistical analysis. **

• Improved accuracy: The accuracy of prediction has risen from about 55% using the
simple Chou-Fasman method, where the tendency is to overpredict, to almost 80%
using current methods.
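
A deliberately simplified sketch of the "peer pressure" idea: a single residue whose label disagrees with two agreeing neighbours adopts their label. This is an assumption-laden stand-in for the 4-of-6 and 3-of-5 rules quoted above, not an implementation of them; the function name is hypothetical.

```python
def smooth_isolated(labels):
    """Relabel single-residue misfits whose left and right neighbours agree.

    Neighbours are read from the original string, so the result does not
    depend on the order in which positions are visited.
    """
    out = list(labels)
    for i in range(1, len(labels) - 1):
        left, right = labels[i - 1], labels[i + 1]
        if left == right and labels[i] != left:
            out[i] = left
    return "".join(out)

print(smooth_isolated("CCCHCCCEEEHEEE"))
```

Wider windows and state-specific thresholds, as in the rules on the slide, would follow the same pattern with a count over the window instead of a two-neighbour check.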

http://www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm
Amino acid patterns indicative of
β-strand structures
• Short runs of conserved hydrophobic residues
– Buried β-strand
• An i, i+2, i+4 pattern of conserved hydrophobic
residues suggests a surface β-strand.
– Hydrophobic residues will face the interior of the
molecule, not the surface
• Conserved residues sharing the same
physicochemical properties are likely to form
one face of a strand.

Baxevanis & Ouellette (Ch. 12, Barton)
Amino acid patterns indicative of
α-helical structures

• Conservation patterns of i, i+3, i+4, i+7 and variations (e.g., i, i+4, i+7) suggest an alpha helix

• Amphiphilic/amphipathic conservation patterns (alternating hydrophobic and polar residues) following
an i, i+3, i+4, i+7 pattern (and variations, e.g., i, i+4,
i+7) are likely to represent surface helices
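
The periodic patterns for strands (i, i+2, i+4) and helices (i, i+3, i+4, i+7) can be searched for with a small scan such as the one below. It is a sketch on a single sequence using a hypothetical hydrophobic residue set; the slides refer to conservation patterns across an alignment, which would need per-column conservation scores instead.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed residue set, for illustration only

def pattern_hits(sequence, offsets, residues=HYDROPHOBIC):
    """Start positions i where every position i+k (k in offsets) is in `residues`."""
    span = max(offsets)
    return [
        i for i in range(len(sequence) - span)
        if all(sequence[i + k] in residues for k in offsets)
    ]

# Strand-like periodicity (every second residue) and helix-like periodicity.
print(pattern_hits("LKLKLKK", (0, 2, 4)))
print(pattern_hits("LKKLLKKL", (0, 3, 4, 7)))
```

On real data the same scan would be applied to a per-column indicator (for example "conserved and hydrophobic") derived from a multiple sequence alignment.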

Baxevanis & Ouellette (Ch. 12, Barton)
Identifying loop/coil regions

• Insertions and deletions are not well tolerated in the hydrophobic core.
– Regions of an MSA that include many gap characters
are likely to indicate surface loops.
– Also look for small polar residues such as S
• Glycine and proline residues can be found in
any secondary structure.
– However, conserved glycine/proline residues are
strongly suggestive of loops.
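
The gap-based loop heuristic can be sketched by computing the fraction of gap characters in each alignment column. The 0.5 threshold and the function names are illustrative assumptions; the rows of the toy MSA are made up.

```python
def gap_fractions(msa, gap="-"):
    """Fraction of gap characters in each column of an MSA (list of equal-length rows)."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("MSA must be non-empty with rows of equal length")
    return [sum(1 for ch in column if ch == gap) / len(msa) for column in zip(*msa)]

def gappy_columns(msa, threshold=0.5):
    """Column indices whose gap fraction is at least `threshold` (candidate loops)."""
    return [i for i, frac in enumerate(gap_fractions(msa)) if frac >= threshold]

toy_msa = ["AC-DE", "A--DE", "ACGDE"]
print(gap_fractions(toy_msa), gappy_columns(toy_msa))
```

Runs of adjacent gappy columns, rather than single columns, are the more convincing signal for a surface loop.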

Baxevanis & Ouellette (Ch. 12, Barton)
Amino acid preferences for different secondary structures
(and identifying loops/turns)

http://www.chembio.uoguelph.ca/educmat/phy456/456lec01.htm
Machine learning methods of
secondary structure prediction
• Based on machine learning
concepts
– Training set: learn implicit rules,
principles and model parameters
from labelled data (sequences
whose secondary structures are
known for each position)
– Used machine learning method
called artificial neural networks
(designed to simulate biological
neural networks in the brain)
– PHDsec (Rost et al 1994, Rost et
al 1996)

Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Neural Network for
Protein Structure Prediction**

Ofran and Rost commentary
(from Baxevanis text)

Key to success in machine learning
algorithms**

• “The success of machine learning algorithms depends on the careful choice of the biologically
based features used for training… and a sufficiently
large and accurate training set”
• To enhance prediction accuracy on novel data,
training data diversity is also critical
• Exploit knowledge that local environment is
important: to predict 2ary structure of residue ‘i’,
consider all residues in a window around i: i-n, … i,
… i+n.
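
One common way to present a window i-n … i+n to a learner is a one-hot encoding of each window position, sketched below. The 20-letter alphabet ordering, the extra padding slot and the function name are assumptions for illustration; methods such as PHDsec use profile columns rather than single-sequence one-hot vectors.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_features(sequence, i, n):
    """One-hot features for residues i-n .. i+n around position i.

    Each window position contributes 21 values: 20 amino acids plus one slot
    that marks padding beyond the sequence ends or an unknown residue.
    """
    features = []
    for j in range(i - n, i + n + 1):
        one_hot = [0] * (len(AMINO_ACIDS) + 1)
        if 0 <= j < len(sequence) and sequence[j] in AMINO_ACIDS:
            one_hot[AMINO_ACIDS.index(sequence[j])] = 1
        else:
            one_hot[-1] = 1
        features.extend(one_hot)
    return features

vector = window_features("ACD", 0, 1)
print(len(vector), sum(vector))
```

The resulting vector for residue i would be paired with that residue's known secondary structure label to form one training example.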

Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Assessing performance evaluations**

• “Overall, the correct evaluation of performance for prediction methods is an art in itself; only a handful of
methods turned out over time to not have been
overestimated by their developers.”
– Evaluation must be performed on a standard dataset
– Training and test data should be rigorously kept separate
– Standard deviations of estimates should be provided

Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Other problems with comparing
different methods**
• Performance reported in literature can take different forms
– Accuracy and coverage
– Positive (or negative) predictive power
– Sensitivity and specificity
– Machine learning terms (e.g., Matthews coefficients)
– Wilcoxon paired score signed rank tests

• Or might be based on different criteria for success
– per residue
– per secondary structure element
– per protein

• Others measure performance only in cases where a prediction has high confidence (with a likelihood of a lower FP rate)
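
Two of the per-residue measures mentioned above can be sketched as follows: three-state accuracy (often called Q3) and a Matthews correlation coefficient for a single state. The label strings are made-up examples and the function names are hypothetical.

```python
import math

def q3(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two equal-length, non-empty label strings")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

def matthews(predicted, observed, state):
    """Matthews correlation coefficient for one state versus all others."""
    tp = fp = tn = fn = 0
    for p, o in zip(predicted, observed):
        if p == state and o == state:
            tp += 1
        elif p == state:
            fp += 1
        elif o == state:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

pred, obs = "HHHCCEEC", "HHCCCEEE"
print(q3(pred, obs), matthews(pred, obs, "H"))
```

Per-segment and per-protein scores need a different unit of counting, which is one reason numbers from different papers are hard to compare directly.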

Baxevanis & Ouellette Ch 8 (Ofran and Rost)
How do the methods compare? **
• Best methods now reach 76% accuracy at 3-state
prediction (helix, strand, random coil)
– Rost 2001
– See EVA website for detailed comparisons
• Metaservers:
– Consensus approaches combining weighted predictions
from different servers
– These almost always outperform individual methods
– Shown in both CASP and EVA

Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Caveats**
• Even when an experimental structure is available, it is
sometimes unclear where one secondary structure element
ends and another begins
• Low-confidence predictions (and regions of disagreement
across servers) can correspond to structurally ambiguous
regions
• Real-life example: Prion protein (involved in bovine spongiform
encephalopathy, Creutzfeldt-Jakob disease, etc).
– Region assumed to be responsible for aggregation believed to flip from
experimentally determined helical structure to (predicted) strand in
diseased individuals
– All the best secondary structure prediction methods predict this region to
be beta (“incorrect”)

Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Secondary structure prediction
programs
• PSIPRED (David Jones; makes use of distant
homologs detected using PSI-BLAST - most popular)
• JNET (Cuff & Barton)
• PHD (Rost & Sander)

Baxevanis & Ouellette Ch 8 (Ofran and Rost)
PSIPRED

Consensus (or jury) methods

And metaserver techniques

Consensus and jury approaches
produce best results**
• Primary conclusion of CASP experiments:
– Structure prediction meta-servers (which combine results from several
independent prediction methods) have the highest accuracy
– improving both secondary and tertiary structure prediction
• This kind of consensus approach can be applied to both the template
selection and the pairwise alignment between the target and template
• Related to machine learning task of combining multiple weak learners into a
robust system
• Key trick: figuring out how to weight each prediction source (or “learner”),
and potentially to use a context-dependent weighting scheme
• HW2 (Fall 2016) asks you to use a consensus approach from multiple
webservers and databases to predict the multi-domain architecture of a
protein.

Example consensus approach
to secondary structure prediction

And additional summary comments

Overview and issues***
• Overview: multiple secondary structure predictors combined into one consensus prediction
– In general: applied to a single sequence
– May exploit information from homologs (e.g., PSIPRED, using PSI-BLAST to retrieve
and align homologs to distinguished sequence)
– A consensus is derived across these predicted labels (typically, a weighted majority
rule, in which the weights are determined on a reserved training set)
• Benchmarking: What is the gold standard for evaluating secondary structure prediction?
– 3D structure examination
• Limitations of using a simple majority rule consensus
– Smoothing techniques such as peer pressure or windowing are still needed
– Possible correlation among predictors (uniform weighting problematic)
• Other issues:
– Including homologs improves performance accuracy, but keep in mind that secondary
structures can diverge over evolutionary distance, so sequence weighting techniques
should upweight the contributions of more closely related sequences.
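
A weighted majority rule over several predictors can be sketched as below. The predictor names, label strings and weights are hypothetical; in practice the weights would be fitted on a reserved training set, and ties are broken here alphabetically purely for determinism.

```python
from collections import defaultdict

def weighted_consensus(predictions, weights):
    """Per-position weighted vote over equal-length label strings.

    predictions: dict of predictor name -> label string (e.g. 'HHEEC')
    weights:     dict of predictor name -> non-negative weight
    """
    lengths = {len(labels) for labels in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    consensus = []
    for i in range(lengths.pop()):
        votes = defaultdict(float)
        for name, labels in predictions.items():
            votes[labels[i]] += weights[name]
        # sorted() makes tie-breaking deterministic (alphabetical order)
        consensus.append(max(sorted(votes), key=votes.get))
    return "".join(consensus)

preds = {"server_a": "HHC", "server_b": "HEC", "server_c": "CEC"}
print(weighted_consensus(preds, {"server_a": 0.6, "server_b": 0.3, "server_c": 0.2}))
print(weighted_consensus(preds, {"server_a": 1.0, "server_b": 1.0, "server_c": 1.0}))
```

Uniform weights treat correlated predictors as independent votes, which is the problem noted above; the consensus string would still normally be smoothed afterwards.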

Figure 13.1, Pevsner
Secondary structure prediction for human beta globin
(HBB_HUMAN, NP_000539)
Evaluating secondary structure
prediction accuracy
• Compare Figure 13.1 (pg 592) of Pevsner showing secondary structure
prediction for human beta globin (hemoglobin subunit beta) with the
secondary structure labelling of the same sequence in SwissProt*
• Focus on the region from 21-35 (VDE…LVV)
• SwissProt: The region from 21-35 is labelled as helical (based on manual
examination of structural evidence; see next slide) –
– but not all positions in HBB_HUMAN have secondary structure labels!

• Consensus labelling: the core region (26-32) is predicted consistently as helical (and is in the consensus), but flanking regions are often mislabelled
(as strand or random coil)
– Does it make structural sense to have a single residue that is helical in a region marked
as random coil? (Remember “peer pressure” technique described earlier)

SwissProt record for
HBB_HUMAN

Note: not all residues are labelled


SwissProt record for HBB_HUMAN
(residues 21-35 asserted to be helical)

Following the link to the supporting
evidence from structure

Note that this structure has several chains (corresponding to different subunits)

PDB structural annotations

Bottom line: The second rectangle from the left represents the helix from 21-35.

3D-structure prediction

From Baxevanis & Ouellette Ch 8 (Ofran and Rost)
3D structure prediction**
• Decompose into two subtasks
– Fold assignment (or fold recognition) “Protein X is related by evolution to
structure Y”
• Assumed evolutionary relationship is used to infer a similarity in 3D fold (but no
comparative model construction)
• Can be achieved by pairwise sequence comparison, scoring a sequence against a
library of profiles or HMMs, and by other methods
• Newer “threading” methods can enable correct fold recognition in the Twilight Zone
– Comparative model construction
• May be restricted to higher sequence identity (e.g., above 30%) due to the
likelihood of serious alignment error below this range.
– Some servers do both
• 3d-pssm/PHYRE, Superfamily, etc.

Baxevanis & Ouellette Ch 8 (Ofran and Rost)
Threading**
• Limited to generating approximate models or suggesting
approximate folds
– >5 Angstroms for 3D threading
– >3 Angstroms for 2D threading
• Name based on “threading” a tube (called a snake) through
a plumbing system.
• Each unique threading of a sequence through the 3D
model can be evaluated using empirically derived energy
function or measure of packing efficiency
• Sequences can be scored based on how well they fit the
model (i.e., the best score achievable)

Baxevanis & Ouellette Ch 9 (Wishart)
Three-dimensional threading***
• First described by Novotny et al (1984)
• Rediscovered in early 1990s
– Jones et al 1992; Sippl & Weitckus 1992; Bryant & Lawrence 1993
– Based largely on heuristic contact potentials (interactions between pairs of
residues)
– 3D coordinates of theoretical structure (based on threading of sequence
through PDB structure model) used to evaluate predicted contacts and derive a
fitness score based on a pseudoenergy function
• Powerful for predicting 3D structure of unknown proteins, and for
evaluating structure of known proteins
• Limitations found in this method:
– interactions are not always conserved between distant homologs
– Computational complexity (very slow)
– Modest accuracy (early methods ignored amino acid information; model
accuracy >5 Angstroms)
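
The scoring step of 3D threading can be sketched as a sum of pairwise pseudo-energies over the residue pairs that are in contact in the template. Everything concrete here is a hypothetical illustration: the contact list, the toy potential values and the gapless threading are assumptions, and real methods also search over alignments with gaps.

```python
def threading_score(sequence, contacts, potential, default=0.0):
    """Sum pair pseudo-energies over template contacts for a gapless threading.

    contacts:  list of (i, j) template positions that are in contact
    potential: dict mapping a sorted residue pair, e.g. ('K', 'L'), to a score
               (lower is taken to be more favourable)
    """
    score = 0.0
    for i, j in contacts:
        if i >= len(sequence) or j >= len(sequence):
            continue  # contact falls outside the threaded sequence
        pair = tuple(sorted((sequence[i], sequence[j])))
        score += potential.get(pair, default)
    return score

# Made-up contact potential and template contacts, for illustration only.
toy_potential = {("L", "L"): -1.0, ("K", "L"): 0.5, ("D", "K"): -0.8}
toy_contacts = [(0, 3), (1, 2), (0, 1)]
print(threading_score("LKDL", toy_contacts, toy_potential))
```

Scoring the same sequence against each template in a fold library and keeping the most favourable score is the basic fold-recognition loop.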

Baxevanis & Ouellette Ch 9 (Wishart)
Contact maps**
• 2D plots of distances between C-alpha atoms of all pairs of residues
– Observed interactions between amino acids used to form “contact potentials” for 3D threading methods
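
A contact map can be derived from C-alpha coordinates by thresholding the pairwise distance matrix, as in the sketch below. The 8 Angstrom cutoff is a commonly used but arbitrary assumption, and the coordinates are invented.

```python
import math

def distance_matrix(ca_coords):
    """All-against-all distances between C-alpha coordinates."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]

def contact_map(ca_coords, cutoff=8.0):
    """Boolean matrix: True where two C-alpha atoms are within `cutoff` Angstroms."""
    return [[d <= cutoff for d in row] for row in distance_matrix(ca_coords)]

# Hypothetical coordinates for three residues.
toy_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(toy_ca):
    print(row)
```

Contacts between residues that are close in sequence are trivially present, so analyses usually focus on pairs separated by several positions along the chain.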

Figure 6.14
Creighton, Proteins Ch. 6
Two-dimensional threading**
• Sequence-profile methods; most often based exclusively on amino acid
similarity (e.g., PSI-BLAST, most HMMs) to score and align proteins
• Improved accuracy through combined use of secondary structure prediction
(matching predicted secondary structure of target to predicted or known
secondary structure labels of template) and amino acid similarity
• Predicted solvent accessibility is also occasionally included in these systems
(and compared against known structural environments)
• Advantage: Much faster than standard 3D threading
• Model accuracy good but not excellent (RMSD >3 Angstroms)
– However, for model construction for proteins with no close homologs with solved
structure, these methods are among the best
• Examples:
– UCSC SAM-T99 (two-track HMMs), PHYRE, FUGUE

Baxevanis & Ouellette Ch 9 (Wishart)
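To make the combination of amino acid similarity and secondary structure concrete, here is a minimal sketch of scoring an existing gap-free alignment. It uses simple identity in place of a substitution matrix or profile, and the weight `ss_weight` is a hypothetical parameter; real two-track methods are considerably more elaborate.

```python
def column_score(target_res, template_res, target_ss, template_ss, ss_weight=0.5):
    """Score one aligned column from sequence and secondary structure agreement.

    Identity (1.0 or 0.0) stands in for a substitution-matrix or profile
    score; ss_weight is an assumed mixing weight between the two terms.
    """
    seq_term = 1.0 if target_res == template_res else 0.0
    ss_term = 1.0 if target_ss == template_ss else 0.0
    return (1.0 - ss_weight) * seq_term + ss_weight * ss_term


def alignment_score(target_seq, template_seq, target_ss, template_ss, ss_weight=0.5):
    """Sum column scores over a gap-free alignment of equal-length strings."""
    lengths = {len(target_seq), len(template_seq), len(target_ss), len(template_ss)}
    if len(lengths) != 1:
        raise ValueError("all four strings must have the same length")
    return sum(
        column_score(a, b, s, t, ss_weight)
        for a, b, s, t in zip(target_seq, template_seq, target_ss, template_ss)
    )
```

Setting `ss_weight=0.0` reduces this to a sequence-only score, which makes it easy to see how much the secondary structure term changes the ranking of candidate templates.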
Assessing method performance

• Astral benchmark datasets
– Park et al
• CASP experiments
• EVA and LiveBench
– Continuous evaluation of webservers
– Still being used?
The EVA server

• Continuous assessment of the predictions of automatic servers,
applying the same measurements, the same standards, and the
same sequences to all methods
• New structures (pre-release to PDB) given to EVA by
participating structural biologists. EVA submits the amino acid
sequences to online servers.
• Predictions stored until release of 3D coordinates to PDB. Then
the predicted (2D or 3D) structures can be compared against
the solved structures, and given various scores.
• Approach enables the community to compare methods, and
gives developers concrete feedback that is critical for method
improvement.

Baxevanis & Ouellette Ch 8 (Ofran and Rost)
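One of the simplest scores for comparing a predicted 2D (secondary) structure against the solved one is the per-residue three-state accuracy, often called Q3. The sketch below assumes both structures are given as equal-length strings over H (helix), E (strand), and C (coil); it is an illustration, not a description of how EVA itself computes its scores.

```python
def q3(predicted, observed):
    """Return the percentage of residues whose predicted state matches.

    Both arguments are strings of secondary structure states
    (for example 'H', 'E', 'C'), one character per residue.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty structure")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)
```

Because every server is scored on the same sequences with the same function, the resulting percentages can be compared directly across methods.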
Critical Assessment of Protein
Structure Prediction (CASP)
Kryshtafovych et al, “Progress over the First
Decade of CASP Experiments” Proteins:
Structure, Function, and Bioinformatics 2005
Rosetta/Robetta

[Figure legend: Red = first model; Yellow = models 2-5; Black = other groups]

Selected protein structure
prediction servers
• Superfamily (Sequence-profile alignment; UCSD,
MRC/Cambridge, U. Bristol UK)
– http://supfam.org/SUPERFAMILY/index.html
• PHYRE (Profile-profile alignment; Imperial College London)
– Recommended
– http://www.sbg.bio.ic.ac.uk/phyre/
• SwissModel (Swiss Institute of Bioinformatics)
– http://swissmodel.expasy.org//SWISS-MODEL.html
• MODBASE (precomputed models; Sali lab at UCSF)
– http://modbase.compbio.ucsf.edu/modbase-cgi/index.cgi

Summary (1)***
• Experimental determination of protein structure is
expensive and not always straightforward

• Predictive methods are relied upon to obtain clues to
protein fold (and function)

• Knowing what (which parts of a protein structure) you
can believe and what you can’t is critical for both
experimental and predicted structures

• Consensus and jury methods produce the best results
– E.g., protein structure prediction meta-servers
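
A jury method can be as simple as a per-position majority vote over the outputs of several servers. The sketch below shows this for secondary structure strings; the tie-breaking rule (the state that appears first among the inputs wins) is an assumption of this example, and real meta-servers typically weight their inputs by reliability.

```python
from collections import Counter


def consensus(predictions):
    """Return the per-position majority vote over equal-length predictions.

    predictions is a list of strings, one per server. Ties are resolved in
    favour of the state seen first in the list order (an assumed rule).
    """
    lengths = {len(p) for p in predictions}
    if len(lengths) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    result = []
    for column in zip(*predictions):
        # most_common keeps first-seen order among equally frequent states
        state, _count = Counter(column).most_common(1)[0]
        result.append(state)
    return "".join(result)
```

For three hypothetical server outputs, `consensus(["HHHCCE", "HHCCCE", "HECCEE"])` keeps the state that at least two of the three agree on at each position.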

Summary (2) ***
• Ab initio methods of protein fold prediction use physics-based energy minimization to simulate
the process of protein folding
– These methods are generally less successful than homology-based fold prediction (limited to short
peptides/small proteins)
– Exception: Rosetta/I-sites methods (Baker group) which employ both types of approach

• Threading methods fall into the homology-based class of approaches.
– 2D profiles depend primarily on amino acid sequence similarity but may also use secondary structure
(predicted or known) to improve accuracy
– 3D profiles use 3D models and score proteins on their inter-residue contacts, using the contacts
observed in the original structure template and contact potentials derived from other
structures
• Problems: computationally expensive; methods that do not also incorporate sequence similarity to the family lack precision
(alignment quality can suffer)
– Threading approaches are sometimes used to predict structure for non-homologous molecules
(but this is rarely very successful)

Summary (3) ***
• Community assessment of 2D and 3D structure prediction uses various approaches
– EVA and LiveBench (continuous real-time assessment of methods)
– CASP (Critical Assessment of Protein Structure Prediction)
– Benchmark datasets (e.g., Astral PDB40 for fold recognition by Park et al)

• Reported accuracy of 2D structure prediction is 75-77% (for the best methods)

• Reported accuracy of comparative models derived by 3D structure prediction servers is harder to
assess.

• Fold prediction (ignoring the comparative model construction) is fairly accurate for the best servers,
provided that
– A homologous structure has already been deposited in the PDB
– That structure can be detected with a significant E-value using sequence information alone (e.g., by
PSI-BLAST)

• The inclusion of secondary structure prediction (e.g., in 2D profiles) can improve the alignment and give a
modest boost to fold recognition accuracy when %ID is very low, but must be integrated with
sequence similarity appropriately to avoid errors in prediction

Questions on the reading
Reading: David Baker and Andrej Sali, “Protein Structure Prediction and Structural
Genomics”, Science 2001

1. What is the single most significant source of error in a comparative
model construction, if based on a template with <30% identity with the
target but that is detectable by BLAST (with a significant E-value)?
2. What is an additional probable source of error if the percent identity
drops below 20% (detectable by PSI-BLAST with a significant E-value)?
3. What is the reason cited by Baker and Sali for why errors in a
comparative model tend to not lie in functionally important sites such as
an enzyme active site?
4. What example do Baker and Sali give to demonstrate the utility of a low-
accuracy comparative model?
5. How would protein-protein interaction interfaces be predicted with a
comparative model?
6. What are the possible applications of a comparative model?
7. What fraction (approximately) of comparative models produced by
Rosetta for proteins <150 residues in length are considered accurate?
8. Does model refinement improve models or not? Under what conditions?
