
Structure and Function Prediction

Why is function prediction required?

Protein sequence
- large numbers of sequences, including whole genomes
- function: ?
Knowledge of protein function is important for:
- rational drug design and treatment of disease
- protein and genetic engineering
- building networks to model cellular pathways
- studying organismal function and evolution
What is the function of a protein?

• What does the protein bind?
• What reaction (if any) does it catalyse?
• What pathway does it belong to?
• What general cellular function does it participate in?
• Where is it found in the cell?
• What structure does it have?
Current picture of protein function prediction
Methods of Protein Function prediction

Computational Approach For Function Prediction


Sequence Homology Search
BLAST, Clustal
Pattern Search
Motifs, eMOTIF, InterPro
Structural Homology Search
Comparative Modelling
Threading
Ab Initio Method

Experimental Approach For Function Prediction


Gene Inactivation, e.g. knockouts
Gene Overexpression
cDNA Microarrays
Protein function prediction: Computational method

Full Genomes (Genome Databases) → Gene Prediction → Putative Proteins (Unknown)
Protein Databases (Known) → Annotated Proteins
Putative Proteins + Annotated Proteins → Search Method → Protein Function Prediction
Putative protein: a protein whose existence has been predicted (identified by bioinformatics tools and servers) but for which there is no experimental evidence.
Classification of Protein Databases

Sequence
Database : SWISS-PROT & PIR etc.
Algorithms: BLAST, Clustal

Structure
Database : PDB & CATH etc.
Algorithms: Threading

Pattern
Database : PROSITE & Pfam etc.
Algorithms: Motifs & InterPro
Sequence Homology Search
(Everything hinges on this step.)
Algorithms for Prediction of Protein Function
Link functionally related proteins by:

• Experimental data
• Related metabolic function
• Protein-protein interaction maps
• Related phylogenetic profiles
• Conserved gene clusters
• Rosetta Stone method (gene fusion)
• Conserved regulatory elements
• Correlated mRNA expression

Predict function of uncharacterized proteins using links with characterized proteins
Algorithms: Protein Function by Phylogenetic Profiles

Genomes (proteins present in each)
E. coli (EC):        P1 P2 P3 P4 P5 P6 P7
S. cerevisiae (SC):  P1 P2 P4 P5 P7
B. subtilis (BS):    P2 P3 P5 P6 P7
H. influenzae (HI):  P1 P3 P5 P6

Phylogenetic Profiles (1 = present, 0 = absent)
     EC  SC  BS  HI
P1    1   1   0   1
P2    1   1   1   0
P3    1   0   1   1
P4    1   1   0   0
P5    1   1   1   1
P6    1   0   1   1
P7    1   1   1   0

Pellegrini et al, Proc. Natl. Acad. Sci. USA (1999) 96, 4285
Profile Clusters

P4  1 1 0 0

P2  1 1 1 0
P7  1 1 1 0

P1  1 1 0 1

P5  1 1 1 1

P3  1 0 1 1
P6  1 0 1 1

Conclusion: P2 & P7 are functionally linked
            P3 & P6 are functionally linked
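The clustering step can be sketched in a few lines of Python. This is a minimal illustration, assuming that "functionally linked" simply means "identical presence/absence profile"; the function name cluster_identical_profiles is made up for this example, and the published method is not limited to exact matches.

```python
from collections import defaultdict

# Column order of every profile: EC, SC, BS, HI (values taken from the slide).
PROFILES = {
    "P1": (1, 1, 0, 1),
    "P2": (1, 1, 1, 0),
    "P3": (1, 0, 1, 1),
    "P4": (1, 1, 0, 0),
    "P5": (1, 1, 1, 1),
    "P6": (1, 0, 1, 1),
    "P7": (1, 1, 1, 0),
}


def cluster_identical_profiles(profiles):
    """Group proteins that share exactly the same phylogenetic profile."""
    groups = defaultdict(list)
    for protein, profile in profiles.items():
        groups[profile].append(protein)
    # Only groups with two or more proteins suggest a functional link.
    return [sorted(members) for members in groups.values() if len(members) > 1]


linked = cluster_identical_profiles(PROFILES)
print(linked)
```

With the slide's data the two groups are P2/P7 and P3/P6; P1, P4 and P5 each have a profile of their own.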
Combined Algorithm for prediction of Protein Function

Marcotte et al, Nature (1999) 402, 83-86


Gene Clusters to infer Functional Coupling

Overbeek et al, Proc. Natl. Acad. Sci. USA (1999) 96, 2896
Functional clusters in the glycolysis pathway
Using Protein Fusions to Predict Protein-Protein
Interactions

Fused domains: A-B (in a single protein)
Separate (coregulated) proteins: A and B
Fused proteins act as “Rosetta Stones”

Marcotte et al, Science (1999) 285, 751-3
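The gene-fusion idea can be illustrated with a small sketch. It assumes each protein has already been reduced to a tuple of domain labels (in practice this would come from sequence similarity searches); the genome contents, protein names and the function find_rosetta_links are all hypothetical.

```python
from itertools import combinations


def find_rosetta_links(separate_genome, other_genome):
    """Return (protein_1, protein_2, fused_protein) triples.

    Two separate proteins are linked when their domains are disjoint and
    both sets of domains occur together in one fused protein elsewhere.
    """
    links = []
    for fused_name, fused_domains in sorted(other_genome.items()):
        if len(fused_domains) < 2:
            continue  # a single-domain protein cannot be a fusion
        fused = set(fused_domains)
        for (p, dp), (q, dq) in combinations(sorted(separate_genome.items()), 2):
            dp, dq = set(dp), set(dq)
            if dp and dq and dp.isdisjoint(dq) and dp <= fused and dq <= fused:
                links.append((p, q, fused_name))
    return links


# Hypothetical toy genomes: protein name -> domains it contains.
genome_1 = {"A": ("domA",), "B": ("domB",), "C": ("domC",)}
genome_2 = {"AB_fused": ("domA", "domB"), "C2": ("domC",)}

links = find_rosetta_links(genome_1, genome_2)
print(links)
```

Here A and B are predicted to interact because they appear as one fused chain in the second genome, while C has no fusion partner.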


Examples of Protein Fusion
PSI-BLAST (Position Specific Iterative BLAST)
Detect weak relationships between the query and members of the
database not necessarily detectable by standard BLAST searches

Steps involved in running PSI-BLAST

• A profile (or position specific scoring matrix, PSSM) is
constructed (automatically) from a multiple alignment of the
highest scoring hits in an initial BLAST search. The PSSM is
generated by calculating position-specific scores for each
position in the alignment
• The profile is used to perform a second (etc.) BLAST search
and the results of each "iteration" used to refine the profile
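To make "position-specific scores" concrete, here is a simplified sketch of building a PSSM from an ungapped toy alignment. It assumes a uniform background frequency and a flat pseudocount; PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts, so this is an illustration of the idea rather than its actual procedure. The names build_pssm and score are invented for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(alignment[0])):
        # Count residues in this column, ignoring gaps or unknown symbols.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of a segment as long as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical alignment of the top hits from an initial search.
toy_alignment = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]
pssm = build_pssm(toy_alignment)
print(round(score(pssm, "MKVLA"), 2), round(score(pssm, "WWWWW"), 2))
```

A segment resembling the aligned family scores well above an unrelated one, which is what lets the next search pick up weaker relatives.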
• A PSI-BLAST query is identical to a BLAST query with added
specification by the user of the expectation (E) value cut-off for
inclusion of a match in the first and subsequent iterations. The E
value cut-off can always be over-ridden by the user on a case by
case basis if a sequence hit of interest is worse than the threshold.

• The initial PSI-BLAST search uses the same matrix options
available for BLAST, since it is a BLAST search.

• Each iteration of the search uses a position-specific substitution
matrix built from the search results of the previous iteration.

• The user can continue to search iteratively until satisfied that no
new matches will be identified. The point at which no new hits are
identified by additional searches is known as "convergence".
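The iterate-until-convergence loop can be sketched as follows. The search itself is replaced by a toy stand-in (a lookup table of which sequences are detectable from which), so TOY_NEIGHBOURS, toy_find_hits and iterate_to_convergence are all assumptions made for illustration, not part of any BLAST interface.

```python
def iterate_to_convergence(query, find_hits, max_iterations=20):
    """Repeat the search until an iteration adds no new sequences."""
    included = {query}
    for iteration in range(1, max_iterations + 1):
        new_hits = find_hits(included) - included
        if not new_hits:
            return included, iteration  # converged: nothing new was found
        included |= new_hits
    return included, max_iterations


# Toy stand-in for a profile search: which sequences each member can detect.
TOY_NEIGHBOURS = {
    "query": {"A", "B"},
    "A": {"query", "C"},
    "B": {"query"},
    "C": {"A", "D"},
    "D": {"C"},
    "E": {"F"},
    "F": {"E"},
}


def toy_find_hits(members):
    hits = set()
    for member in members:
        hits |= TOY_NEIGHBOURS.get(member, set())
    return hits


members, iterations = iterate_to_convergence("query", toy_find_hits)
print(sorted(members), iterations)
```

Distant relatives C and D are reached only through intermediate hits, and the unrelated pair E/F is never pulled in; the fourth iteration finds nothing new, which is the convergence point.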
PSI-BLAST tutorial

Two objectives:
(1) to identify distant relatives of the MJ0577 family and
(2) to gain insight into the function of this family of proteins.

>gi|2501594|sp|Q57997|Y577_METJA PROTEIN MJ0577
MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVAGLNKSVEEFE
NELKNKLTEEAKNKMENIKKELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIMGSHGKTNLKEILLG
SVTENVIKKSNKPVLVVKRKNS

Database: ‘nr’
Expect: 1
Low complexity filter: On
Matrix: BLOSUM62
Descriptions: 250
Alignments: 100
Inclusion Threshold: 0.001
Examine Descriptions
In a PSI-BLAST search, hits are divided into two categories:

• Those with E values better than the inclusion threshold (listed first)
• Those with E values worse than the threshold but still better than
the Expect value of 1 selected on the query page (listed further
down the page)

Hits with E values better than the threshold are used in forming the
profile that will be used in subsequent PSI-BLAST iterations
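A small sketch of this two-way split, using the thresholds from the tutorial settings (inclusion threshold 0.001, Expect 1). The hit names and E values are invented, and partition_hits is a hypothetical helper.

```python
def partition_hits(hits, inclusion_threshold=0.001, expect=1.0):
    """Split (name, E value) pairs into profile members and listed-only hits."""
    included, listed_only = [], []
    for name, evalue in hits:
        if evalue < inclusion_threshold:
            included.append(name)      # used to build the next profile
        elif evalue < expect:
            listed_only.append(name)   # reported, but not used in the profile
    return included, listed_only


# Hypothetical hits; hit_5 is worse than Expect and is not reported at all.
toy_hits = [("hit_1", 2e-14), ("hit_2", 5e-4), ("hit_3", 0.02),
            ("hit_4", 0.7), ("hit_5", 3.0)]
included, listed_only = partition_hits(toy_hits)
print(included, listed_only)
```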

This hit corresponds to the database entry associated
with determination of the MJ0577 structure

The next set of entries (pink) are orthologous sequences in
other Archaeal species

The green entries are, for the most part, more distantly related
Archaeal sequences, and also bacterial sequences

Any of the sequences in the list of "Sequences with E-value worse than
threshold" can be added to the sequences used to generate the PSI-BLAST profile

Examine Pairwise Alignments

• If the aligned residues are predominantly hydrophobic, it may indicate
that a transmembrane or coiled-coil domain in the query is causing non-specific hits

• The alignment of a short query sequence with a large protein indicates
that the query may share a domain in common with this large protein
FIRST ITERATION
Although the majority of these hits are unannotated, several have
known functions or resemble proteins with known functions.

Pick the annotated proteins and run a reverse PSI-BLAST to filter
out the true functional homologs, i.e., run PSI-BLAST with each
annotated protein (or its gi number) as the query and look at the
domains of similarity.

For example:
MJ0577 shows similarity with the 780 aa cationic amino acid
transporter. On reverse BLAST:
• similarity to amino acid transporters resides between aa 45 and
aa 440, for the most part.
• Modest similarity to MJ0577 is localized to a separate
region of the protein, aa 550 - aa 720.
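The reasoning in this example comes down to asking whether the region that resembles the query overlaps the region that carries the annotated function. A minimal sketch, using the residue ranges quoted above; the helper name overlap_length is hypothetical.

```python
def overlap_length(region_a, region_b):
    """Number of residues shared by two inclusive (start, end) ranges."""
    start = max(region_a[0], region_b[0])
    end = min(region_a[1], region_b[1])
    return max(0, end - start + 1)


transporter_like = (45, 440)   # similarity to amino acid transporters
mj0577_like = (550, 720)       # modest similarity to MJ0577

shared = overlap_length(transporter_like, mj0577_like)
print(shared)
```

No residues are shared, which suggests the hit reflects a separate domain of the large protein, so the transporter annotation should probably not be transferred to MJ0577.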
MJ0577 is probably a member of the Universal Stress Protein (UspA)
family or a putative transcriptional regulator, because:

• Similarity is reasonable
• Alignments are respectable

Perform reverse PSI-BLAST on this UspA protein
and on the putative transcriptional regulator protein
Reverse Blast on Transcriptional Regulator protein
Reverse Blast on universal stress response protein

The MJ0577 protein structural database homolog appears among
the significant hits, along with the MJ0577 relative, MJ0533

The fact that MJ0577 and uspA each bring up the other as a high
scoring hit in a PSI-BLAST search lends confidence to the
hypothesis that MJ0577 may be a distant relative of the uspA
family.
Second, Third & subsequent iterations
In the subsequent PSI-BLAST iterations the E-value of the UspA hit
drops sharply (the hit becomes more significant) and finally converges:
BLAST hit 2e-14
1st 6e-14
2nd 8e-38
3rd 3e-41
4th 1e-40
5th 4e-41
6th 3e-41
The 3rd, 4th, 5th & 6th iterations show that the E-value has
converged for UspA at about 1e-40.
Therefore, MJ0577 may be a distant relative of the uspA
family.
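One way to make "has converged" explicit is to compare E values on a log scale. The sketch below assumes a tolerance of one order of magnitude between consecutive iterations; that tolerance and the function name first_converged_iteration are choices made for this illustration, not a PSI-BLAST rule.

```python
import math


def first_converged_iteration(evalues, tolerance=1.0):
    """Return the first iteration (1-based) after which log10(E) stays
    within `tolerance` from one iteration to the next, or None."""
    logs = [math.log10(e) for e in evalues]
    for start in range(len(logs) - 1):
        steps = [abs(logs[i + 1] - logs[i]) for i in range(start, len(logs) - 1)]
        if all(step < tolerance for step in steps):
            return start + 1
    return None


# E values of the UspA hit for iterations 1 to 6, as listed on the slide.
uspa_evalues = [6e-14, 8e-38, 3e-41, 1e-40, 4e-41, 3e-41]
converged_at = first_converged_iteration(uspa_evalues)
print(converged_at)
```

Under this tolerance the values are stable from the 3rd iteration onward, in line with the slide's reading of iterations 3 to 6.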
Other Evidence

MJ0577 structure provides clues to its ATP binding property

The purified MJ0577 protein showed no appreciable ATPase
activity in vitro. However, when M. jannaschii extract was
added to the assay, ATP was significantly hydrolyzed to ADP,
suggesting that the ATP hydrolysis activity of MJ0577 protein
requires additional components
A motif is a similar 3-D structure, conserved among different proteins, that serves a similar function.

Domains, on the other hand, are regions of a protein that have a specific function and can (usually)
function independently of the rest of the protein.

Pattern / Profile Search

A protein domain is a conserved part of a given protein sequence and tertiary structure that can
evolve, function, and exist independently of the rest of the protein chain. Each domain forms a
compact three-dimensional structure and often can be independently stable and folded.
Function Prediction:pattern matching
Protein Sequence Motifs
PROSITE
Biologically-significant protein patterns and profiles
http://www.expasy.ch/prosite/
Pfam
Multiple sequence alignments and hidden Markov models of common
protein domains
http://www.sanger.ac.uk/Software/Pfam/
PRINTS
Protein sequence motifs and signatures
http://bioinf.man.ac.uk/dbbrowser/PRINTS/
COGs
Phylogenetic pattern based
http://www.ncbi.nlm.nih.gov/COG/
BLOCKS
Protein sequence motifs and alignments
http://www.blocks.fhcrc.org/
PROSITE database

PROSITE is a database of protein families and domains. It consists of
biologically significant sites, patterns and profiles that help to reliably
identify to which known protein family (if any) a new sequence belongs
Function Prediction:pattern matching

• Most proteins can be grouped, on the basis of similarities in
their sequences, into a limited number of families.

• Within a protein sequence family, some regions are better conserved
than others during evolution.

• These regions are generally important for the function of a protein and/or for
the maintenance of its 3-dimensional structure.

• By analyzing the constant and variable properties of such groups of similar
sequences, it is possible to derive a signature for a protein family or domain,
which distinguishes its members from all other unrelated proteins.

• PROSITE is thus a database of biologically significant sites and patterns
formulated in this way.
Function Prediction:pattern matching

Criteria for pattern creation

A good signature pattern -

• Must be as short as possible
• Should detect all or most of the sequences it is
designed to describe
• Should not give too many false positive results
• Must exhibit both high sensitivity and high specificity.
Function Prediction:pattern matching

Steps in the development of a new pattern

• Start by studying review(s) on a group or family of proteins
• Build an alignment table of the proteins discussed in that review
(also add new sequences if available)
• Look at the residues and regions thought to be important to the
biological function of that group of proteins, e.g.:
– Enzyme catalytic sites.
– Prosthetic group attachment sites (heme, pyridoxal-phosphate,
biotin, etc.).
– Amino acids involved in binding a metal ion.
– Cysteines involved in disulfide bonds.
– Regions involved in binding a molecule (ADP/ATP, GDP/GTP,
calcium, DNA, etc.) or another protein.
Function Prediction:pattern matching

Steps in the development of a new pattern

• Try to find a short (not more than four or five residues long)
conserved sequence in this region (called the ‘core’ pattern)
• Scan a recent version of the SWISS-PROT knowledgebase with
these core pattern(s)
• If a ‘core’ pattern detects all the proteins under consideration and
none (or very few) of the other proteins, we can stop at this stage
and use the core pattern as a bona fide signature
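The stop-or-continue decision can be sketched by scanning a labelled sequence set with a candidate core pattern and counting hits. Everything here is hypothetical: the core pattern GKS[AT], the toy sequences standing in for SWISS-PROT, and the function evaluate_pattern. The sketch also assumes the set contains at least one family member and one non-member.

```python
import re


def evaluate_pattern(regex, database):
    """Return (sensitivity, specificity) of a pattern on labelled sequences.

    `database` is a list of (sequence, is_family_member) pairs.
    """
    tp = fp = fn = tn = 0
    for sequence, is_member in database:
        hit = re.search(regex, sequence) is not None
        if hit and is_member:
            tp += 1
        elif hit:
            fp += 1
        elif is_member:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical stand-in for a database scan: (sequence, belongs to family?).
toy_database = [
    ("MAGKSTLL", True), ("TTGKSAVV", True), ("PLGKSTQQ", True),
    ("MAAAALLL", False), ("QQGKSTPP", False), ("VVVVKKKK", False),
    ("AGHTRPLL", False),
]

sensitivity, specificity = evaluate_pattern(r"GKS[AT]", toy_database)
print(sensitivity, specificity)
```

In this toy case the core pattern finds every family member but also one unrelated sequence, so whether to stop depends on how many false positives count as "very few".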
Function Prediction:pattern matching

Methodology for the development
of profile entries
Function Prediction:pattern matching

Programs based on PROSITE patterns

InterPro Scan - Integrated search in PROSITE, Pfam, PRINTS and other family and domain databases
MotifScan - Scans a sequence against protein profile databases (including PROSITE)
Frame-ProfileScan - Scans a short DNA sequence against protein profile databases (including PROSITE)
PPSEARCH - Scans a sequence against PROSITE (allows a graphical output)
PROSITE scan - Scans a sequence against PROSITE (allows mismatches)
MOTIFS - Scans a sequence against PROSITE patterns (GCG software)
Function Prediction:pattern matching

Input example:
>gi|21264520 Tyrosine-protein kinase SRC-1 (p60-SRC-1)
MGATKSKPREGGPRSRSLDIVEGSHQPFTSLSASQTPNKSLDSHRPPAQPFGGNC
DLTPFGGINFSDTITSPQRTGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVN
NTEGDWWLARSLSSGQTGYIPSNYVAPSDSIQAEEWYLGKITRREAERLLLSLE
NPRGTFLVRESETTKGAYCLSVSDYDANRGLNVKHYKIRKLDSGGFYITSRTQFI
SLQQLVAYYSKHADGLCHRLTTVCPTAKPQTQGLSRDAWEIPRDSLRLELKLGQ
GCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYA
VVSEEPIYIVTEYISKGSLLDFLKGEMGRYLRLPQLVDMAAQIASGMAYVERMN
YVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAAL
YGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPDCPE
SLHDLMFQCWRKDPEERPTFEYLQAFLEDYFTATEPQYQPGDNL
Function Prediction:pattern matching

Output
gi-21264520.pep Check: 6666 Length: 532 ! gi|21264520 Tyrosine-protein kinase SRC-1

Protein_Kinase_Atp (L,I,V)G~(P)G~(P)(F,Y,W,M,G,S,T,N,H)(S,G,A)~(P,W)(L,I,V,C,A,T)
~(P,D)x(G,S,T,A,C,L,I,V,M,F,Y)x{5,18}(L,I,V,M,F,Y,W,C,S,T,A,R)(A,I,V,P)(L,I,V,M,F,
A,G,C,K,R)K(L)G~PG~P(F)(G)~(P,W)(V)~(P,D)x(G)x{7}(V)(A)(I)K
PROSITE allows the following pattern elements

For example:
[FILV]Qxxx[RK]Gxxx[RK]xK{ST}x[FILVWY]{P}

where
• x signifies any amino acid,
• square brackets enclose the alternative amino acids allowed at a position;
for example, [RK] denotes R or K,
• a string of characters drawn from the alphabet and enclosed in braces (curly
brackets) denotes any amino acid except for those in the string; for
example, {ST} denotes any amino acid other than S or T.

If a pattern is restricted to the N-terminal of a sequence, the pattern is prefixed
with '<'.
If a pattern is restricted to the C-terminal of a sequence, the pattern is suffixed
with '>'.
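These pattern elements map almost one-to-one onto regular expressions. The sketch below is a simplified converter, assuming only the elements described above plus the x(n) / x(n,m) repeat and the optional hyphen separators used in PROSITE's own notation; prosite_to_regex is a name invented here, and the second pattern is transcribed from the MOTIFS output shown earlier for the SRC kinase.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".").replace("-", "")
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "<":                      # anchored at the N-terminus
            out.append("^")
        elif ch == ">":                    # anchored at the C-terminus
            out.append("$")
        elif ch == "x":                    # any amino acid
            out.append(".")
        elif ch == "[":                    # one of the listed residues
            j = pattern.index("]", i)
            out.append("[" + pattern[i + 1:j] + "]")
            i = j
        elif ch == "{":                    # anything except the listed residues
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j
        elif ch == "(":                    # repeat count, e.g. x(5,18)
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j
        elif ch.isupper():                 # a literal residue
            out.append(ch)
        else:
            raise ValueError("unexpected character: " + ch)
        i += 1
    return "".join(out)


slide_regex = prosite_to_regex("[FILV]Qxxx[RK]Gxxx[RK]xK{ST}x[FILVWY]{P}")

kinase_regex = prosite_to_regex(
    "[LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-[LIVCAT]-{PD}-x-"
    "[GSTACLIVMFY]-x(5,18)-[LIVMFYWCSTAR]-[AIVP]-[LIVMFAGCKR]-K"
)
# Fragment of the SRC-1 input sequence shown earlier.
match = re.search(kinase_regex, "LKLGQGCFGEVWMGTWNGTTRVAIKTLK")
print(slide_regex)
print(match.group(0) if match else None)
```

The kinase pattern picks out the glycine-rich stretch ending in the conserved lysine, the same region reported in the MOTIFS output.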
Function Prediction:pattern matching

Pfam is a large collection of multiple sequence alignments
and hidden Markov models covering many common protein
domains and families.
Pfam:
Pfam-A : the curated part of Pfam; coverage counts protein
sequences that have at least one match to a Pfam-A family.
Pfam-B : a large number of small families taken from
the PRODOM database that do not overlap
with Pfam-A.

• Sequence coverage Pfam-A : 73%
• Sequence coverage Pfam-B : 20%
• Others
Function Prediction :Pattern Matching

For each family in Pfam you can find:

• Multiple alignments
• Pfam domain architectures
• Non-Pfam domain annotations (e.g. transmembrane region
location and low-complexity regions)
• Known protein structures
• Active site information
• Species distribution
Function Prediction :Pattern Matching

Input example:

>gi|21264520 Tyrosine-protein kinase SRC-1 (p60-SRC-1)
MGATKSKPREGGPRSRSLDIVEGSHQPFTSLSASQTPNKSLDSHRPPAQPFGGNCDLTP
FGGINFSDTITSPQRTGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWW
LARSLSSGQTGYIPSNYVAPSDSIQAEEWYLGKITRREAERLLLSLENPRGTFLVRESE
TTKGAYCLSVSDYDANRGLNVKHYKIRKLDSGGFYITSRTQFISLQQLVAYYSKHADGL
CHRLTTVCPTAKPQTQGLSRDAWEIPRDSLRLELKLGQGCFGEVWMGTWNGTTRVAIKT
LKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYISKGSLLDFLKGEMG
RYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDN
EYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLD
QVERGYRMPCPPDCPESLHDLMFQCWRKDPEERPTFEYLQAFLEDYFTATEPQYQPGDN
L
Function Prediction :Pattern Matching

Output
Starting search. Estimated time: 14 seconds (assuming all Wulfpack nodes are running).
Please wait...
Pfam HMM search results, glocal+local alignments merged (Pfam_ls+Pfam_fs)
[Go here for an explanation of the format of the results]

Model       Seq-from  Seq-to  HMM-from  HMM-to  Score  E-value  Alignment  Description
!! SH3            83     139         1      58   99.2  5.5e-27  glocal     SH3 domain
!! SH2           147     229         1      79  140.5  2e-39    glocal     SH2 domain
!! Pkinase       266     515         1     294  265.0  6.7e-77  glocal     Protein kinase domain
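A result table like this can be turned into a domain architecture by sorting the significant hits by their start position. The sketch below parses the three rows shown above; the whitespace-separated layout, the 1e-3 cut-off and the function names parse_hits and architecture are assumptions made for this example, not a documented Pfam output format.

```python
PFAM_OUTPUT = """\
!! SH3 83 139 1 58 99.2 5.5e-27 glocal SH3 domain
!! SH2 147 229 1 79 140.5 2e-39 glocal SH2 domain
!! Pkinase 266 515 1 294 265.0 6.7e-77 glocal Protein kinase domain
"""


def parse_hits(text):
    """Parse whitespace-separated hit lines into dictionaries."""
    hits = []
    for line in text.splitlines():
        fields = line.lstrip("! ").split(None, 8)
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        model, seq_from, seq_to, _, _, score, evalue, _, description = fields
        hits.append({
            "model": model,
            "start": int(seq_from),
            "end": int(seq_to),
            "score": float(score),
            "evalue": float(evalue),
            "description": description,
        })
    return hits


def architecture(hits, max_evalue=1e-3):
    """Return model names of significant hits, ordered along the sequence."""
    significant = [h for h in hits if h["evalue"] <= max_evalue]
    return [h["model"] for h in sorted(significant, key=lambda h: h["start"])]


hits = parse_hits(PFAM_OUTPUT)
domains = architecture(hits)
print(" - ".join(domains))
```

For the SRC kinase query this gives the order SH3, SH2, protein kinase domain from N- to C-terminus.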
Function Prediction :Pattern Matching

Alignments of top-scoring domains:


Function Prediction :Pattern Matching

Alignment of query with seed


Function Prediction :Pattern Matching

Description:
Function Prediction :Pattern Matching

Links to other Databases

• PROSITE
• PRINTS
• PDB
• BLOCKS
• SMART
• PRODOM
• InterPro &
• CDD
Function Prediction :Pattern Matching

• PRINTS is a compendium of protein fingerprints.
• A fingerprint is a group of conserved motifs used to
characterize a protein family.
• Usually the motifs within a fingerprint do not overlap.
• Fingerprints can encode protein folds and
functionalities more flexibly and powerfully than can
single motifs.
Function Prediction :Pattern Matching

ADVANTAGES OVER OTHERS

• It detects the distant relationships of proteins.

• It is based on the multi-conserved motifs approach.

• It performs better than PROSITE in certain areas, e.g. transmembrane
proteins and globular proteins.

• Release 39.0 of PRINTS contains 1950 entries, encoding 11,625
individual motifs.

• Overall, the database is still relatively small, largely because the detailed
annotation of fingerprints is so time-consuming. Nevertheless, the
additional, unique information contained in PRINTS makes this database
a useful adjunct to PROSITE, Pfam, Blocks, etc.
Function Prediction :Pattern Matching

Input example:

>gi|21264520 Tyrosine-protein kinase SRC-1 (p60-SRC-1)
MGATKSKPREGGPRSRSLDIVEGSHQPFTSLSASQTPNKSLDSHRPPAQPFGGNCDLTP
FGGINFSDTITSPQRTGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWW
LARSLSSGQTGYIPSNYVAPSDSIQAEEWYLGKITRREAERLLLSLENPRGTFLVRESE
TTKGAYCLSVSDYDANRGLNVKHYKIRKLDSGGFYITSRTQFISLQQLVAYYSKHADGL
CHRLTTVCPTAKPQTQGLSRDAWEIPRDSLRLELKLGQGCFGEVWMGTWNGTTRVAIKT
LKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYISKGSLLDFLKGEMG
RYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDN
EYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLD
QVERGYRMPCPPDCPESLHDLMFQCWRKDPEERPTFEYLQAFLEDYFTATEPQYQPGDN
L
Function Prediction :Pattern Matching

Output
Function Prediction :Pattern Matching

Top scoring fingerprints for query.


Structural Homology Search
Structural Classification of Proteins (SCOP)
• Class: types of folds, e.g. beta sheets
• Fold: different shapes of domains within a class
– Similar topology
• Superfamily: domains in a fold are grouped into superfamilies
– Probable common evolutionary origin
– Generally low sequence identity
– Structure and function suggest a common origin
• Family: domains in a superfamily are grouped into families
– Clear evolutionary relationship
– Pairwise sequence identities are generally >30%
– Function is also used as a criterion
• PDB Domains
Searching for "Subtilisin" brings up the protein "Subtilisin from Bacillus subtilis, carlsberg", with the
following lineage.
Lineage:
1. Root: scop
2. Class: Alpha and beta proteins (a/b) [51349] Mainly parallel beta sheets (beta-alpha-beta units)
3. Fold: Subtilisin-like [52742] 3 layers: a/b/a, parallel beta-sheet of 7 strands, order 2314567; left-handed
crossover connection between strands 2 & 3
4. Superfamily: Subtilisin-like [52743]
5. Family: Subtilases [52744]
6. Protein: Subtilisin [52745]
7. Species: Bacillus subtilis, carlsberg [TaxId: 1423] [52746]
Scop Structure
Protein classification by structure

• Proteins are broadly classified by types of secondary
structures and their arrangement into 6+ CLASSES of
topology

• There are several research groups that have classified
structures into similar CLASSES – SCOP, CATH and
FSSP are examples
SCOP CLASSES

Class 1. Class α is comprised of a bundle of α helices connected by loops
on the surface of the proteins.

Class 2. Class β is comprised of antiparallel β sheets, usually two sheets in close
contact forming a sandwich.

Class 3. Class α/β is comprised of mainly parallel β sheets with intervening α helices.

Class 4. Class α+β is comprised mainly of segregated α helices and antiparallel β sheets.
SCOP classes (2)

Class 5. Multi-domain (α and β) proteins comprised of domains
representing more than one of the above four classes.

Class 6. Membrane and cell surface proteins and peptides,
excluding proteins of the immune system.
Transmembrane Region Prediction
• DAS: Prediction of transmembrane alpha-helices
www.sbc.su.se/~miklos/DAS/

• HMMTOP: An automatic server for predicting
transmembrane helices and topology of proteins
www.enzim.hu/hmmtop/

• Predictprotein: A service for sequence analysis and
structure prediction.
dodo.cpmc.columbia.edu/predictprotein/

• TMAP
www.mbb.ki.se/tmap/

• Toppred2: Topology prediction of membrane proteins.
www.sbc.su.se/~erikw/toppred2/
From Sequence to Structure

• Relationship between sequence and 3-dimensional structure

• Start with the sequence and try to predict the 3D structure

• Methods for structure prediction

• Structural Databases
Relationship between sequence and structure

• There must be a relationship!
– any given sequence folds into a unique
3-dimensional structure
• The relationship is not simple!
– if it were then prediction of structure would be
straightforward
• Complexity arises as a result of the large number of atomic
interactions (each with a small ΔG) which stabilize any
given 3-dimensional structure
The relationship between sequence and
structure boils down to what mutations the
structure can tolerate without
unfolding/aggregating/precipitating or
becoming inactive
Many conformations of the amino
acid chain are possible due to the
rotation of the chain about each Cα
atom.

Example:
Glycine takes on a special position, as
it has the smallest side chain, only
one hydrogen atom, and therefore
can increase the local flexibility in the
protein structure.

Cysteine, on the other hand, can react
with another cysteine residue and
thereby form a cross link stabilizing
the whole structure.
Protein structure prediction CASP

• Critical Assessment of protein Structure Prediction


• CASP3 1998, CASP4 December 2000
• <1% of known protein sequences have an
experimentally determined structure
• CASP4
– 11,136 models submitted
– 43 different targets
– 163 different research groups (20 countries)
CASP4

• Comparative modeling
– Clear sequence relationship with an experimentally
determined structure
• Fold recognition
– No easily detectable sequence relationship
– But the fold is known
• Ab initio prediction
– No direct use of known structures
Comparative Modeling
Tertiary Structure Prediction Methods

Any given protein sequence
→ Compare the sequence with proteins that have a solved structure
   • > 35% identity: Homology Modeling
   • < 35% identity: Fold Recognition
   • < 35% identity: ab initio Folding
→ Structure selection
→ Structure refinement
→ Final Structure
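The branching in this flow can be written as a small decision function. The 35% cut-off comes from the slide; using "is a known fold recognisable?" to separate fold recognition from ab initio folding follows the CASP4 categories described earlier, and treating exactly 35% as the lower branch is an arbitrary choice. The function name choose_method is hypothetical.

```python
def choose_method(identity_percent, fold_recognised):
    """Pick a tertiary structure prediction route for a target sequence.

    identity_percent: best sequence identity to a protein of known structure.
    fold_recognised:  True if a known fold can be assigned despite low identity.
    """
    if identity_percent > 35.0:
        return "homology modeling"
    if fold_recognised:
        return "fold recognition"
    return "ab initio folding"


for identity, fold in [(62.0, True), (22.0, True), (12.0, False)]:
    print(identity, fold, "->", choose_method(identity, fold))
```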
Comparative Modeling

• Goal is to compete with experimental methods


Approximate structure is guaranteed
• Sequence Alignment
– Correct alignment is central to this approach
– Multiple sequence and structure alignments
– Threshold appears to be 30% sequence identity
• Sidechain conformation
– Database methods and limited refinement
Alpha helix and beta sheets
• The α helix has 3.6 amino acids
per turn, with an H bond formed
between each residue and the
residue four positions further
along the chain; the average
length is 10 amino acids (3 turns),
or about 15 Å at a rise of 1.5 Å
per residue, but varies from 5 to
40 amino acids (1.5 to 11 turns).

• β sheets are formed by H bonds
between an average of 5–10
consecutive amino acids in one
portion of the chain with
another 5–10 farther down the
chain.
Comparative Modeling

• Loops
– Database methods and local ab initio approaches
• Refinement
– Main chain conformation produced by copying the
template is approximate (rmsd ~1.5 Å)
– Need some version of energy minimisation
– Still very poor at prediction

• The root-mean-square deviation (RMSD) is the measure of the average
distance between the atoms (usually the backbone atoms)
of superimposed proteins.
• On average, smaller RMSD values are associated with pairs of protein
structures determined at better resolution, such as very high-resolution
structures (1.8 Å or better).
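As a worked illustration of the RMSD definition, the sketch below computes it for two equal-length lists of atom coordinates. It assumes the two structures have already been superimposed and that atoms are paired in order; finding the optimal superposition is a separate step not shown here. The coordinates are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) coordinates in Å."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical backbone atoms: the model is the template shifted by 1.5 Å in y.
template = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.0, 1.5, 0.0), (1.5, 1.5, 0.0), (3.0, 1.5, 0.0)]

print(rmsd(template, model))
```

Every atom is displaced by the same 1.5 Å, so the RMSD is 1.5 Å, comparable to the main-chain error quoted above for a copied template.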
Homology Modelling
• Need alignment
– Profile methods
– Multiple possible alignments, can split or average
• Build model
• Refine loops
– Database methods
– Random conformation
– Score: best using a real force field (AMBER, CHARMM, GROMOS)
• Refine sidechains
– Works best in core residues
– Often based on rotamer libraries from the PDB

Rotamers are usually defined as low-energy side-chain conformations. Using a
pre-built library of rotamers allows anyone determining or modeling a structure to try the
most likely side-chain conformations, saving time and producing a structure that is more
likely to be correct.
Looking at Structures: Resolution
• Resolution is a measure of the quality of the data that has been collected on the
crystal containing the protein or nucleic acid.
• If all of the proteins in the crystal are aligned in an identical way, forming a
near-perfect crystal, then all of the proteins will scatter X-rays the same way, and the
diffraction pattern will show the fine details of the crystal. On the other hand, if the
proteins in the crystal are all slightly different, due to local flexibility or motion,
the diffraction pattern will not contain as much fine information.
• So resolution is a measure of the level of detail present in the diffraction pattern
and the level of detail that will be seen when the electron density map is
calculated.
• High-resolution structures, with resolution values of 1 Å or so, are highly ordered
and it is easy to see every atom in the electron density map.
• Lower resolution structures, with resolution of 3 Å or higher, show only the basic
contours of the protein chain, and the atomic structure must be inferred. Most
crystallographically defined structures of proteins fall in between these two extremes.
• As a general rule of thumb, we have more confidence in the location of atoms in
structures with resolution values that are small, called "high-resolution structures".
Target-Multiple Template Alignment

• Alignment is prepared by superimposing all
template structures
• Add target sequence to this alignment
• Compare with multiple sequence alignment and
adjust
Swiss-Model

• Fully automated homology modeling


• BLAST on PDB
• Multiple sequence alignment incorporating multiple
structure alignments
• Add the target sequence to the alignment
• Model building
Model Building

• Build an averaged framework from superimposed template
structures
• Generate atomic coordinates for your sequence using the
alignment
• Rebuild non-conserved loops
• Complete the mainchain
• Reconstruct the side-chains (using allowed rotamers)
• Energy optimisation
– 500 cycles of gradient energy minimisation
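To show what a fixed number of gradient minimisation cycles does, here is a deliberately tiny sketch: steepest descent on a single harmonic bond-length term. The force constant, ideal length and step size are illustrative values only; a real modelling server minimises a full force field over all atoms, which this does not attempt to reproduce.

```python
def bond_energy(length, ideal=1.53, k=300.0):
    """Harmonic bond energy 0.5 * k * (length - ideal)^2 (illustrative units)."""
    return 0.5 * k * (length - ideal) ** 2


def minimise_bond(length, ideal=1.53, k=300.0, step=0.001, cycles=500):
    """Run a fixed number of steepest-descent cycles on one bond length."""
    for _ in range(cycles):
        gradient = k * (length - ideal)   # dE/d(length)
        length -= step * gradient         # move downhill
    return length


start = 1.80                              # a strained starting bond length in Å
relaxed = minimise_bond(start)
print(round(relaxed, 4), bond_energy(start) > bond_energy(relaxed))
```

Each cycle removes a fixed fraction of the remaining strain, so after 500 cycles the bond has relaxed to its ideal length.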
ModBase
• http://pipe.rockefeller.edu/modbase-cgi/index.cgi
• A database of annotated comparative protein structure models
• Uses PSI-BLAST and MODELLER
• 26,814,632 “reliable” models from 50,441 pdb structures
• Automated
– Fold assignment (PSI-BLAST)
– Sequence/structure alignment (ALIGN2D)
– Model building (MODELLER)
– Model evaluation
ModWeb
A Server for Protein Structure Modeling
http://modbase.compbio.ucsf.edu/ModWeb20-html/modweb.html
Errors in Homology Modeling

a) Side chain packing
b) Distortions and shifts
c) No template
d) Misalignments
e) Incorrect template

Marti-Renom et al., Ann. Rev. Biophys. Biomol. Struct., 2000, 29:291-325.
