Lecture 3

Structural and Functional Bioinformatics
For 3D structure prediction there are two basic approaches:
(1) compare the protein with proteins of known structure, or
(2) predict the structure from the sequence alone, using physical laws and empirical knowledge.
Item (1) can be subdivided into comparative modeling by
(1.1) sequence-sequence comparison (alignment) and
(1.2) sequence-structure comparison (threading).
If, after a global sequence alignment, the identity between two proteins is 25-45%, the two structures are usually similar.
When the identity is above about 45%, the structures are expected to be nearly identical.
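As a rough illustration of these identity ranges, the following Python sketch computes percent identity from two already-aligned sequences and maps it onto the rule of thumb quoted in the lecture. The function names, the gap character "-", and the choice to divide by the number of columns in which both sequences have a residue are assumptions made for the example; other definitions of percent identity are in use.

```python
def percent_identity(aln_a, aln_b, gap="-"):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns without a gap in either sequence.
    pairs = [(x, y) for x, y in zip(aln_a, aln_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


def structural_expectation(identity):
    """Map percent identity onto the rule-of-thumb ranges from the lecture."""
    if identity > 45:
        return "structures expected to be nearly identical"
    if identity >= 25:
        return "structures expected to be similar"
    return "no conclusion from sequence identity alone"


if __name__ == "__main__":
    # Toy aligned pair: 9 ungapped columns, 7 of them identical.
    pid = percent_identity("ACDEFGHIKL", "ACDQFGH-KV")
    print(round(pid, 1), structural_expectation(pid))
```

The thresholds are heuristics, so the returned strings describe an expectation rather than a guarantee.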
For (1.1), alignment methods like BLAST are used; for (1.2), various threading methods have been introduced.
Threading methods place a new sequence on a known structure and compute how well the new sequence fits that structure, e.g. how many hydrophobic amino acids are buried.
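The idea of scoring how well a sequence fits a known structure can be sketched in a few lines of Python. This is a toy illustration only, not any published threading method: the hydrophobic residue set, the `buried` mask describing the template, and the function name `burial_fit` are all assumptions made for the example.

```python
# Assumed hydrophobic set for illustration; real methods use finer-grained potentials.
HYDROPHOBIC = set("AVILMFW")


def burial_fit(sequence, buried):
    """Score a sequence placed residue-by-residue on a template.

    buried[i] is True when template position i lies in the protein core.
    A position scores +1 when a hydrophobic residue is buried or a polar
    residue is exposed, and -1 otherwise.
    """
    if len(sequence) != len(buried):
        raise ValueError("sequence and template must have the same length")
    score = 0
    for residue, is_buried in zip(sequence, buried):
        fits = (residue in HYDROPHOBIC) == is_buried
        score += 1 if fits else -1
    return score


if __name__ == "__main__":
    template = [True, False, True, True, False, False]  # hypothetical burial pattern
    print(burial_fit("LKVIDE", template))  # hydrophobic residues sit in the core
    print(burial_fit("KLDEVI", template))  # same composition, poor fit
```

Two sequences with the same composition can score very differently on the same template, which is the signal a threading method looks for.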
Item (2) includes “ab initio prediction” and molecular
modeling or quantum mechanical modeling.
Tertiary structure
• Tertiary structure refers to the three-dimensional structure of a single protein molecule. The alpha-helices and beta-sheets are folded into a compact globule.
• The folding is driven by non-specific hydrophobic interactions (the burial of hydrophobic residues away from water), but the structure is stable only when the parts of a protein domain are locked into place by specific tertiary interactions, such as salt bridges, hydrogen bonds, disulfide bonds, and the tight packing of side chains.
PROTEIN MODELING
• Predicting the 3D structure of a protein from its amino acid sequence
Computational methods for Protein Modeling
• Homology or Comparative Modeling
• Fold Recognition or Threading Methods
• Ab initio methods, with or without the aid of knowledge-based information
Why do we need computational approaches?
The goal of research in structural genomics is to provide the means to characterize and identify the large number of protein sequences that are being discovered.
Knowledge of the three-dimensional structure
• helps in the rational design of site-directed mutations
• can be of great importance for the design of drugs
• greatly enhances our understanding of how proteins function and how they interact with each other; for example, it can explain antigenic behaviour, DNA binding specificity, etc.
Structural information from X-ray crystallography or NMR
• is obtained much more slowly
• requires elaborate technical procedures
• is out of reach for many proteins, which fail to crystallize at all and/or cannot be obtained or dissolved in large enough quantities for NMR measurements
• is limited by the size of the protein in the case of NMR
With a good computational method, a structure can be obtained much faster.

Protein Homology modeling
• Homology modeling is an extrapolation of protein structure for a target sequence, using the known 3D structure of a similar sequence as a template.
• Basis: proteins with similar sequences are likely to adopt the same fold.
• Certain proteins with as little as 25% similarity have been observed to adopt the same 3D structure.
Bioinformatics how to … use publicly available free tools to predict protein structure by comparative modeling
Proteins are 3D objects with
complex shapes
• Over 60,000 protein structures have been determined, mostly by X-ray crystallography (PDB)
• The 3D structure of ~70% of bacterial and 50% of human proteins can be predicted (comparative modeling)
A predicted model simply illustrates our assumptions
No assumptions, this is nature telling us how it is
GNAAAAKKGSEQESVKEFLAKAKEDFLKKWENPA
QNTAHLDQFERIKTLGTGSFGRVMLVKHKETGNH
FAMKILDKQKVVKLKQIEHTLNEKRILQAVNFPF
LVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIG
RFSEPHARFYAAQIVLTFEYLHSLDLIYRDLKPE
NLLIDQQGYIQVTDFGFAKRVKGRTWTLCGTPEY
LAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPF
FADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNL
LQVDLTKRFGNLKDGVNDIKNHKWFATTDWIAIY
QRKVEAPFIPKFKGPGDTSNFDDYEEEEIRVSIN
EKCGKEFSEF
Sequence → Assumption (protein A is similar to protein B) → Result (protein A is similar to protein B)
How do we know that these proteins are similar?
• Unknown protein: GLLTTKFVSLLQEAKDGVLDLKLAADTLAVRQKRRIYDITNVLEGIGLIEKKSKNSIQW
• Well studied protein: SRRSASHPTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAA
Similarity prediction
How can we make such assumptions?
• Statistical reliability of the prediction
• E-value: the number of hits one can "expect" to see just by chance when searching a database of a particular size (the closer to zero, the better)
• Z-score: the score expressed as a distance from the mean, measured in standard deviations (the bigger, the better)
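Both statistics can be illustrated with a short Python sketch. The Z-score function follows the definition above directly. The E-value function uses the Karlin-Altschul form E = K·m·n·e^(-λS) that underlies BLAST statistics; the function names are hypothetical, and the values of K and λ must come from the scoring system in use, so none are hard-coded here.

```python
import math
from statistics import mean, pstdev


def z_score(score, background_scores):
    """Distance of `score` from the background mean, in standard deviations."""
    sd = pstdev(background_scores)
    if sd == 0:
        raise ValueError("background scores have zero spread")
    return (score - mean(background_scores)) / sd


def e_value(score, query_len, db_len, k, lam):
    """Expected number of chance hits scoring at least `score`.

    k and lam are the Karlin-Altschul parameters of the scoring system;
    the caller must supply them.
    """
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # Hypothetical scores of the same query against shuffled sequences.
    background = [10, 12, 8, 10]
    print(round(z_score(30, background), 2))
```

A higher raw score raises the Z-score and lowers the E-value, which is why the two are read in opposite directions.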
Similar, but not homologous
• Phosphoribosyltransferase and viral coat protein, identity: 42%, different folds, different functions
99 IRLKSYCNDQSTGDIKVIGGDDLSTLTGKNVLIVEDIIDTGKTMQTLLSLVRQY.NPKMVKVASLLVKRTPRSVGY 173
: ||. ||| || |. || | : | | | | || | || |:| | ||.| |
214 VPLKTDANDQ.IGDSLY....SAMTVDDFGVLAVRVVNDHNPTKVT..SKVRIYMKPKHVRV...WCPRPPRAVPY 279
Different, but homologous
• Histone H5 and transcription factor E2F4, identity 7%, similar fold, similar function (DNA binding)
PTYSEMIAAAIRAEKSRGGSSRQSIQKYIKSHYKVGHNADLQIKLSIRRLLAAGVLKQTKGVGASGSFRL
| | | | |
GLLTTKFVSLLQEAKD-GVLDLKLAADTLA------VRQKRRIYDITNVLEGIGLIEKKS----KNSIQW
COMPARATIVE or HOMOLOGY MODELING
The aim is to build a 3-D model for a protein of unknown structure
(target) on the basis of sequence similarity to proteins of known
structure (templates).
• Accuracy ranges from getting the correct fold to high resolution
• The most accurate structure prediction method. Why?
-- 3D structures of proteins in a given family are
more conserved than their sequences
-- ~1/3 of all sequences are recognizably related to at
least one known structure
-- the number of unique protein folds is limited
Structure prediction flowchart:
Rob Russell http://speedy.embl-heidelberg.de/gtsp/

Next class on Monday, September 30
Steps
• Identify template(s)
• Initial alignment
• Improve alignment
• Backbone generation
• Loop modelling
• Side chains
• Refinement
• Validation
Steps in Homology Modeling
Figure 5.1.1 from M.A. Marti-Renom and A. Sali, “Modeling Protein Structure from Its Sequence”, Current Protocols in Bioinformatics (2003), 5.1.1-5.1.32
Step 1: Template Selection
Explore the following web links for the structure prediction exercise
Databases
GenBank www.ncbi.nlm.nih.gov/GenBank
GeneCensus bioinfo.mbb.yale.edu/genome
MODBASE http://modbase.compbio.ucsf.edu
PDB www.rcsb.org/pdb/
eMOTIF http://brutlag.stanford.edu/projects.html
UniProt http://www.uniprot.org/
Template Search
BLAST http://www.ncbi.nlm.nih.gov/BLAST/
FastA http://www.ebi.ac.uk/fasta33/
SSM http://www.ebi.ac.uk/msd-srv/ssm/
PredictProtein http://www.predictprotein.org/
123D; SARF2; PDP http://123d.ncifcrf.gov/
GenTHREADER http://bioinf.cs.ucl.ac.uk/psipred/
UCLA-DOE http://fold.doe-mbi.ucla.edu/
Template Search
→ BLAST
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.
Template Search
Fasta Protein Database Query
Provides sequence similarity searching against nucleotide and protein databases using the Fasta programs. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. You can also conduct sequence similarity searching against complete proteome or genome databases using the Fasta programs.
Template Search
Thread 1- to 3-D with 123D+
123D+ is a program which combines sequence profiles, secondary structure prediction, and contact capacity potentials to thread a protein sequence through the set of structures.
Template Search
PSIPRED - a highly accurate method for protein secondary structure prediction
MEMSAT2 - our widely used transmembrane topology prediction method
GenTHREADER - a sequence profile based fold recognition method
Template Search
FAD or NADP binding Rossmann fold detector
Fold Recognition
Internal Repeat Finder
MOMENT Transmembrane Helix Prediction
Motif-Based Fold Assignment
DPANN: Sequence to Structure Alignment
Profile Search Software: Bowie et al. 1991
Step 2: Sequence Alignment
This is the most crucial step in the process.
Homology modeling cannot recover from a bad alignment.
Sequence Alignment
EMBOSS http://www.ebi.ac.uk/emboss/align/
Tcoffee http://www.igs.cnrs-mrs.fr/Tcoffee
ClustalW http://www.ebi.ac.uk/clustalw/
BCM http://searchlauncher.bcm.tmc.edu/multi-align/
POA http://www.bioinformatics.ucla.edu/poa/
STAMP http://www.ks.uiuc.edu/Research/vmd/
SwissModel http://www.expasy.org/spdbv/
Sequence Alignment
Pairwise Alignment Algorithms
This tool is used to compare 2 sequences. When you want an alignment that
covers the whole length of both sequences, use needle. When you are trying to
find the best region of similarity between two sequences, use water.
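The EMBOSS programs needle and water implement global (Needleman-Wunsch) and local (Smith-Waterman) alignment, respectively. The difference can be seen in a minimal Python sketch that computes only the optimal score. The scoring scheme here (match +1, mismatch -1, linear gap -1) and the function name `align_score` are assumptions for illustration; the EMBOSS programs themselves use substitution matrices and affine gap penalties.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch), both sequences end to end.
    local=True:  local alignment (Smith-Waterman), best-scoring pair of substrings.
    """
    # First row: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            value = max(prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                        prev[j] + gap,      # gap in b
                        cur[j - 1] + gap)   # gap in a
            if local:
                value = max(value, 0)       # a local alignment may restart anywhere
                best = max(best, value)
            cur.append(value)
        prev = cur
    return best if local else prev[-1]


if __name__ == "__main__":
    long_seq, short_seq = "XXXGATTACAYYY", "GATTACA"
    print(align_score(long_seq, short_seq))              # global: 1
    print(align_score(long_seq, short_seq, local=True))  # local: 7
```

The global score is dragged down by the six unmatched flanking residues, while the local score reports only the shared region.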
Sequence Alignment
T-Coffee is more accurate than ClustalW for sequences with less than 30% identity, but it is slower.
Sequence Alignment
ClustalW is a general purpose multiple sequence alignment program for DNA or proteins. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can be seen by viewing cladograms or phylograms.
Sequence Alignment
Multiple Sequence Alignment Using Partial Order Graphs
Christopher Lee, Catherine Grasso, & Mark Sharlow, Bioinformatics 2002 18:452-464
Sequence Alignment
STAMP is a suite of programs for the comparison and alignment of protein three dimensional structures.
Sequence Alignment
Deep View Swiss-PdbViewer
by Nicolas Guex, Alexandre Diemand, Manuel C., & Torsten Schwede
Threading
A.A. Substitution Matrix
A C D E F G H I K L M N P Q R S T V W Y
A 5 -2 0 1 -2 0 0 -1 0 -1 0 0 1 0 -1 1 0 0 -2 -2
C -2 8 -2 -3 -3 -2 0 -2 -3 -3 0 -2 -3 -3 -2 -1 -1 -2 -1 -2
D 0 -2 5 2 -2 0 1 -3 0 -2 -1 2 0 1 -2 0 0 -2 -3 -2
E 1 -3 2 5 -3 0 -1 -2 1 -2 -2 1 1 2 0 1 1 -1 -2 -1
F -2 -3 -2 -3 6 -3 1 0 -3 2 2 -3 -2 -3 -2 -1 -2 0 3 3
G 0 -2 0 0 -3 5 -1 -2 0 -2 -2 0 0 -1 0 0 -1 -1 -2 -3
H 0 0 1 -1 1 -1 5 -1 1 -1 0 1 0 1 2 0 1 -1 0 1
I -1 -2 -3 -2 0 -2 -1 5 -2 2 2 -2 -2 -3 -2 -1 0 2 9 0
K 0 -3 0 1 -3 0 1 -2 5 -1 -2 1 0 1 2 0 0 -1 -2 -2
L -1 -3 -2 -2 2 -2 -1 2 -1 5 3 -2 -2 0 -1 -1 0 2 0 0
M 0 0 -1 -2 2 -2 0 2 -2 3 5 -1 -2 0 -2 -1 0 1 -2 -1
N 0 -2 2 1 -3 0 1 -2 1 -2 -1 5 -2 1 0 2 0 -2 -3 -1
P 1 -3 0 1 -2 0 0 -2 0 -2 -2 -2 8 0 0 0 0 -1 -3 -3
Q 0 -3 1 2 -3 -1 1 -3 1 0 0 1 0 5 2 1 0 -1 -1 -2
R -1 -2 -2 0 -2 0 2 -2 2 -1 -2 0 0 2 5 1 0 -1 0 -1
S 1 -1 0 1 -1 0 0 -1 0 -1 -1 2 0 1 1 5 2 -1 0 0
T 0 -1 0 1 -2 -1 1 0 0 0 0 0 0 0 0 2 5 0 -1 -2
V 0 -2 -2 -1 0 -1 -1 2 -1 2 1 -2 -1 -1 -1 -1 0 5 -1 0
W -2 -1 -3 -2 3 -2 0 9 -2 0 -2 -3 -3 -1 0 0 -1 -1 6 3
Y -2 -2 -2 -1 3 -3 1 0 -2 0 -1 -1 -3 -2 -1 0 -2 0 3 6
F↔F: 6, F↔Y: 3, F↔K: -3
Alignment Matrix
Sequence A: VATTPDKSWLTV
Sequence B: ASTPERASWLGTA

   V  A  T  T  P  D  K  S  W  L  T  V
A  0  5  0  0  1  0  0  1 -2 -1  0  0
S -1  1  2  2  0  0  0  5  0 -1  2 -1
T  0  0  5  5  0  0  0  2 -1  0  5  0
P -1  1  0  0  8  0  0  0 -3 -2  0 -1
E -2  1  1  1  1  2  1  1 -2 -2  1 -2
R -1 -1  0  0  0 -2  2  1  0 -1  0 -1
A  0  5  0  0  1  0  0  1 -2 -1  0  0
S -1  1  2  2  0  0  0  5  0 -1  2 -1
W -1 -2 -1 -1 -3 -3 -2  0  6  0 -1 -1
L  2 -1  0  0 -2 -2 -1 -1  0  5  0  2
G -1  0 -1 -1  0  0  0  0 -2 -2 -1 -1
T  0  0  5  5  0  0  0  2 -1  0  5  0
A  0  5  0  0  1  0  0  1 -2 -1  0  0

VATTPDK-SWLTV-      VATTPDK-SWL-TV
 |*||** |||          |*||** ||| |*
-ASTPERASWLGTA      -ASTPERASWLGTA
score 39            score 45
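The two totals can be checked by summing the substitution score of every aligned residue pair. The Python sketch below does this for both alignments of VATTPDKSWLTV and ASTPERASWLGTA; the pair scores are copied from the lecture's substitution matrix, restricted to the pairs that actually occur. Scoring gap columns as 0 is an assumption made here because it reproduces the totals of 39 and 45; the slides do not state a gap penalty, and the names `PAIR_SCORES` and `alignment_score` are hypothetical.

```python
# Substitution scores for the residue pairs occurring in the two alignments,
# taken from the lecture's matrix. The matrix is symmetric, so pairs are
# stored as unordered sets (a one-letter set is an identical pair).
PAIR_SCORES = {
    frozenset("A"): 5, frozenset("T"): 5, frozenset("P"): 8,
    frozenset("S"): 5, frozenset("W"): 6, frozenset("L"): 5,
    frozenset("TS"): 2, frozenset("DE"): 2, frozenset("KR"): 2,
    frozenset("TG"): -1, frozenset("VT"): 0, frozenset("VA"): 0,
}


def alignment_score(row_a, row_b, gap_score=0):
    """Sum the substitution scores over the columns of a gapped alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap_score  # assumed gap score, see the text above
        else:
            total += PAIR_SCORES[frozenset((x, y))]
    return total


if __name__ == "__main__":
    print(alignment_score("VATTPDK-SWLTV-", "-ASTPERASWLGTA"))  # 39
    print(alignment_score("VATTPDK-SWL-TV", "-ASTPERASWLGTA"))  # 45
```

Moving a single gap so that T pairs with T instead of G accounts for the whole difference between the two scores.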
Multiple Sequence Alignment
Sequence A: LTLTLTLT
Sequence B: HAHAHAHAH
Sequence C: THTHTHTHT

Pairwise alignments of A and B:

LTLTLTLT-      -LTLTLTLT
HAHAHAHAH      HAHAHAHAH
score -4       score 0

Alignment with C included:

-LTLTLTLT-
  | | | |
THTHTHTHT-
 | | | |
-HAHAHAHAH
The third sequence from a homologous protein allows alignment.

It’s a very good idea to have more than one template.
