UGUR

Computational Methods in
Molecular Modelling
Uğur Sezerman
Biological Sciences and Bioengineering Program
Sabancı University, Istanbul
Motivation
Knowing the structure of molecules
enables us to understand their mechanisms
of function
Current experimental techniques
X-ray crystallography
NMR
PROTEIN FOLDING
PROBLEM
STARTING FROM AMINO ACID SEQUENCE
FINDING THE STRUCTURE OF PROTEINS IS
CALLED THE PROTEIN FOLDING PROBLEM
Forces driving protein
folding
It is believed that hydrophobic collapse is
a key driving force for protein folding
Hydrophobic core
Polar surface interacting with solvent
Minimum volume (no cavities), van der Waals interactions
Disulfide bond formation stabilizes the fold
Hydrogen bonds
Polar and electrostatic interactions
SECONDARY STRUCTURE
PREDICTION
Intro. To Struc.
(Tooze and Branden)
Secondary Structure
Prediction
AGVGTVPMTAYGNDIQYYGQVT…
A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…
AGVGTVPMTAYGNDIQYYGQVT…
----hhhHHHHHHhhh--eeEE…
Chou-Fasman Parameters
Name           Abbrv  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        A      142   83    66       0.06   0.076   0.035   0.058
Arginine       R      98    93    95       0.07   0.106   0.099   0.085
Aspartic Acid  D      101   54    146      0.147  0.11    0.179   0.081
Asparagine     N      67    89    156      0.161  0.083   0.191   0.091
Cysteine       C      70    119   119      0.149  0.05    0.117   0.128
Glutamic Acid  E      151   37    74       0.056  0.06    0.077   0.064
Glutamine      Q      111   110   98       0.074  0.098   0.037   0.098
Glycine        G      57    75    156      0.102  0.085   0.19    0.152
Histidine      H      100   87    95       0.14   0.047   0.093   0.054
Isoleucine     I      108   160   47       0.043  0.034   0.013   0.056
Leucine        L      121   130   59       0.061  0.025   0.036   0.07
Lysine         K      114   74    101      0.055  0.115   0.072   0.095
Methionine     M      145   105   60       0.068  0.082   0.014   0.055
Phenylalanine  F      113   138   60       0.059  0.041   0.065   0.065
Proline        P      57    55    152      0.102  0.301   0.034   0.068
Serine         S      77    75    143      0.12   0.139   0.125   0.106
Threonine      T      83    119   96       0.086  0.108   0.065   0.079
Tryptophan     W      108   137   96       0.077  0.013   0.064   0.167
Tyrosine       Y      69    147   114      0.082  0.065   0.114   0.125
Valine         V      106   170   50       0.062  0.048   0.028   0.053
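As a rough illustration of how the table is used, the sketch below averages the P(a) or P(b) values over a sliding window. This is only the averaging step, not the full Chou-Fasman nucleation and extension rules; the function names and the window widths are assumptions made for the example. Values above 100 indicate that the window favours that conformation.

```python
# Propensities copied from the table above (only the residues used in the demo).
P_ALPHA = {"A": 142, "E": 151, "L": 121, "M": 145, "G": 57,
           "P": 57, "V": 106, "I": 108, "Y": 69, "T": 83}
P_BETA = {"A": 83, "E": 37, "L": 130, "M": 105, "G": 75,
          "P": 55, "V": 170, "I": 160, "Y": 147, "T": 119}


def window_average(seq, table, start, width):
    """Mean propensity of the residues in seq[start:start + width]."""
    window = seq[start:start + width]
    return sum(table[r] for r in window) / len(window)


def propensity_profile(seq, table, width):
    """Mean propensity for every full-width window along the sequence."""
    return [window_average(seq, table, i, width)
            for i in range(len(seq) - width + 1)]


if __name__ == "__main__":
    # Hypothetical peptides chosen only to show a helix-like and a strand-like window.
    print(window_average("AEELMA", P_ALPHA, 0, 6))
    print(window_average("VIYVT", P_BETA, 0, 5))
```

A window whose average P(a) is well above 100 and above its average P(b) would be a helix candidate under this simplified reading.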
Computational Approaches
Ab initio methods
Threading
Comparative Modelling
Fragment Assembly
Ab-initio protein structure prediction as
an optimization problem
1. Define a function that maps protein
structures to some quality measure.
2. Solve the computational problem of
finding an optimal structure.
3. ☺
[Figure: energy plotted against conformation. Chen Keasar, BGU]
A dream function
☺ Has a clear minimum in the native structure.
☺ Has a clear path towards the minimum.
☺ Global optimization algorithm should find the
native structure.
Chen Keasar
BGU
An approximate function
☺ Easier to design and compute.
Native structure not always the global minimum.
Global optimization methods do not converge. Many
alternative models (decoys) should be generated.
No clear way of choosing among them.
Decoy set
Chen Keasar
BGU
Fold Optimization
Simple lattice models (HP-models)
Two types of residues:
hydrophobic and polar
2-D or 3-D lattice
The only force is
hydrophobic collapse
Score = number of H−H
contacts
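The score above can be sketched for a 2-D square lattice as follows. The function name `hp_score` and the coordinate representation (one `(x, y)` pair per residue) are assumptions made for the example; only H residues that are lattice neighbours but not chain neighbours are counted.

```python
def hp_score(sequence, coords):
    """Count H-H contacts between lattice neighbours that are not chain neighbours."""
    position = {xy: i for i, xy in enumerate(coords)}
    if len(position) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that every pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


if __name__ == "__main__":
    # A four-residue chain folded around a unit square: the two ends touch.
    print(hp_score("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))
    # The same chain fully extended has no H-H contact.
    print(hp_score("HPPH", [(0, 0), (1, 0), (2, 0), (3, 0)]))
```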
Scoring Lattice Models
H/P model scoring:
Sometimes:
Penalize for buried polar or surface
hydrophobic residues
Learning from Lattice
Models
Ken Dill ~ 1997
Hydrophobic zipper effect

[Figure: levels of detail in protein models. Basic element: electrons & protons, atom, extended atom, half a residue, residue, some residues. Other labels: diamond lattice, torsion angle lattice, fine square lattice, fragments, continuous; Hinds & Levitt. Chen Keasar, BGU]
What can we do with
lattice models?
For smaller polypeptides, exhaustive
search can be used
Looking at the “best” fold, even in such a
simple model, can teach us interesting things
about the protein folding process
For larger chains, other optimization and
search methods must be used
Greedy, branch and bound
Evolutionary computing, simulated annealing
Graph theoretical methods
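For short chains, the exhaustive search mentioned above can be sketched by enumerating every self-avoiding walk on a 2-D square lattice and keeping the one with the most H-H contacts. The names `count_hh` and `best_fold` are hypothetical, and the number of walks grows exponentially with chain length, so this is only practical for small examples.

```python
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def count_hh(sequence, coords):
    """Number of H-H lattice contacts between residues not adjacent in the chain."""
    index = {xy: i for i, xy in enumerate(coords)}
    total = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                total += 1
    return total


def best_fold(sequence):
    """Return (best score, coordinates) over all self-avoiding walks."""
    if not sequence:
        return 0, []
    best = (-1, None)

    def extend(coords, occupied):
        nonlocal best
        if len(coords) == len(sequence):
            score = count_hh(sequence, coords)
            if score > best[0]:
                best = (score, list(coords))
            return
        x, y = coords[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                coords.append(nxt)
                occupied.add(nxt)
                extend(coords, occupied)
                coords.pop()
                occupied.remove(nxt)

    extend([(0, 0)], {(0, 0)})
    return best


if __name__ == "__main__":
    score, coords = best_fold("HPPH")
    print(score, coords)
```

Symmetric copies of each fold (rotations and reflections) are enumerated too; a more careful version would fix the first move to remove them.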
Inverse Protein Folding
Problem
Given a structure (or a functionality) identify
an amino acid sequence whose fold will be
that structure (exhibit that functionality).
Crucial problem in drug design.

NP-hard under most models.
PROTEIN THREADING
Thread the given sequence onto the
different structural families that exist in
structural databases
Choose the optimum structure based on
the potential energy function used (e.g. contact
potential, free energy)
Threading: Fold
recognition
Given:
Sequence:
IVACIVSTEYDVMKAAR
…
A database of
molecular coordinates
Map the sequence
onto each fold
Evaluate
Objective 1: improve
scoring function
Objective 2: folding
Protein Fold Families
(CATH,SCOP)
CATH website
www.cathdb.info
Genetic Algorithm used as
a search tool
We are searching for the minima of our fitness function, which is composed of
profile and contact energy terms.
Value encoding has been used in this problem. Parents are represented as
strings of positions. The population size is 50.
A sample parent (string of positions) is shown below:
1 2 3 4 5 10 11 12 13 14 23 24 25 26 27 28 29 30 31 32 55 56 57 58
A branch-and-bound algorithm has been used to produce random initial
parents.
Mutation:
The mutation operator shifts a structure's position to the right
or left by some units.
Crossover:
Two-point crossover is applied, in which selected suitable structures are
exchanged between two parents.
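The two operators can be sketched as below. This is one possible reading, not the authors' implementation: a parent is stored as the start position of each secondary structure unit (so the sample parent, read as positions 1 to 5, 10 to 14, 23 to 32 and 55 to 58, becomes starts `[1, 10, 23, 55]` with lengths `[5, 5, 10, 4]`), and a child is rejected when units overlap or run past the sequence. All function names, the sequence length of 60 and the shift limit are assumptions.

```python
import random


def is_valid(starts, lengths, seq_len):
    """Units must stay in order, must not overlap, and must fit in the sequence."""
    prev_end = 0
    for start, length in zip(starts, lengths):
        if start <= prev_end:
            return False
        prev_end = start + length - 1
    return prev_end <= seq_len


def mutate(starts, lengths, seq_len, max_shift, rng):
    """Shift one randomly chosen unit left or right; keep the parent if invalid."""
    child = list(starts)
    k = rng.randrange(len(child))
    child[k] += rng.choice([-1, 1]) * rng.randint(1, max_shift)
    return child if is_valid(child, lengths, seq_len) else list(starts)


def two_point_crossover(a, b, lengths, seq_len, rng):
    """Exchange the units between two cut points; keep the parents if invalid."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    c1 = a[:i] + b[i:j] + a[j:]
    c2 = b[:i] + a[i:j] + b[j:]
    if is_valid(c1, lengths, seq_len) and is_valid(c2, lengths, seq_len):
        return c1, c2
    return list(a), list(b)


if __name__ == "__main__":
    rng = random.Random(0)
    lengths = [5, 5, 10, 4]
    parent_a = [1, 10, 23, 55]   # the sample parent, as unit start positions
    parent_b = [2, 12, 25, 50]   # a second, made-up parent
    print(mutate(parent_a, lengths, 60, 3, rng))
    print(two_point_crossover(parent_a, parent_b, lengths, 60, rng))
```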
Our Aim
In this research, we have threaded a
structurally unknown protein sequence onto
more than 2200 SCOP family fold proteins and
sought the best-fitting structural family.
We also tried to find the optimum fit of
the query sequence to a given fold.
Fitness Function
The energy function is a combination of
The sequence profile energy
The contact potential energy (inter- and intra-
structural residues are taken into account)
TotalEnergy = p1 × ProfileEnergy + c1 × ContactEnergy
The weights are chosen such that the contributions
of the profile and contact energy terms are
equal.
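One simple way to read "equal contributions" is to scale the contact term so that, over a reference set of threadings, both terms have the same mean magnitude. The sketch below does that; the balancing rule, the function names and the numbers are assumptions made for illustration, not the weights used in the study.

```python
def balance_weights(profile_energies, contact_energies):
    """Return (p1, c1) so both terms have equal mean absolute size on the samples."""
    mean_profile = sum(abs(e) for e in profile_energies) / len(profile_energies)
    mean_contact = sum(abs(e) for e in contact_energies) / len(contact_energies)
    return 1.0, mean_profile / mean_contact


def total_energy(profile_energy, contact_energy, p1, c1):
    """TotalEnergy = p1 * ProfileEnergy + c1 * ContactEnergy."""
    return p1 * profile_energy + c1 * contact_energy


if __name__ == "__main__":
    # Made-up energies in which the contact term is ten times larger.
    p1, c1 = balance_weights([-10.0, -20.0], [-100.0, -200.0])
    print(p1, c1, total_energy(-10.0, -100.0, p1, c1))
```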
Profile Energy
We do structural alignment on all selected secondary
structural units of the sequences.
Same numbered secondary structural units are
selected.
Length of the units may differ.
-- P E E L L L R W A N F H L E N ( 1aoa)
-- S E K I L L K W V R Q T -- -- -- (1qag)
N S E K I L L S W V R Q S T R -- (1dxx)
Sixth helices of the selected all-alpha sequences

Profile matrix calculated from a structure group
Residue names (columns; "-" is a gap):
A C D E F G H I K L M N P Q R S T V W Y -
-0.33 -0.67 0.68 0.01 -1.33 0.01 0.34 -1 0.01 -1.33 -0.67 2.34 -0.67 0.01 -0.33 0.34 0.01 -1 -1.33 -0.67 4.01
0.34 -2 -0.33 -1 -3.33 -0.67 -1.33 -3 -0.33 -3.33 -2.33 0.01 2.68 -0.33 -1.67 3.01 1.01 -2.33 -4 -2.33 0.01
-1 -3 2.01 6.01 -3 -3 0.01 -4 1.01 -3 -2 0.01 -1 2.01 0.01 -1 -1 -3 -3 -2 0.01
-1 -3 0.01 2.68 -3.67 -2.33 0.01 -3.33 4.34 -3 -2 0.01 -1 2.01 2.01 -0.33 -1 -3 -3 -2 0.01
-1.33 -2 -4 -3.67 0.34 -4 -3.67 4.01 -3 3.01 2.34 -3.33 -3.33 -2.67 -3.67 -3 -1 3.01 -2.67 -1 0.01
-2 -2 -4 -3 1.01 -4 -3 2.01 -3 5.01 3.01 -4 -4 -2 -3 -3 -1 1.01 -2 -1 0.01
-2 -2 -4 -3 1.01 -4 -3 2.01 -3 5.01 3.01 -4 -4 -2 -3 -3 -1 1.01 -2 -1 0.01
0.01 -2 -0.67 -0.67 -3 -1 -0.67 -3.33 1.01 -3 -2 0.34 -1.67 0.34 1.68 3.01 1.01 -2.33 -3.67 -1.67 0.01
-3 -5 -5 -3 1.01 -3 -3 -3 -3 -2 -1 -4 -4 -1 -3 -4 -3 -3 15.01 2.01 0.01
1.68 -1 -3.33 -2.33 -1.67 -2.67 -3.33 2.34 -2.33 0.01 0.34 -2.33 -2.33 -2.33 -2.67 -1 0.01 3.34 -3 -1.33 0.01
-1.67 -3.33 -0.67 0.01 -3.33 -2 0.34 -3.67 2.01 -3.33 -2 1.68 -2.67 0.68 4.34 -0.33 -0.67 -3 -3.33 -1.33 0.01
-1.67 -2.67 -1.67 0.34 0.01 -2.67 0.34 -2 0.01 -1 0.01 -1.33 -2 3.34 -0.33 -1 -1.33 -2.33 -0.33 0.68 0.01
-0.33 -1.67 -0.67 -0.67 -2 -1.33 2.34 -2.67 -0.33 -2.33 -1.33 0.68 -1.33 0.01 -0.67 2.01 1.68 -2 -3.33 -0.67 0.01
-0.67 -1 -1.67 -1.33 -0.33 -2 -1.67 0.34 -1.33 1.34 0.68 -1.33 -1.67 -1 -1.33 -0.33 1.34 0.34 -1.67 -1 2.01
-1 -2.33 0.01 2.01 -2 -2 0.01 -2.67 1.34 -2 -1.33 -0.33 -1.33 1.01 2.34 -0.67 -0.67 -2 -2 -1 2.01
-0.33 -0.67 0.68 0.01 -1.33 0.01 0.34 -1 0.01 -1.33 -0.67 2.34 -0.67 0.01 -0.33 0.34 0.01 -1 -1.33 -0.67 4.01
(Rows: positions; values: profile scores)
Contact Potential Energy
Based on the frequency of
contacts counted in a database of known
structures and converted into energy values.
In this study, contact potential energy is
the sum of the energies of the residues that
are closer than seven angstroms
to each other.
Jernigan's and Dill's contact potential
energy tables have been used.
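The sum described above can be sketched as follows. The pair energies in `TOY_POTENTIAL` are invented for illustration and are not taken from the Jernigan or Dill tables; the function names, the one-point-per-residue coordinates and the `min_separation` parameter are likewise assumptions.

```python
import math

# Toy contact energies for illustration only; NOT values from the published tables.
TOY_POTENTIAL = {("L", "L"): -1.0, ("K", "L"): -0.2, ("K", "K"): 0.3}


def pair_energy(a, b, table):
    """Look up a symmetric pair energy; keys are stored in alphabetical order."""
    return table[tuple(sorted((a, b)))]


def contact_energy(residues, coords, table, cutoff=7.0, min_separation=1):
    """Sum pair energies over residue pairs closer than `cutoff` angstroms."""
    total = 0.0
    for i in range(len(residues)):
        for j in range(i + min_separation, len(residues)):
            if math.dist(coords[i], coords[j]) < cutoff:
                total += pair_energy(residues[i], residues[j], table)
    return total


if __name__ == "__main__":
    residues = "LKL"
    coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    # Residues 0-1 and 1-2 are 5 A apart (in contact); 0-2 are 10 A apart (not).
    print(contact_energy(residues, coords, TOY_POTENTIAL))
```

Raising `min_separation` would skip residues that are close only because they are neighbours in the chain; whether the study did so is not stated.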
Selected Benchmark Set
All-alpha set: 1aoa, 1dxx, 1qag
Fold: Calponin-homology domain, CH-domain (core: 4 helices; bundle)
Superfamily: Calponin-homology domain, CH-domain
Family: Calponin-homology domain, CH-domain
All-beta set: 1acx, 1hzk, 1noa, 2mcm
Fold: Immunoglobulin-like beta-sandwich (sandwich; 7 strands in 2 sheets)
Superfamily: Actinoxanthin-like
Family: Actinoxanthin-like
Alpha+beta set: 1dwn, 1e6t, 1frs, 1qbe, 1una
Fold: RNA bacteriophage capsid protein (6-stranded beta-sheet followed by 2 helices; meander)
Superfamily: RNA bacteriophage capsid protein
Family: RNA bacteriophage capsid protein
Secondary structure prediction results of the family of all-alpha proteins
Eight helices of the following sequences are selected,
each sequence is threaded onto the others, and the shifts
from the real structures are shown below.
Template  | Target sequences
sequences | 1aoa                | 1dxx                 | 1qag
1aoa      | T T T T T T T 30    | T T T -6 -1 -1 T 27  | 1 T T T T 12 T T
1dxx      | T T T -4 1 5 4 9    | T -3 T -5 T T T T    | 3 T T T 1 T 41 37
1qag      | -1 T T -5 T 4 41 32 | 5 1 T -6 -1 T -13 -1 | TTTTTTTT
Secondary structure prediction results of the family of all-beta proteins
Nine beta strands of the following sequences are selected,
each sequence is threaded onto the others, and the
shifts from the real structures are shown below.
Template  | Target sequences
sequences | 1acx              | 1hzk               | 1noa                | 2mcm
1acx      | T T T T T T T T T | 1 T T T T -2 T T T | T T T T T -3 -1 T T | TTT2TT124
1hzk      | T T T T T T T T T | TTTTTTTTT          | TTTTT14TT           | T T T T -3 -3 T T T
1noa      | T T T T T T 1 T T | T T T T T -1 T T T | TTTTTT5TT           | T T T T -2 -2 T T T
2mcm      | T T T T T T T T T | TTTTTTTTT          | TTT1TTTTT           | T T T 1 T -1 T T T

Secondary structure prediction results of the family of alpha+beta proteins
Template  | Target sequences
sequences | 1dwn             | 1e6t               | 1frs               | 1qbe             | 1una
1dwn      | TT4TTTT4         | TTTTTTT5           | TTTTTTTT           | TTTTTTT5         | T T T -1 -1 T T 1
1e6t      | TTTTTTTT         | TTTTTTTT           | TTTTTTTT           | TTTTTTTT         | TTTTTTTT
1frs      | TTTTTTTT         | TTTTTTTT           | TTTTTTTT           | T T T T T T T -1 | TTTTTT1T
1qbe      | -1 T T T 1 T T 3 | T T -3 -11 T T T 4 | T T -3 -11 T T T 1 | TTTTTTTT         | -1 T T T T T T 1
1una      | TTT11TTT         | T T T -5 T T T T   | T T T -5 T T 1 T   | 1TTT2TTT         | TTTTTTTT
Conclusion for fitting to a
given fold
We obtained very good results for all-beta and
alpha+beta proteins.
All-alpha proteins generally gave good results, but
we had some shifts for the all-alpha structures.
The main reason for the alpha shifts was
that our all-alpha sequences had
very different lengths and highly variable
sequences, which lowered the contribution from
the profile scores.
Fold Classification Results
[Figure: 1ubi threading results. Energy values (-3000 to -1000) against protein ID (0 to 600). Labelled points: 1ubi, 1e0q, 1f9j, and other members of 1ubi's family.]
All Beta
[Figure: 1acx threading results. Energy values (-3000 to -1000) against protein ID (0 to 700). Labelled points: 1acx, 1klo, 1zfo, 1c01.]
All Alpha
[Figure: 1bhd threading results. Energy values (-3000 to -1000) against protein ID (0 to 700). Labelled points: 1bhd, 1hg6, 1qld, 2pcf, 1dfu.]
CONCLUSION
By optimising the fitting process with a
genetic algorithm and using a correct
target function, we have obtained quite
clear classifications at the family level.
It is also possible to use this method for
superfamily classification by adjusting only
the profile information and the weights.
We also applied the method to 6 CASP
proteins and correctly classified their
folds.
THANKS to
Esra Vural
Aydın Akyol
Zerrin Işık
Özgür Gül
HOMOLOGY MODELLING
Using database search algorithms, find the
sequence with known structure that best
matches the query sequence
Assign the structure of the core regions
obtained from the structure database to
the query sequence
Find the structure of the intervening loops
using loop closure algorithms
Homology Modeling: How it works
o Find template
o Align target sequence with template
o Generate model:
- add loops
- add sidechains
o Refine model
Prediction of Protein
Structures
Examples: a few good examples
[Figure: four pairs of actual and predicted structures]

Prediction of Protein
Structures
Not so good example

1esr
TURALIGN: Constrained
Structural Alignment Tool For
Structure Prediction
Motivation -1:
Structure based Alignment
Most alignment algorithms depend only on
sequence (Needleman-Wunsch &
Smith-Waterman)
Functional sites are usually mismatched
They fail to give the best alignment between
highly divergent sequences having very
similar structures
Motivation -2:
Structure prediction of novel
proteins
Using evolutionary information on
sequence confirmation
Secondary structure predictions and
possible locations of turns should be used
for threading
Preservation of favorable contacts
Methods
Motif Alignment Based on Dynamic Algorithm Approach
Recursive Smith-Waterman Local Alignment
Algorithm with Affine Gap Penalty
Secondary Structure Similarity Matrix
BLOSUM62
Position Specific Entropy Information
Filtering step using neighbourhood information
Jernigan Contact Potential Matrix
Motif Alignment Using
Dynamic Algorithm
Dynamic Algorithm
To reduce the possible matches of
motifs in the target protein,
fill in a 2D matrix A such that:
$$M(i,j) = \begin{cases} 1 & \text{if the } i\text{th motif} = \text{the } j\text{th motif} \\ 0 & \text{otherwise} \end{cases}$$
$$A(i,j) = \max_{k<i,\; l<j} \left\{ A(k,l) + A(i,j) \times L(i) \times \left( 30 - \left| E(i) - B(k) - B(j) + E(l) \right| \right) \right\}$$
E(i) : End position of ith motif
B(i) : Beginning of ith motif
L(i) : Length of ith motif
Tracing back: include the paths that have score > 0.9 × Max
Dynamic Algorithm
Functional sites and motifs in the template
protein can either be given as input to the
program or detected with the PROSITE scan* tool.
*Gattiker, A. et al. Applied Bioinformatics 2002; 1(2): 107-108.

Recursive Smith-Waterman Local
Alignment Algorithm with Affine
Gap Penalty
[Figure: recursive local alignment. Labels: pc, pL > 0.9 × pc, pR > 0.9 × pc; positions 47 and 50]
Gap Penalty
Build 3 matrices:
A for the matches;
B for the gaps on template;
C for gaps on target.
$$A(i,j) = \max_{X \in \{A,B,C\}} \{ X(i-1,j-1) + S(i,j) \}$$
$$B(i,j) = \max \{ A(i-1,j) + go + ge,\; B(i-1,j) + ge,\; C(i-1,j) + go + ge \}$$
$$C(i,j) = \max \{ A(i,j-1) + go + ge,\; B(i,j-1) + go + ge,\; C(i,j-1) + ge \}$$
S(i,j) : Pairwise similarity score
go : Gap opening penalty
ge : Gap extension penalty
Tracing back: include the paths that have score > 0.9 × Max
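The three-matrix recurrences can be sketched as a score-only local alignment. This is a minimal sketch under stated assumptions: penalties `go` and `ge` are negative numbers that are added, the `0.0` floor that lets a local alignment restart is the usual Smith-Waterman convention rather than something shown on the slide, and the traceback and the 0.9 × Max path filter are left out. The function name and the simple match/mismatch scorer are hypothetical.

```python
NEG_INF = float("-inf")


def local_affine_score(template, target, score, go=-10.0, ge=-1.0):
    """Best local alignment score using match (A) and two gap (B, C) matrices."""
    n, m = len(template), len(target)
    A = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    B = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    C = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # 0.0 lets a fresh local alignment start at (i, j).
            prev = max(A[i - 1][j - 1], B[i - 1][j - 1], C[i - 1][j - 1], 0.0)
            A[i][j] = prev + score(template[i - 1], target[j - 1])
            B[i][j] = max(A[i - 1][j] + go + ge,
                          B[i - 1][j] + ge,
                          C[i - 1][j] + go + ge)
            C[i][j] = max(A[i][j - 1] + go + ge,
                          B[i][j - 1] + go + ge,
                          C[i][j - 1] + ge)
            best = max(best, A[i][j])
    return best


def simple_score(a, b):
    """Placeholder for S(i,j): +2 for identical residues, -1 otherwise."""
    return 2.0 if a == b else -1.0


if __name__ == "__main__":
    print(local_affine_score("AAACCC", "AAAGCCC", simple_score, go=-1.0, ge=-1.0))
```

In the method described here, `score` would be the combined similarity S(i,j) rather than a fixed match/mismatch value.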
Gap Penalty
$$S(i,j) = sc \times SSS(i,j) + ac \times SS(i,j) + tc \times TS(i,j)$$
SSS(i,j) : Secondary Structure Similarity
SS(i,j) : Sequence Similarity
TS(i,j) : Turn Similarity
sc : Secondary Structure Similarity Coefficient
ac : Sequence Similarity Coefficient
tc : Turn Similarity Coefficient
Secondary Structure
Similarity
Secondary Structure Similarity Matrix* S:
      H     E     L
H     2   -15    -4
E   -15     4    -4
L    -4    -4     2
Secondary structure profile of the target (secondary structure prediction servers):
      H     H     L
H:  0.7   0.5   0.0
E:  0.2   0.4   0.3
L:  0.1   0.1   0.6
$$SSS(i,j) = sc \times \sum_{k=1}^{3} S(T(i),k) \times P(k,j)$$
T(i) : Secondary Structure of Template at position i
P(., j) : Secondary Structure profile of Target at position j
sc : Secondary Structure Similarity Coefficient
*Wallqvist, A. et al. Bioinformatics. 2000 Nov;16(11):988-1002.
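A worked instance of the SSS formula, assuming the flattened numbers on the slide read as a 3 × 3 similarity matrix over H, E, L and a per-position probability profile for the target. The function name `sss` and the dictionary layout are choices made for the example.

```python
STATES = ("H", "E", "L")

# Similarity matrix as read from the slide: S_MATRIX[template_state][target_state].
S_MATRIX = {
    "H": {"H": 2, "E": -15, "L": -4},
    "E": {"H": -15, "E": 4, "L": -4},
    "L": {"H": -4, "E": -4, "L": 2},
}


def sss(template_state, target_profile, sc=1.0):
    """sc * sum over k of S(T(i), k) * P(k, j) for one template/target position pair."""
    return sc * sum(S_MATRIX[template_state][k] * target_profile[k] for k in STATES)


if __name__ == "__main__":
    # First profile column from the slide: P(H)=0.7, P(E)=0.2, P(L)=0.1.
    column = {"H": 0.7, "E": 0.2, "L": 0.1}
    # Template helix against this column: 2*0.7 - 15*0.2 - 4*0.1
    print(sss("H", column))
```

With this matrix even a modest strand probability pulls a helix match negative, because the H/E mismatch penalty of -15 is much larger than the match reward.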
Sequence Similarity
Multiple Sequence Alignment of Template Protein's family*
...ALVKLI...
...A-IEII...
...AL-KLI...
$$S(i) = -\sum_{a=1}^{20} P(a,i) \times \log P(a,i)$$
$$C(i) = \frac{1}{1 + S(i)}$$
S(i) : Entropy at position i of template
P : Family Profile Matrix (P(a,i) is the entry for residue type a at position i)
C(i) : Conservation score at position i of template
$$SS(i,j) = ac \times C(i) \times \mathrm{BLOSUM62}(i,j)$$

*Glaser, F. et al. Bioinformatics 19:163-164 (2003)
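The entropy and conservation score can be computed per alignment column as sketched below. The natural logarithm, the decision to ignore gap characters, and the function names are assumptions; the slide does not specify them.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy of the residue frequencies in one alignment column."""
    residues = [r for r in column if r != "-"]   # assumption: gaps are ignored
    counts = Counter(residues)
    n = len(residues)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def conservation(column):
    """C = 1 / (1 + S): 1.0 for a fully conserved column, lower as entropy grows."""
    return 1.0 / (1.0 + column_entropy(column))


if __name__ == "__main__":
    alignment = ["ALVKLI", "A-IEII", "AL-KLI"]   # the three rows shown on the slide
    for pos in range(len(alignment[0])):
        column = "".join(row[pos] for row in alignment)
        print(pos, column, round(conservation(column), 3))
```

The resulting C(i) would then multiply the substitution score, as in SS(i,j) above, so conserved template positions weigh more.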
Turn Similarity
Turn profile of the target (turn prediction servers):
      T     T     N
T:  0.7   0.5   0.0
N:  0.3   0.5   1.0
$$TS(i,j) = tc \times 4 \times T(i) \times P(T,j)$$
T(i) = 1 if template position i is a turn (T); else 0
P(., j) : Turn profile of Target at position j
tc : Turn Similarity Coefficient
Gap Penalties
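Gap opposite a loop residue:
...L...
...-...
$$go = \frac{2}{3} \times go, \qquad ge = \frac{2}{3} \times ge$$
Gap opposite a helix or strand residue:
...H/E...
... - ...
$$gapSec = -20$$
And vice versa...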

Filtering
For each of the motif
alignments get the 25
best alignments
Build a connectivity map
of template protein and
thread onto target.
$$CS = cs \times \sum_{i=1,\; j<i} -\Gamma(i,j) \times J(i,j)$$
Γ : Kirchhoff Matrix
J : Jernigan Contact Potential Matrix*
$$\Gamma(i,j) = \begin{cases} -1 & \text{if } i \neq j \wedge R_{ij} \le 7.3\,\text{Å} \\ 0 & \text{if } i \neq j \wedge R_{ij} > 7.3\,\text{Å} \\ -\sum_{k,\; k \neq i} \Gamma(i,k) & \text{if } i = j \end{cases}$$
Get the best 25 alignments
according to the score:
$$TS = S + CS$$
*Miyazawa S, Jernigan R L. (1985) Macromolecules 18:534–552.
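The connectivity map and contact score can be sketched as follows: build the Kirchhoff matrix from template coordinates with the 7.3 Å cutoff, then sum the contact potential of the threaded target residues over every connected pair. The function names are hypothetical, and `potential` is a stand-in callable; no values from the Jernigan matrix are reproduced here.

```python
import math


def kirchhoff(coords, cutoff=7.3):
    """Gamma[i][j] = -1 for pairs within the cutoff, 0 otherwise; diagonal = contact count."""
    n = len(coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if math.dist(coords[i], coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
    for i in range(n):
        gamma[i][i] = -sum(gamma[i][k] for k in range(n) if k != i)
    return gamma


def contact_score(gamma, target_residues, potential, cs=1.0):
    """CS = cs * sum over j < i of -Gamma(i,j) * J(residue_i, residue_j)."""
    total = 0.0
    for i in range(len(target_residues)):
        for j in range(i):
            total += -gamma[i][j] * potential(target_residues[i], target_residues[j])
    return cs * total


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
    gamma = kirchhoff(coords)
    print(gamma)
    # Stand-in potential: every contact contributes -1.0.
    print(contact_score(gamma, "LKL", lambda a, b: -1.0))
```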
RESULTS
To test our program we have chosen 3
families from the ASTRAL40* protein list.
Citrate synthase: 1csh, 1iomA, 1k3pA
Methionine aminopeptidase: 1b6a, 1xgsA
Methyltransferase: 1fp2A, 1fp1D
Testing measure: RMSD between the
predicted and actual structure of the target.
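For reference, RMSD over matched atoms can be computed as below. This sketch assumes the two structures are already superimposed and contain the same atoms in the same order; an optimal superposition step (for example the Kabsch algorithm) is not included, and the function name is an assumption.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    actual = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    # One atom is displaced by 2 along z, so RMSD = sqrt(4 / 2).
    print(rmsd(actual, predicted))
```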
RESULTS
For all the experiments done, our algorithm perfectly matched
the functional sites and motifs given as input to the program.
1csh vs 1iomA: RMSD = 2.50
1csh vs 1k3pA: RMSD = 2.12
1k3pA vs 1iomA: RMSD = 3.03
1b6a vs 1xgsA: RMSD = 2.23
1fp2A vs 1fp1D: RMSD = 2.98
On average over the 5 experiments, the best result was
RMSD = 2.57, obtained with ac = 0.4, sc = 0.4, tc = 0.2, cc = 0
User Interface of TURALIGN
DOMAIN INTERACTIONS
Thanks to
Tural Aksel
Bora Uyar
Eylül Harputlugil
