You are on page 1of 65

Computational Methods in

Molecular Modelling

Uğur Sezerman
Biological Sciences and Bioengineering Program
Sabancı University, Istanbul
Motivation

Knowing the structure of molecules


enables us to understand its mechanism
of function
Current experimental techniques
X-ray cystallography
NMR
PROTEIN FOLDING
PROBLEM
STARTING FROM AMINO ACID SEQUENCE
FINDING THE STRUCTURE OF PROTEINS IS
CALLED THE PROTEIN FOLDING PROBLEM
Forces driving protein
folding

It is believed that hydrophobic collapse is


a key driving force for protein folding
Hydrophobic core
Polar surface interacting with solvent
Minimum volume (no cavities) Van der
Walls
Disulfide bond formation stabilizes
Hydrogen bonds
Polar and electrostatic interactions
SECONDARY STRUCTURE
PREDICTION

Intro. To Struc.
(Tooze and Branden)
Secondary Structure
Prediction
AGVGTVPMTAYGNDIQYYGQVT…
A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…

AGVGTVPMTAYGNDIQYYGQVT…
----hhhHHHHHHhhh--eeEE…
Chou-Fasman Parameters
Name Abbrv P(a) P(b) P(turn) f(i) f(i+1) f(i+2) f(i+3)
Alanine A 142 83 66 0.06 0.076 0.035 0.058
Arginine R 98 93 95 0.07 0.106 0.099 0.085
Aspartic Acid D 101 54 146 0.147 0.11 0.179 0.081
Asparagine N 67 89 156 0.161 0.083 0.191 0.091
Cysteine C 70 119 119 0.149 0.05 0.117 0.128
Glutamic Acid E 151 37 74 0.056 0.06 0.077 0.064
Glutamine Q 111 110 98 0.074 0.098 0.037 0.098
Glycine G 57 75 156 0.102 0.085 0.19 0.152
Histidine H 100 87 95 0.14 0.047 0.093 0.054
Isoleucine I 108 160 47 0.043 0.034 0.013 0.056
Leucine L 121 130 59 0.061 0.025 0.036 0.07
Lysine K 114 74 101 0.055 0.115 0.072 0.095
Methionine M 145 105 60 0.068 0.082 0.014 0.055
Phenylalanine F 113 138 60 0.059 0.041 0.065 0.065
Proline P 57 55 152 0.102 0.301 0.034 0.068
Serine S 77 75 143 0.12 0.139 0.125 0.106
Threonine T 83 119 96 0.086 0.108 0.065 0.079
Tryptophan W 108 137 96 0.077 0.013 0.064 0.167
Tyrosine Y 69 147 114 0.082 0.065 0.114 0.125
Valine V 106 170 50 0.062 0.048 0.028 0.053
Computational Approaches

 Ab initio methods
 Threading
 Comperative Modelling
 Fragment Assembly
Ab-initio protein structure prediction as
an optimization problem

1. Define a function that map protein


structures to some quality measure.
2. Solve the computational problem of
finding an optimal structure.
3. ☺
energy

Chen Keasar
conformation
BGU
A dream function
☺ Has a clear minimum in the native structure.
☺ Has a clear path towards the minimum.
☺ Global optimization algorithm should find the
native structure.

Chen Keasar
BGU
An approximate function
☺ Easier to design and compute.
 Native structure not always the global minimum.
 Global optimization methods do not converge. Many
alternative models (decoys) should be generated.
 No clear way of choosing among them.

Decoy set
Chen Keasar
BGU
Fold Optimization

Simple lattice models


(HP-models)
Two types of residues:
hydrophobic and polar
2-D or 3-D lattice
The only force is
hydrophobic collapse
Score = number of H−H
contacts
Scoring Lattice Models

H/P model scoring:

Sometimes:
Penalize for buried polar or surface
hydrophobic residues
Learning from Lattice
Models

Ken Dill ~ 1997

Hydrophobic zipper effect


Basic element
electrons &
protons

atom
extended
atom
half a
residue

residue

Some Hinds &


residues Levitt

diamond torsion fine square fragments continuous


lattice angle lattice lattice
Chen Keasar
BGU
What can we do with
lattice models?

For smaller polypeptides, exhaustive


search can be used
Looking at the “best” fold, even in such a
simple model, can teach us interesting things
about the protein folding process
For larger chains, other optimization and
search methods must be used
Greedy, branch and bound
Evolutionary computing, simulated annealing
Graph theoretical methods
Inverse Protein Folding
Problem
Given a structure (or a functionality) identify
an amino acid sequence whose fold will be
that structure (exhibit that functionality).

Crucial problem in drug design.


NP-hard under most models.
PROTEIN THREADING

Thread the given sequence to the


different structural families exist in
structural databases
Choose the optimum structure based on
the potential energy function ( contact
potential, free energy, e.g.) used
Threading: Fold
recognition
Given:
Sequence:
IVACIVSTEYDVMKAAR

A database of
molecular coordinates
Map the sequence
onto each fold
Evaluate
Objective 1: improve
scoring function
Objective 2: folding
Protein Fold Families
(CATH,SCOP)

 CATH website
www.cathdb.info
Genetic Algorithm used as
a search tool
 We are searching for the minima of our fitness function composed of
profile and contact energy terms.
 In this problem value encoding have been used. Parents are represented as
strings of positions. Population Size is 50.
 A sample parent (string of positions) is figured below:
12345 10 11 12 13 14 23 24 25 26 27 28 29 30 31 32 55 56 57 58
 Branch and Bound algorithm have been used to produce random initial
parents.
 Mutation:
Mutation operator is the shifting of the structure’s position either to the right
or left by some units.
 Crossover:
Two-point cross-over is applied where , selected suitable structures are
exchanged between two parents.
Our Aim

In this research, we have threaded a


structurally unknown protein sequence to
over 2200 SCOP family fold proteins and
sought the best fitting structural family.
We also tried to find the optimum fit of
the query sequence to a given fold.
Fitness Function

Energy function is a combination of


The sequence profile energy
Contact Potential energy (inter & intra
structural residues are taken into account)

TotalEnergy= p1 ( ProfileEnergy ) + c1(ContactEnergy)

The weights are chosen such that the contributing


energy from profile and contact energy terms will be
equal.
Profile Energy

 We do structural alignment on all selected secondary


structural units of the sequences.
 Same numbered secondary structural units are
selected.
 Length of the units may differ.
-- P E E L L L R W A N F H L E N ( 1aoa)
-- S E K I L L K W V R Q T -- -- -- (1qag)
N S E K I L L S W V R Q S T R -- (1dxx)

Sixth helices of the selected all-alfa sequences


Profile Matrix calculated
from a structure group
Residue Names

 A C D E F G H I K L M N P Q R S T V W Y -
-0.33 -0.67 0.68 0.01 -1.33 0.01 0.34 -1 0.01 -1.33 -0.67 2.34 -0.67 0.01 -0.33 0.34 0.01 -1 -1.33 -0.67 4.01
0.34 -2 -0.33 -1 -3.33 -0.67 -1.33 -3 -0.33 -3.33 -2.33 0.01 2.68 -0.33 -1.67 3.01 1.01 -2.33 -4 -2.33 0.01
-1 -3 2.01 6.01 -3 -3 0.01 -4 1.01 -3 -2 0.01 -1 2.01 0.01 -1 -1 -3 -3 -2 0.01
-1 -3 0.01 2.68 -3.67 -2.33 0.01 -3.33 4.34 -3 -2 0.01 -1 2.01 2.01 -0.33 -1 -3 -3 -2 0.01
-1.33 -2 -4 -3.67 0.34 -4 -3.67 4.01 -3 3.01 2.34 -3.33 -3.33 -2.67 -3.67 -3 -1 3.01 -2.67 -1 0.01
-2 -2 -4 -3 1.01 -4 -3 2.01 -3 5.01 3.01 -4 -4 -2 -3 -3 -1 1.01 -2 -1 0.01
-2 -2 -4 -3 1.01 -4 -3 2.01 -3 5.01 3.01 -4 -4 -2 -3 -3 -1 1.01 -2 -1 0.01
0.01 -2 -0.67 -0.67 -3 -1 -0.67 -3.33 1.01 -3 -2 0.34 -1.67 0.34 1.68 3.01 1.01 -2.33 -3.67 -1.67 0.01
-3 -5 -5 -3 1.01 -3 -3 -3 -3 -2 -1 -4 -4 -1 -3 -4 -3 -3 15.01 2.01 0.01
1.68 -1 -3.33 -2.33 -1.67 -2.67 -3.33 2.34 -2.33 0.01 0.34 -2.33 -2.33 -2.33 -2.67 -1 0.01 3.34 -3 -1.33 0.01
-1.67 -3.33 -0.67 0.01 -3.33 -2 0.34 -3.67 2.01 -3.33 -2 1.68 -2.67 0.68 4.34 -0.33 -0.67 -3 -3.33 -1.33 0.01
-1.67 -2.67 -1.67 0.34 0.01 -2.67 0.34 -2 0.01 -1 0.01 -1.33 -2 3.34 -0.33 -1 -1.33 -2.33 -0.33 0.68 0.01
-0.33 -1.67 -0.67 -0.67 -2 -1.33 2.34 -2.67 -0.33 -2.33 -1.33 0.68 -1.33 0.01 -0.67 2.01 1.68 -2 -3.33 -0.67 0.01
-0.67 -1 -1.67 -1.33 -0.33 -2 -1.67 0.34 -1.33 1.34 0.68 -1.33 -1.67 -1 -1.33 -0.33 1.34 0.34 -1.67 -1 2.01
-1 -2.33 0.01 2.01 -2 -2 0.01 -2.67 1.34 -2 -1.33 -0.33 -1.33 1.01 2.34 -0.67 -0.67 -2 -2 -1 2.01
-0.33 -0.67 0.68 0.01 -1.33 0.01 0.34 -1 0.01 -1.33 -0.67 2.34 -0.67 0.01 -0.33 0.34 0.01 -1 -1.33 -0.67 4.01

Positions
Profile scores
Contact Potential Energy

Based on the counts of frequency of


contacts in a database of known
structures converted into energy values.
In this study, contact potential energy is
the sum of energies of the residues that
are closer than seven angstroms in
distance to each other.
Jernigan’s & Dill’s Contact Potential
Energy Tables have been used.
Selected Benchmark Set
All Alfa Set :1aoa,1dxx,1qag
Fold: Calponin-homology domain, CH-domain core: 4 helices: bundle
Superfamily: Calponin-homology domain, CH-domain
Family: Calponin-homology domain, CH-domain
All Beta Set :1acx,1hzk,1noa,2mcm
Fold: Immunoglobulin-like beta-sandwich sandwich; 7 strands in 2 sheets
Superfamily: Actinoxanthin-like
Family: Actinoxanthin-like
Alfa+Beta Set : 1dwn,1e6t,1frs,1qbe,1una
Fold: RNA bacteriophage capsid protein
6-standed beta-sheet followed with 2 helices; meander
Superfamily: RNA bacteriophage capsid protein
Family: RNA bacteriophage capsid protein
Secondary structure
prediction results of the
family of all alfa proteins


Eight helixes of the following sequences are selected and
each sequence is threaded to the other one and the shifts
from the real structures are shown below.

Target Sequences

1aoa 1dxx 1qag


1aoa T T T T T T T 30 T T T -6 -1 -1 T 27 1 T T T T 12 T T
Template
1dxx T T T -4 1 5 4 9 T -3 T -5 T T T T 3 T T T 1 T 41 37
sequences
1qag -1 T T -5 T 4 41 32 5 1 T -6 -1 T -13 -1 TTTTTTTT
Secondary structure
prediction results of the
family of all beta proteins

 Nine beta sheets of the following sequences are selected


and each sequence is threaded to the other one and the
shifts from the real structures are shown below.
Target Sequences

1acx 1hzk 1noa 2mcm

Template 1acx T T T T T T T T T 1 T T T T -2 T T T T T T T T -3 -1 T T TTT2TT124


sequences 1hzk T T T T T T T T T TTTTTTTTT TTTTT14TT T T T T -3 -3 T T T

1noa T T T T T T 1 T T T T T T T -1 T T T TTTTTT5TT T T T T -2 -2 T T T

2mcm T T T T T T T T T TTTTTTTTT TTT1TTTTT T T T 1 T -1 T T T


prediction results of the
family of
alfa-beta proteins


Target Sequences

1dwn 1e6t 1frs 1qbe 1una

1dwn TT4TTTT4 TTTTTTT5 TTTTTTTT TTTTTTT5 T T T -1 -1 T T 1

1e6t TTTTTTTT TTTTTTTT TTTTTTTT TTTTTTTT TTTTTTTT

1frs TTTTTTTT TTTTTTTT TTTTTTTT T T T T T T T -1 TTTTTT1T


1qbe -1 T T T 1 T T 3 T T -3 -11 T T T 4 T T -3 -11 T T T 1 TTTTTTTT -1 T T T T T T 1

1una TTT11TTT T T T -5 T T T T T T T -5 T T 1 T 1TTT2TTT TTTTTTTT

Template
sequences
Conclusion for fitting to a
given fold

We obtained very good results for all-beta and


alfa+beta proteins .
All alfa proteins gave good results generally but
we had some shifts for the all alfa structures.
 The main reason for the alfa shifts was mainly
due to the fact that our all-alfa sequences had a
very different lenghts and highly variable
sequences which lowered the contribution from
the profile scores.
Fold Classification Results
1ubi Threading Results

-1000

-1200

-1400

-1600

-1800
Energy Values

-2000

-2200

-2400

Other members
-2600 of 1ubi's family
1e0q
-2800 1ubi
1f9j

-3000
0 100 200 300 400 500 600
Protein ID
All Beta
1acx Threading Results

-1000

-1200

-1400

-1600

-1800
Energy Values

-2000

-2200

1klo 1zfo
-2400 1c01

-2600

-2800
1acx

-3000
0 100 200 300 400 500 600 700
Protein ID
All Alpha
1bhd Threading Result

-1000

-1200

-1400

-1600

-1800
Energy Values

-2000

-2200
1hg6 1qld 2pcf
1dfu
-2400

-2600 1bhd

-2800

-3000
0 100 200 300 400 500 600 700
Protein ID
CONCLUSION

By optimising the fitting process with


genetic algorithm and using a correct
target function we have obtained quite
clear classifications in the base of families.
It is also possible to use this method for
superfamily classification by adjusting only
profile information and weights.
We also applied the method to 6 CASP
proteins and correctly classified their
folds.
THANKS to
Esra Vural
Aydın Akyol
Zerrin Işık
Özgür Gül
HOMOLOGY MODELLING

Using database search algorithms find the


sequence with known structure that best
matches the query sequence
Assign the structure of the core regions
obtained from the structure database to
the query sequence
Find the structure of the intervening loops
using loop closure algorithms
Homology Modeling: How it works

o Find template

o Align target sequence


with template

o Generate model:
- add loops
- add sidechains

o Refine model
Prediction of Protein
Structures

 Examples – a few good examples

actual predicted actual predicted

actual predicted actual predicted


Prediction of Protein
Structures

 Not so good example


1esr
TURALIGN: Constrained
Structural Alignment Tool For
Structure Prediction
Motivation -1:
Structure based Alignment

Most of the alignment algorithms are only


sequence dependent (Needleman-Wunsch &
Smith-Waterman )
Functional sites are usually mismatched
Fail to give the best alignment between
highly divergent sequences having very
similar structures
Motivation -2:
Structure prediction of novel
proteins
Using evolutionary information on
sequence confirmation
Secondary structure predictions and
possible locations of turns should be used
for threading
Preservation of favorable contacts
Methods

Motif Alignment Based on Dynamic Algorithm


Approach
 Recursive Smith-Waterman Local Alignment
Algorithm with Affine Gap Penalty
Secondary Structure Similarity Matrix
BLOSSUM 62
Position Specific Entropy Information
Filtering step using neighbourhood information
Jernigan Contact Potential Matrix
Motif Alignment Using
Dynamic Algorithm
Motif Alignment Using
Dynamic Algorithm
In order to reduce possible matches of
motifs in target protein
Fill in a 2D matrix A in a way that:
E(i) : End position of ith motif
B(i) : Beginning of ith motif
L(i) : Length of ith motif
 Tracing back : Include the paths that have score > 0.9xMax

–M(i,j) = 1 , if i th motif = j th motif , else 0


– A(i, j ) = max k <i ,l < j {A(k , l ) + A(i,j) × L(i) × ( 30 – |E(i)-B(k)-B(j) + E(l)|)}
Motif Alignment Using
Dynamic Algorithm

Functional sites and motifs in template


protein can be either given as input to the
program or prosite scan* tool is used to
detect the motifs.

*Gattiker,A et.al. Bioinformatics 2002:1(2) 107-108.


Recursive Smith-Waterman Local
Alignment Algorithm with Affine
Gap Penalty
pL>0.9xpc

pL>0.9xpc
pc

pc
50

47 pR>0.9xpc

pR>0.9xpc
Recursive Smith-Waterman Local
Alignment Algorithm with Affine
Gap Penalty
•A(i, j ) = max X ∈ { A, B , C } { X (i-1, j-1) + S(i,j)}
•B(i, j ) = max { A(i-1, j ) + go + ge, B(i-1, j ) + ge, C (i-1, j ) + go + ge}
•C (i, j ) = max { A(i, j-1) + go + ge, B(i, j-1) + go + ge, C (i, j-1) + ge}

 Build 3 matrices:
A for the matches;
B for the gaps on template;
C for gaps on target.
⌧S(i,j) : Pairwise Similarity Score
⌧go : Gap opening penalty
⌧ge : Gap extension penalty
 Tracing back : Include the paths that have score > 0.9xMax
Recursive Smith-Waterman Local
Alignment Algorithm with Affine
Gap Penalty
S(i,j) = sc × SSS(i,j) + ac × SS(i,j) + tc × TS(i,j)

SSS(i,j) : Secondary Structure Similarity


SS(i,j) : Sequence Similarity
TS(i,j) : Turn Similarity
⌧sc : Secondary Structure Similarity Coefficient
⌧ac : Sequence Similarity Coefficient
⌧tc : Turn Similarity Coefficient
Secondary Structure
Similarity
S H E L
H H L
H:0.7 0.5 0.0 H 2 -15 -4
E:0.2 0.4 0.3 E -15 4 -4
L:0.1 0.1 0.6 L -4 -4 2

Secondary Structure Similarity Matrix*


Secondary Structure Prediction Servers
3
SSS (i , j ) = sc × ∑ S (T (i ), k ) • P ( k , i )
k =1
T (i ) : Secondary Structure of Template at position i
P (., j ) : Secondary Structure profile of Target at position j
sc : Secondary Structure Similarity Coefficient
*Wallqvist,A et al. Bioinformatics. 2000 Nov;16(11):988-1002.
Sequence Similarity
Multiple Sequence Alignment
of
Template Protein’s family* 20

...ALVKLI...
S ( j ) = −∑ P(i, j ) × log P(i, j )
i =1
...A-IEII...
...AL-KLI... 1
C (i ) =
1 + S (i )
S (i ) : Entropy at position j of template
P : Family Profile Matrix
C (i ) : Conservati on score at position j of template

SS (i, j ) =ac × C (i ) × BLOSSUM 62(i, j )


*Glaser,F. Et al. Bioinformatics 19:163-164(2003)
Turn Similarity
T T N
T:0.7 0.5 0.0
N:0.3 0.5 1.0
TS (i, j ) = tc × 4 × T (i ) × P(T , j )

T (i ) = 1 if i = T; else 0
Turn Prediction Servers
P(., j ) : Turn profile of Target at position j
tc : Turn Similarity Coefficient
Gap Penalties
...L... 2
...-... go = − × go
3
2
ge = − × ge
3

...H/E... gapSec = −20


... - ...

And vice versa...


Filtering

 For each of the motif


alignments get the 25
best alignments
 Build a connectivity map
of template protein and CS = cs × ∑ − Γ (i , j ) × J (i , j )
i =1, j < i
thread onto target. Γ : Kirchoff Matrix
J : Jernigan Contact Potential Matrix *
 
 − 1 if i ≠ j ∧ Rij ≤ 7 .3Å 
Get the best 25 alignments  
Γ (i , j ) = 0 if i ≠ j ∧ Rij > 7 .3Å 
According to the score:  
- ∑ Γ (i , j ) if i = j 
 i, i ≠ j 
TS = S + CS
*Miyazawa S, Jernigan R L.(1983) Macromolecules ;18:534–552.
RESULTS

To test our program we have chosen 3


families from ASTRAL40* protein list.
Citrate Synthase : 1csh,1iomA,1k3pA
Methionine aminopeptidase:1b6a,1xgsA
Methyltransferase:1fp2A,1fp1D
As testing measure: RMSD between the
predicted and actual structure of target.
RESULTS
 For all the experiments done, our algorithm perfectly matched
functional sites and motifs given as input to the program.
1csh vs 1iomA :
⌧RMSD = 2.50
1csh vs 1k3pA
⌧RMSD = 2.12
1k3pA vs 1iomA
⌧RMSD = 3.03
1b6a vs 1xgsA
⌧RMSD = 2.23
1fp2A vs 1fp1D
⌧RMSD = 2.98
 At average we got the best results for 5 experiments:
⌧RMSD = 2.57 with ac:0.4,sc:0.4,tc:0.2,cc:0
User Interface of TURALIGN
DOMAIN INTERACTIONS
Thanks to

 Tural Aksel
 Bora Uyar
 Eylül Harputlugil

You might also like