
A United-Residue Force Field for Off-Lattice Protein-Structure Simulations. I. Functional Forms and Parameters of Long-Range Side-Chain Interaction Potentials from Protein Crystal Data

A. LIWO,1,2 S. OŁDZIEJ,1 M. R. PINCUS,3 R. J. WAWAK,2 S. RACKOVSKY,4 H. A. SCHERAGA2

1 Department of Chemistry, University of Gdańsk, ul. Sobieskiego 18, 80-952 Gdańsk, Poland
2 Baker Laboratory of Chemistry, Cornell University, Ithaca, New York 14853-1301
3 Department of Pathology, Brooklyn Veterans Administration Medical Center, Brooklyn, New York 11209 and State University of New York, Health Science Center, Brooklyn, New York 11203
4 Department of Biomathematical Sciences, Mount Sinai School of Medicine, One Gustave L. Levy Place, New York, New York 10029

Received 7 June 1996; accepted 11 September 1996

ABSTRACT: A two-stage procedure for the determination of a united-residue potential designed for protein simulations is outlined. In the first stage, the long-range and local-interaction energy terms of the total energy of a polypeptide chain are determined by analyzing protein-crystal data and averaging the all-atom energy surfaces. In the second stage (described in the accompanying article), the relative weights of the energy terms are optimized so as to locate the native structures of selected test proteins as the lowest energy structures. The

Correspondence to: H. A. Scheraga; e-mail: has5@cornell.edu
This article includes Supplementary Material available from the authors upon request or via the Internet at ftp.wiley.com/public/journals/jcc/suppmat/18/849 or http://www.journals.wiley.com/jcc
Contract grant sponsors: Polish State Committee for Scientific Research, contract grant number: PB 190/T09/96/10; National Institute on Aging, contract grant number: AG 00322; National Institute of General Medical Sciences, contract grant number: GM 14312; National Science Foundation, contract grant number: MCB95-13167; National Cancer Institute, contract grant number: CA 42500

© 1997 by John Wiley & Sons, Inc. CCC 0192-8651/97/070849-25


LIWO ET AL.

goal of the work in the present study is to parameterize physically reasonable functional forms of the potentials of mean force for side-chain interactions. The potentials are of both radial and anisotropic type. Radial potentials include the Lennard-Jones and the shifted Lennard-Jones potential (with the shift parameter independent of orientation). To treat the angular dependence of side-chain interactions, three functional forms of the potential that were designed previously to describe anisotropic systems are evaluated: Berne-Pechukas (dilated Lennard-Jones); Gay-Berne (shifted Lennard-Jones with orientation-dependent shift parameters); and Gay-Berne-Vorobjev (the same as the preceding one, but with one more set of variable parameters). These functional forms were used to parameterize, within a short-distance range, the potentials of mean force for side-chain pair interactions that are related by the Boltzmann principle to the pair correlation functions determined from protein-crystal data. Parameter determination was formulated as a generalized nonlinear least-squares problem with the target function being the weighted sum of squares of the differences between calculated and "experimental" (i.e., estimated from protein-crystal data) angular, radial-angular, and radial pair correlation functions, as well as contact free energies. A set of 195 high-resolution nonhomologous structures from the Protein Data Bank was used to calculate the "experimental" values. The contact free energies were scaled by the slope of the correlation line between side-chain hydrophobicities, calculated from the contact free energies, and those determined by Fauchère and Pliška from the partition coefficients of amino acids between water and n-octanol. The methylene group served to define the reference contact free energy corresponding to that between the glycine methylene groups of backbone residues. Statistical analysis of the goodness of fit revealed that the Gay-Berne-Vorobjev anisotropic potential fits best to the experimental radial and angular correlation functions and contact free energies and therefore represents the free-energy surface of side-chain-side-chain interactions most accurately. Thus, its choice for simulations of protein structure is probably the most appropriate. However, the use of simpler functional forms is recommended if the speed of computations is an issue.
© 1997 by John Wiley & Sons, Inc. J Comput Chem 18: 849-873, 1997

Keywords: protein structure prediction; united-residue representation of a polypeptide chain; potential of mean force; radial and angular distribution functions

Introduction

The force fields that use a representation of amino-acid residues as one or two interaction sites, hereafter referred to as united-residue potentials, have long been of interest in theoretical simulations of protein structure.1-39 The primary reason for this is that they involve much less computational effort than all-atom or united-atom representations of the polypeptide chain. This is especially important in protein-structure prediction, where extensive search of the conformational space of the polypeptide chain is required to locate its global minimum energy. After the global minimum energy has been found for the simplified chain, it can be converted to the all-atom chain, and limited exploration of the conformational space of the all-atom chain can then be carried out to locate the global minimum in the all-atom representation. Such a protocol has recently been developed and implemented with considerable success by Skolnick et al. in predicting the three-dimensional structures of model monomeric helical proteins,15,16,18 crambin (which also contains a β-sheet section),18 and the dimeric GCN4 leucine zipper.19 These investigators used an on-lattice representation of the polypeptide chains to obtain united-residue structures which were then converted to full-atom chains by using a set of statistical rules determined from the Protein Data Bank. In our recent reports, we have described a similar procedure, based, however, on an off-lattice model of
850 VOL. 18, NO. 7


PROTEIN-STRUCTURE SIMULATIONS. I

polypeptide chains38,39 and a dipole-path method (based on an optimal hydrogen-bond network) to convert the α-carbon trace to an all-atom backbone.38 This procedure succeeded in predicting the three-dimensional structure of the avian pancreatic polypeptide.

As mentioned previously, there are two ways to explore the conformational space of polypeptide chains with the use of a united-residue potential: the on-lattice and the off-lattice approach. In the first case, the polypeptide chain is superposed on a discrete lattice, and the number of possible conformations is, therefore, finite. In the simplest approach, the interaction potential is reduced to a set of residue-residue contact free energies.8-10 The rationale for such an approach was based on the assumption that side-chain packing is the principal driving force in protein folding; more recent studies, however, have shown that this assumption is probably not true.13 The recent approach developed in Skolnick's group incorporates many different interactions that can be responsible for protein folding: side-chain packing; local interactions; hydrogen bonding; surface energy; and cooperativity in side-chain packing and hydrogen bonding.12-19 The contact and hydrogen-bonding energies depend on the distance and orientation of the interacting sites. The resulting force field was able to locate the near-native structures of a number of test proteins as the lowest energy ones.14-16,18,19 The parameters of the potentials for on-lattice simulations were determined from a statistical analysis of the distributions of interacting sites obtained from the crystal data of known proteins. Because the aforementioned force field expresses most of the energy components as analytical functions of geometry, it can also be used for off-lattice simulations.

For the sake of completeness, we mention here simple lattice models of proteins in which contact free energies and other interaction parameters are assigned arbitrary values (usually three types of contacts are chosen: hydrophobic-hydrophobic; hydrophobic-hydrophilic; and hydrophilic-hydrophilic); however, such models were used to study general statistical-mechanical characteristics of polypeptide chains and the folding process, but have not yet been used for predicting the three-dimensional structures of real proteins.40-42

The united-residue potentials for off-lattice simulations have an even longer history than the on-lattice ones.1-7,24-37,39 They have also been used with considerable success to predict the three-dimensional structure of known proteins.28-31,35,37,38 In contrast to the on-lattice potentials, they are functions of continuous variables. Therefore, the off-lattice approach to protein folding enables local energy minimization to be carried out for generated structures. Thus, many powerful techniques for global-search minimization, such as Monte Carlo with Minimization (MCM),43,44 the diffusion equation method (DEM),45 or the self-consistent mean torsional field (SCMTF)46 method can be applied. This was the rationale for choosing the off-lattice potential in the present work.

The present work was aimed at determining the long-range potential for side-chain interactions. We parameterized several functional forms for the interaction potential that also include angular dependence. This was motivated by the results of a preliminary analysis of the average dimensions of the side chains as calculated from the Protein Data Bank, which show that "geometrical" anisotropy (which can be defined as the ratio of the long axis of a side chain to the geometric average of the two shorter axes) is pronounced in almost all cases. Also, Koliński et al. noticed that the pair distributions of side chains exhibit some anisotropy, and included this effect in their on-lattice statistical potential.14 The short-range part of the potential is presented in the accompanying work.47

Methods

REPRESENTATION OF POLYPEPTIDE CHAINS AND INTERACTION SCHEME

The united-residue model of polypeptide chains adopted in this work is a natural extension of the model developed in our earlier studies.38,39 The chain is represented by a sequence of α-carbons ($C^{\alpha}$) linked by virtual bonds with attached united side chains (SC) and united peptide groups (p) located in the middle between the consecutive α-carbons. Only the united peptide groups and united side chains serve as interaction sites, the α-carbons assisting in the definition of the geometry (Fig. 1). As in our previous model,38,39 all the virtual bond lengths (i.e., $C^{\alpha}$-$C^{\alpha}$ and $C^{\alpha}$-SC) are fixed; the $C^{\alpha}$-$C^{\alpha}$ distance is taken as 3.8 Å, which corresponds to trans peptide bonds. We allow, however, for variation of the side-chain positions with respect to the backbone ($\alpha_{SC}$ and $\beta_{SC}$), and for the variation of the virtual-bond angles, $\theta$, which were assumed fixed in our earlier approach.38,39
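As an illustration of the backbone geometry just described, the following minimal Python sketch derives the united peptide-group sites (midpoints of consecutive α-carbons), the virtual-bond angles θ, and the virtual-bond dihedral angles γ from a list of α-carbon coordinates. The function names, the plain-tuple coordinate format, and the sign convention chosen for γ are assumptions made for this example only; they are not taken from the original implementation.

```python
import math

CA_CA_BOND = 3.8  # Å; fixed virtual-bond length for trans peptide groups (from the text)

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def peptide_centers(ca):
    """United peptide-group sites p: midpoints of consecutive C-alpha atoms."""
    return [tuple(0.5 * (x + y) for x, y in zip(ca[i], ca[i + 1]))
            for i in range(len(ca) - 1)]

def virtual_bond_angles(ca):
    """theta: angle C-alpha(i-1), C-alpha(i), C-alpha(i+1), in radians."""
    angles = []
    for i in range(1, len(ca) - 1):
        a = _sub(ca[i - 1], ca[i])
        b = _sub(ca[i + 1], ca[i])
        c = _dot(a, b) / (_norm(a) * _norm(b))
        angles.append(math.acos(max(-1.0, min(1.0, c))))
    return angles

def virtual_bond_dihedrals(ca):
    """gamma: dihedral of four consecutive C-alpha atoms, in radians.
    The sign convention is an assumption of this sketch."""
    gammas = []
    for i in range(len(ca) - 3):
        b1 = _sub(ca[i + 1], ca[i])
        b2 = _sub(ca[i + 2], ca[i + 1])
        b3 = _sub(ca[i + 3], ca[i + 2])
        n1 = _cross(b1, b2)
        n2 = _cross(b2, b3)
        b2_unit = tuple(x / _norm(b2) for x in b2)
        m1 = _cross(n1, b2_unit)
        gammas.append(math.atan2(_dot(m1, n2), _dot(n1, n2)))
    return gammas
```

For example, calling these functions on a planar zigzag of four α-carbons spaced `CA_CA_BOND` apart gives three peptide-group sites, two right-angle values of θ, and a single γ of magnitude π.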

JOURNAL OF COMPUTATIONAL CHEMISTRY 851



The energy of the virtual-bond chain is expressed by:

$$
U = \sum_{i<j} U_{SC_i SC_j} + \sum_{i \neq j} U_{SC_i p_j} + w_{el} \sum_{i<j-1} U_{p_i p_j} + w_{tor} \sum_{i} U_{tor}(\gamma_i) + w_{loc} \sum_{i} \left[ U_b(\theta_i) + U_{rot}(\alpha_{SC_i}, \beta_{SC_i}) \right] + w_{corr} U_{corr} \tag{1}
$$

where $U_{SC_i SC_j}$, $U_{SC_i p_j}$, and $U_{p_i p_j}$ denote the energies of the interactions between side chains, between side chains and peptide groups, and between peptide groups, respectively, $U_{tor}(\gamma_i)$ denotes the energy of variation of the virtual-bond dihedral angle $\gamma_i$, $U_b(\theta_i)$ denotes the "bending" energy of the virtual-bond angle $\theta_i$, $U_{rot}(\alpha_{SC_i}, \beta_{SC_i})$ is the local energy of side chain $i$, $U_{corr}$ includes cooperative terms (e.g., the four-body interactions considered by Skolnick et al.,15 as will be shown in part III of the present work) and the $w$ values denote relative weights of the respective energy terms.

FIGURE 1. United-residue representation of a polypeptide chain. The interaction sites are side-chain centroids of different sizes (SC) and peptide-bond centers (p) indicated by dashed circles, whereas the α-carbon atoms (small empty circles) are introduced only to assist in defining the geometry. The virtual $C^{\alpha}$-$C^{\alpha}$ bonds have a fixed length of 3.8 Å, corresponding to a trans peptide group; the virtual bond ($\theta$) and dihedral ($\gamma$) angles are variable. Each side chain is attached to the corresponding α-carbon with a fixed "bond length," $b_{SC_i}$, variable "bond angle," $\alpha_{SC_i}$, formed by $SC_i$ and the bisector of the angle defined by $C^{\alpha}_{i-1}$, $C^{\alpha}_{i}$, and $C^{\alpha}_{i+1}$, and with a variable "dihedral angle" $\beta_{SC_i}$ of counterclockwise rotation about the bisector, starting from the right side of the $C^{\alpha}_{i-1}$, $C^{\alpha}_{i}$, $C^{\alpha}_{i+1}$ frame.

GENERAL PROCEDURE OF PARAMETERIZATION

The following procedures are commonly used to parameterize united-residue potentials:

1. Direct averaging of the all-atom potentials over those degrees of freedom that are lost when passing from the all-atom to the united-residue representation of the polypeptide chain.2-5 This method directly implements the assumption that united-residue potentials are formally all-atom potentials averaged over some "less important" degrees of freedom (such as the dihedral angles, $\chi$, of the side chains). In this approach, functional forms are assumed for all energy terms, with parameters determined by fitting energies computed at chosen values of the variables describing the geometry of interacting sites to those obtained by direct averaging of the all-atom potentials. However, even in the earliest attempts, it was recognized that some extra terms have to be added to account for hydrophobic interactions between the side chains; in the early work of Levitt and Warshel1 and Levitt,2 those terms were assigned according to side-chain hydrophobicities.

2. Determination of the united-residue potentials so as to reproduce the single-body, pair, and possibly triplet distribution or correlation functions, as well as contact free energies determined from protein crystal data.24-26,32-35 In this case, the potential is expressed either in a functional form,24-26 or as a set of values (at given points) obtained by taking the negative logarithms of appropriately scaled correlation functions. The on-lattice potentials and the harmonic potentials of the distance-constraint approach of Goel and Yčas21 and Wako and Scheraga22,23 can be considered to belong to this class, although the latter21-23 used the one- and two-body distribution functions directly.

3. A combination of the two preceding approaches in which some part of the potential is determined by direct averaging of the all-


atom potential, for example, the local and hydrogen-bonding interactions, and some estimated from protein crystal data. Such a division is motivated by the fact that, if direct averaging is computationally feasible, as in the case of the local and hydrogen-bonding interactions, the resulting potential will always be more accurate than that calculated from experimental distribution functions, whose accuracy is severely limited by the sparse number of protein crystal data. Conversely, obtaining the hydrophobic potential by direct averaging is in most cases not feasible, owing to the large number of degrees of freedom over which averaging must be carried out (i.e., the dihedral angles, $\chi$, for each side chain) and possibly to the necessity of including explicit water molecules in the averaging. Such a combination was implemented in our earlier work.38 The local-interaction and backbone hydrogen-bonding terms were determined by direct averaging of the all-atom ECEPP/248,49 potential. This was motivated by the fact that local and hydrogen-bonding interactions are well represented in the ECEPP force field.50 The hydrophobic potential was assumed to have a modified Lennard-Jones form, the parameters being assigned by the use of protein crystal data, namely the side-chain van der Waals radii2 and the interresidue contact free energies.9

4. Determination of the parameters of the potential so as to locate the native structures as global minima for a set of training proteins and, simultaneously, introducing a large energy gap between the near-native and non-native structures. For on-lattice simulations, such an approach based on spin-glass theory was developed by Wolynes and coworkers.20 A similar method was developed for off-lattice simulations by Crippen and coworkers.26-31 In both cases, the resulting potentials appeared successful in predicting the native structures of proteins that were not included in the training sets.

In this work, we have implemented procedure 3 to determine the parameters of individual energy terms (the $U$ values). The side-chain interaction and local terms are parameterized based on correlation functions collected from the Protein Data Bank (PDB). For the peptide-group interaction potential, $U_{pp}$, we use the energy function developed, and then parameterized through averaging of the all-atom ECEPP/2 potential,48,49 in our earlier studies.38,39 The derivation of local-interaction terms $U_b(\theta)$ and $U_{rot}(\alpha_{SC}, \beta_{SC})$ from protein-crystal data will be described in an accompanying article. In the accompanying work,47 we also describe the procedure for the determination of the relative weights so as to locate the native structures of a set of training proteins as the lowest-energy ones. Therefore, our approach is a combination of all the procedures to determine the aforementioned potential. Use of distribution functions or averaging of all-atom potentials to obtain individual energy terms allows us to collect data from the PDB or from all-atom potential functions with meaningful statistics. The use of flexible weights, which constitute a small subset of adjustable parameters, enables us to scale the individual terms so as to obtain a folding potential. The procedure for weight determination is described in the accompanying article.47

MODELING SIDE-CHAIN INTERACTIONS

The general form of the side-chain interaction ($U_{SC_i SC_j}$) parameterized in this work is given by:

$$
U_{ij} = 4 \left[ |\epsilon_{ij}| x_{ij}^{12} - \epsilon_{ij} x_{ij}^{6} \right] \tag{2}
$$

where $\epsilon_{ij}$ is the pair-specific van der Waals well depth, which depends on side-chain orientation for the potentials with angular dependence; as in our earlier work, $\epsilon > 0$ corresponds to hydrophobic-hydrophobic-type and $\epsilon < 0$ to hydrophobic-hydrophilic and hydrophilic-hydrophilic-type interactions. The quantity, $x_{ij}$, is the reciprocal of the reduced distance between side chains; for angular-dependent potentials, it also depends on the orientation of the side chains.

We first consider radial-only potentials. We assumed the following two functional forms for $x_{ij}$:

$$
x_{ij} = \frac{\sigma_{ij}^{0}}{r_{ij}} \tag{3}
$$

$$
x_{ij} = \frac{r_{ij}^{0}}{r_{ij} + r_{ij}^{0} - \sigma_{ij}^{0}} \tag{4}
$$

Equation (3) corresponds to the Lennard-Jones (LJ)-type potential. The constant $\sigma_{ij}^{0}$, in this case, can be identified with the equilibrium van der Waals distance between side chains $i$ and $j$. Equation (4) corresponds to the shifted Lennard-Jones


potential of the form proposed by Kihara51 (hereafter referred to as LJK). In this case, the quantities $\sigma_{ij}^{0} - r_{ij}^{0}$ and $r_{ij}^{0}$ can be identified with the dimensions of the "hard core" and the "soft core" of the interacting bodies, respectively; we express the hard-core diameter as a combination of two terms, $\sigma_{ij}^{0}$ and $r_{ij}^{0}$, to maintain consistency of the notation with that corresponding to the angular potentials.
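The radial-only forms of eqs. (2)-(4) can be sketched in a few lines of Python. This is an illustrative transcription of the formulas only; the function names are hypothetical, and any numerical values passed to them in an example are placeholders rather than fitted parameters from this work.

```python
def reduced_inverse_distance_lj(r, sigma0):
    """Eq. (3): x = sigma0 / r (Lennard-Jones type)."""
    return sigma0 / r

def reduced_inverse_distance_ljk(r, sigma0, r0):
    """Eq. (4): x = r0 / (r + r0 - sigma0) (shifted, Kihara type)."""
    return r0 / (r + r0 - sigma0)

def side_chain_energy(eps, x):
    """Eq. (2): U = 4 * (|eps| * x**12 - eps * x**6).

    eps > 0 gives an attractive well of depth eps (reached at x**6 = 1/2);
    eps < 0 makes both terms positive, i.e., a purely repulsive curve.
    """
    return 4.0 * (abs(eps) * x ** 12 - eps * x ** 6)
```

With these definitions, the energy vanishes at $x_{ij} = 1$ (that is, at $r_{ij} = \sigma_{ij}^{0}$ for both forms), and the LJK expression reduces to the LJ one when $r_{ij}^{0} = \sigma_{ij}^{0}$.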
The constants $r_{ij}^{0}$ and $\sigma_{ij}^{0}$ can be assumed to be pair-specific or calculated from the constants that pertain to single residues:

$$
r_{ij}^{0} = r_{i}^{0} + r_{j}^{0}; \qquad \sigma_{ij}^{0} = \sqrt{\sigma_{i}^{0\,2} + \sigma_{j}^{0\,2}} \tag{5}
$$
To include angular dependence, we considered three forms of anisotropic potentials derived on the basis of the Gaussian-overlap model52: modified Berne-Pechukas52; Gay-Berne,53,54 which is used in liquid-crystal simulations54,55; and, finally, a potential developed by Vorobjev56,57 for nucleic-acid simulations. The latter can be considered as a generalized form of the Gay-Berne potential. These three forms will hereafter be referred to as BP, GB, and GBV, respectively.

All these potentials assume that the interacting sites are ellipsoids of revolution. We placed the centers of the ellipsoids at the centers of mass of the side chains, the long axes being assumed to be collinear with the $C^{\alpha}$-SC axes. To describe the relative orientation of the ellipsoids, it is sufficient to define three angles that describe the relative orientation of their long axes: $\theta_{ij}^{(1)}$, $\theta_{ij}^{(2)}$, and $\phi_{ij}$ (Fig. 2). Although such a model of angular dependence is apparently a very simplified one, it keeps the number of orientational parameters at a reasonable minimum, which enables us to collect data with meaningful statistics from the PDB.

FIGURE 2. Definition of the orientation of two anisotropic side chains, $SC_i$ and $SC_j$, represented by ellipsoids of revolution. The relative position of the centers of the side chain is given by the vector $\mathbf{r}_{ij}$ (of length $r_{ij}$). The principal axes of the ellipsoids are assumed to be collinear with the $C^{\alpha}$-SC lines; their directions are given by the unit vectors $\hat{\mathbf{u}}_{ij}^{(1)}$ and $\hat{\mathbf{u}}_{ij}^{(2)}$. The variables defining the relative orientations of the ellipsoids are the angles $\theta_{ij}^{(1)}$ (the planar angle between $\hat{\mathbf{u}}_{ij}^{(1)}$ and $\mathbf{r}_{ij}$), $\theta_{ij}^{(2)}$ (the planar angle between $\hat{\mathbf{u}}_{ij}^{(2)}$ and $\mathbf{r}_{ij}$), and $\phi_{ij}$ (the angle of counterclockwise rotation of the vector $\hat{\mathbf{u}}_{ij}^{(2)}$ about the vector $\mathbf{r}_{ij}$ from the plane defined by $\hat{\mathbf{u}}_{ij}^{(1)}$ and $\mathbf{r}_{ij}$) when looking from the center of $SC_j$ toward the center of $SC_i$.

The expressions for all three potentials are obtained by introducing the angular dependence into $\epsilon_{ij}$ and $x_{ij}$ of the general expression given by eq. (2):

$$
\begin{aligned}
\epsilon_{ij} &\equiv \epsilon\left(\omega_{ij}^{(1)}, \omega_{ij}^{(2)}, \omega_{ij}^{(12)}\right) = \epsilon_{ij}^{0}\, \epsilon_{ij}^{(1)}\, \epsilon_{ij}^{(2)}\, \epsilon_{ij}^{(3)} \\
\epsilon_{ij}^{(1)} &= \left[ 1 - \chi_{ij}^{(1)} \chi_{ij}^{(2)} \omega_{ij}^{(12)2} \right]^{-1/2} \\
\epsilon_{ij}^{(2)} &= \left[ 1 - \frac{\chi_{ij}'^{(1)} \omega_{ij}^{(1)2} + \chi_{ij}'^{(2)} \omega_{ij}^{(2)2} - 2 \chi_{ij}'^{(1)} \chi_{ij}'^{(2)} \omega_{ij}^{(1)} \omega_{ij}^{(2)} \omega_{ij}^{(12)}}{1 - \chi_{ij}'^{(1)} \chi_{ij}'^{(2)} \omega_{ij}^{(12)2}} \right]^{2} \\
\epsilon_{ij}^{(3)} &= \left[ 1 - \alpha_{ij}^{(1)} \omega_{ij}^{(1)} + \alpha_{ij}^{(2)} \omega_{ij}^{(2)} - 0.5 \left( \alpha_{ij}^{(1)} + \alpha_{ij}^{(2)} \right) \omega_{ij}^{(12)} \right]^{2}
\end{aligned}
\tag{6}
$$

$$
x_{ij} \equiv x\left(r_{ij}, \omega_{ij}^{(1)}, \omega_{ij}^{(2)}, \omega_{ij}^{(12)}\right) =
\begin{cases}
\dfrac{\sigma_{ij}}{r_{ij}} & \text{for the BP potential} \\[2ex]
\dfrac{\sigma_{ij}^{0}}{r_{ij} - \sigma_{ij} + \sigma_{ij}^{0}} & \text{for the GB potential} \\[2ex]
\dfrac{r_{ij}^{0}}{r_{ij} - \sigma_{ij} + r_{ij}^{0}} & \text{for the GBV potential}
\end{cases}
\tag{7}
$$


with:

$$
\begin{aligned}
\sigma_{ij} &= \sigma_{ij}^{0} \left[ 1 - \frac{\chi_{ij}^{(1)} \omega_{ij}^{(1)2} + \chi_{ij}^{(2)} \omega_{ij}^{(2)2} - 2 \chi_{ij}^{(1)} \chi_{ij}^{(2)} \omega_{ij}^{(1)} \omega_{ij}^{(2)} \omega_{ij}^{(12)}}{1 - \chi_{ij}^{(1)} \chi_{ij}^{(2)} \omega_{ij}^{(12)2}} \right]^{-1/2} \\
\omega_{ij}^{(1)} &= \hat{\mathbf{u}}_{ij}^{(1)} \cdot \hat{\mathbf{r}}_{ij} = \cos\theta_{ij}^{(1)} \\
\omega_{ij}^{(2)} &= \hat{\mathbf{u}}_{ij}^{(2)} \cdot \hat{\mathbf{r}}_{ij} = \cos\theta_{ij}^{(2)} \\
\omega_{ij}^{(12)} &= \hat{\mathbf{u}}_{ij}^{(1)} \cdot \hat{\mathbf{u}}_{ij}^{(2)} = \cos\theta_{ij}^{(1)} \cos\theta_{ij}^{(2)} + \sin\theta_{ij}^{(1)} \sin\theta_{ij}^{(2)} \cos\phi_{ij}
\end{aligned}
\tag{8}
$$

where $\hat{\mathbf{u}}_{ij}^{(1)}$ and $\hat{\mathbf{u}}_{ij}^{(2)}$ are unit vectors along the principal axes of the interacting sites (in this work identified with the $C^{\alpha}$-SC axes), $\mathbf{r}_{ij}$ is the vector linking the centers of the sites, $\hat{\mathbf{r}}_{ij}$ is the corresponding unit vector, $r_{ij}$ is the distance between the side-chain centers (Fig. 2), the constants $\chi_{ij}^{(1)}$ and $\chi_{ij}^{(2)}$ are the anisotropies of the van der Waals radius, and the constants $\chi_{ij}'^{(1)}$ and $\chi_{ij}'^{(2)}$ are the anisotropies of the van der Waals well depth.

The angular dependence of $\epsilon_{ij}^{(1)}$ and $x_{ij}$ arises from the extension of the Gaussian overlap potential to the LJ-type function.53 Additional dependence of the van der Waals well depth on orientation in the form of $\epsilon_{ij}^{(2)}$ has been introduced by GB.53 For the original BP potential, $\epsilon_{ij}^{(2)} = 1$, but we keep its orientational dependence to preserve the same form of the potential. The formulas are generalized in this work to the case of ellipsoids of revolution with different axes (the BP and GB potentials were originally derived for the interaction of identical ellipsoidal bodies). Finally, the function $\epsilon_{ij}^{(3)}$ with the constants $\alpha_{ij}^{(1)}$ and $\alpha_{ij}^{(2)}$ has been introduced in this work to account for the lower symmetry of the angular distribution functions observed in protein crystals than that implied by the three potentials outlined previously. Squaring in the expressions for $\epsilon_{ij}^{(2)}$ and $\epsilon_{ij}^{(3)}$ is done to keep them non-negative.

As in the case of the radial potentials, the constants $\sigma_{ij}^{0}$, $r_{ij}^{0}$, $\chi_{ij}^{(1)}$, $\chi_{ij}^{(2)}$, $\chi_{ij}'^{(1)}$, $\chi_{ij}'^{(2)}$, $\alpha_{ij}^{(1)}$, and $\alpha_{ij}^{(2)}$ can be assumed to be pair-dependent or constrained to be calculated from single-residue constants. In this work, we tried both procedures. For the constants $r_{ij}^{0}$ and $\sigma_{ij}^{0}$, the formulas are given by eq. (5). For the case of different ellipsoids, the anisotropies of the van der Waals distances can be expressed by eq. (9):

$$
\chi_{ij}^{(1)} = \frac{\sigma_{i}^{\parallel 2} - \sigma_{i}^{\perp 2}}{\sigma_{i}^{\parallel 2} + \sigma_{j}^{\perp 2}}; \qquad \chi_{ij}^{(2)} = \frac{\sigma_{j}^{\parallel 2} - \sigma_{j}^{\perp 2}}{\sigma_{j}^{\parallel 2} + \sigma_{i}^{\perp 2}} \tag{9}
$$

Further, for the Gaussian overlap model, it follows52 that $\sigma_{i}^{\perp} = \sigma_{i}^{0}$ and $\sigma_{j}^{\perp} = \sigma_{j}^{0}$. The constants $\sigma^{\perp}$ and $\sigma^{\parallel}$ can be identified with the lengths of the short and long axes of the ellipsoids, respectively. Our variable parameters were $\sigma^{0}$ and the ratio $(\sigma^{\parallel}/\sigma^{\perp})^{2}$; the first one can be considered as a measure of the size, and the second of the anisotropy of a side chain.

The same type of dependence can be assumed for the constants $\chi'$, but then there would be too many parameters to be determined. Therefore, we assumed that the constants $\chi'$ and $\alpha$ depend on single-residue types, namely:

$$
\chi_{ij}'^{(1)} = \chi_{i}', \quad \chi_{ij}'^{(2)} = \chi_{j}'; \qquad \alpha_{ij}^{(1)} = \alpha_{i}, \quad \alpha_{ij}^{(2)} = \alpha_{j} \tag{10}
$$

It should be noted that, in the case of isotropic interactions, the GB and BP potentials become the LJ potential, whereas the GBV potential becomes the LJK potential.

PARAMETERIZING SIDE-CHAIN INTERACTION POTENTIALS

Similar to earlier work,24-26 we determine the parameters of the potentials introduced in the preceding section by fitting them to correlation functions and contact free energies calculated from protein-crystal data. In doing this, we make the following two assumptions:

1. The correlation functions obtained by using a sufficiently large number of protein crystal data (each of which corresponds to a system at a free-energy minimum) are sufficiently good approximations to the correlation functions of a hypothetical "stochastic" mixture of nonconnected side chains. This approximation is justified by the observation that, although a crystal structure is at equilibrium as the whole structure, its individual parts can be forced to assume geometries far from locally equilibrated, locally lower energy conformations having, however, higher probability of occurrence in the whole structure.58 For example, the distributions of X-H bond lengths obtained from large data bases of

crystal structures are qualitatively similar to those calculated from potential-energy surfaces of proton transfer.58

2. Interactions between the side chains can be described with sufficient accuracy by using the potential of mean force, $W_{ij}(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij})$, related directly to the corresponding side-chain pair correlation functions, $g_{ij}(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij})$:

$$
g_{ij}\left(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij}\right) = \exp\left[ -W_{ij}\left(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij}\right) / RT \right] \tag{11}
$$

where $R$ and $T$ are the gas constant and the absolute temperature, respectively. According to point 1, the reference state in eq. (11) corresponds to a hypothetical polypeptide chain with noninteracting side chains (the unfolded state according to the classification of Godzik et al.59).

Because we want to exclude the effects of local interactions [since the local interactions are included in the terms $U_b(\theta)$, $U_{tor}(\gamma)$, and $U_{rot}(\alpha_{SC}, \beta_{SC})$ of eq. (1)], we consider only the side chains that are separated by at least ten peptide groups. This also makes it legitimate to disregard the direction of the chain separating the residues; therefore, we assume that $W_{i_k, j_l} = W_{i_l, j_k}$, where $i_k$ denotes a residue of type $i$ occupying the $k$th position in the chain.

To avoid the influence of many-body and boundary effects on the distributions at large distances, we confine our treatment to a short-range distance limit $r \le r_{ij}^{max}$, and we express the potential of mean force by one of the analytical forms given by eqs. (2)-(8). The upper distance limits $r_{ij}^{max}$ are defined so that only the regions of the first peak of the correlation functions are considered, in which the potential of mean force is unimodal in $r$:

$$
r_{ij}^{max} = \min\left\{ 0.5 \left( r_{i}^{0;L} + r_{j}^{0;L} \right) + 2\ \text{Å},\ 8\ \text{Å} \right\} \tag{12}
$$

where $r^{0;L}$ are the mean side-chain van der Waals radii for each of the 20 naturally occurring amino acids calculated by Levitt.2

For the LJ and LJK potentials [eqs. (3) and (4)], which depend only on the distance between the side chains, $g_{ij}(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij})$ of eq. (11) is replaced by the radial-only correlation function, $g_{ij}(r_{ij})$ [which is equivalent to $g_{ij}(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij})$ averaged over the angles $\theta_{ij}^{(1)}$, $\theta_{ij}^{(2)}$, and $\phi_{ij}$]. Thus, the potentials of mean force and, in turn, the correlation functions depend parametrically on the constants appearing in eqs. (3)-(8), which can be optimized so that the theoretical correlation functions given by eq. (11) fit best (in the least-squares sense) to the correlation functions determined from protein crystal data.

The limited number of data that we are able to collect from protein crystals prohibits the direct use of the full radial-and-angular correlation function $g_{ij}(r_{ij}, \theta_{ij}^{(1)}, \theta_{ij}^{(2)}, \phi_{ij})$. Therefore, our target function for parameter estimation includes correlation functions that are averaged over some of the variables, and side-chain contact free energies. The side-chain contact free energies are the logarithms of the correlation functions averaged over the coordination sphere of the interacting side chains. To determine the parameters of the potentials, we minimized the weighted sum of the squares $\Phi(\mathbf{X})$ of the differences between histograms of the radial ($H_{ij;k}^{r}$), radial-angular ($H_{ij;klm}^{r\theta}$), and angular ($H_{ij;klm}^{\theta\phi}$) correlation functions, as well as contact free energies ($F_{ij}$), calculated as functions of the parameters, and determined from protein-crystal data, respectively:

$$
\begin{aligned}
\Phi(\mathbf{X}) = {} & \sum_{i=1}^{20} \sum_{j=1}^{20} w_{ij} \Bigg\{ w_{r} \sum_{k=1}^{n_{r_{ij}}} \left[ H_{ij;k}^{r}(\mathbf{X}) - \hat{H}_{ij;k}^{r} \right]^{2} \\
& + w_{\theta\phi} \sum_{k=1}^{n_{\theta}} \sum_{l=1}^{n_{\theta}} \sum_{m=1}^{n_{\phi}} \left[ H_{ij;klm}^{\theta\phi}(\mathbf{X}) - \hat{H}_{ij;klm}^{\theta\phi} \right]^{2} \\
& + w_{r\theta} \sum_{k=1}^{n_{r_{ij}}} \sum_{l=1}^{n_{\theta}} \sum_{m=1}^{n_{\theta}} \left[ H_{ij;klm}^{r\theta}(\mathbf{X}) - \hat{H}_{ij;klm}^{r\theta} \right]^{2} \\
& + w_{F} \left[ F_{ij}(\mathbf{X}) - \hat{F}_{ij} \right]^{2} \Bigg\} \\
= {} & w_{r} \Phi^{r}(\mathbf{X}) + w_{\theta\phi} \Phi^{\theta\phi}(\mathbf{X}) + w_{r\theta} \Phi^{r\theta}(\mathbf{X}) + w_{F} \Phi^{F}(\mathbf{X})
\end{aligned}
\tag{13}
$$

where the indices $i$, $j$ run over all 20 naturally occurring amino acids. Also:

$$
w_{ij} = \sum_{p=1}^{n_p} w_{p} n_{ij} \Bigg/ \sum_{i \le j} \sum_{p=1}^{n_p} w_{p} n_{ij;p} \tag{14}
$$

is the statistical weight of the pair of side chains of types $i$ and $j$ ($w_p$ and $n_p$ being the statistical weight of the $p$th protein and the total number of protein structures in the sample, respectively, and $n_{ij}$ and $n_{ij;p}$ being the total number of pairs of residues of types $i$ and $j$ and the number of such

pairs in protein p, respectively); $n_{r_{ij}}$ is the number of distance values considered for a pair of side chains of types i and j; $n_\theta$ and $n_\phi$ are the numbers of values of the angles $\theta^{(1)}$ or $\theta^{(2)}$, and $\phi$; $w^{\theta\phi}$, $w^{r}$, $w^{r\theta}$, and $w^{F}$ are weights of the histograms and of the free energy, respectively; X is a shorthand for the parameters of the target potential, and a "hat" (circumflex) over a quantity designates the value determined from the crystal data. For the radial-only potentials, LJ and LJK, the angular and radial-angular components $\Phi^{\theta\phi}(X)$ and $\Phi^{r\theta}(X)$ do not appear in the expression for $\Phi(X)$.

The histograms of the correlation functions $H^{r}_{ij;k}$, $H^{\theta\phi}_{ij;klm}$, and $H^{r\theta}_{ij;klm}$ are defined by:

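$$ H^{r}_{ij;k} = \frac{g^{r}_{ij}(r_k)}{\sum_{k=1}^{n_{r_{ij}}} g^{r}_{ij}(r_k)} $$

$$ H^{\theta\phi}_{ij;klm} = \frac{g^{\theta\phi}_{ij}(\theta_k^{(1)}, \theta_l^{(2)}, \phi_m)\, \Delta\cos\theta_k^{(1)}\, \Delta\cos\theta_l^{(2)}}{\sum_{k=1}^{n_\theta} \sum_{l=1}^{n_\theta} \sum_{m=1}^{n_\phi} g^{\theta\phi}_{ij}(\theta_k^{(1)}, \theta_l^{(2)}, \phi_m)\, \Delta\cos\theta_k^{(1)}\, \Delta\cos\theta_l^{(2)}} $$

$$ H^{r\theta}_{ij;klm} = \frac{g^{r\theta}_{ij}(r_k, \theta_l^{(1)}, \theta_m^{(2)})\, \Delta\cos\theta_l^{(1)}\, \Delta\cos\theta_m^{(2)}}{\sum_{k=1}^{n_{r_{ij}}} \sum_{l=1}^{n_\theta} \sum_{m=1}^{n_\theta} g^{r\theta}_{ij}(r_k, \theta_l^{(1)}, \theta_m^{(2)})\, \Delta\cos\theta_l^{(1)}\, \Delta\cos\theta_m^{(2)}} \qquad (15) $$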

where $g^{r}_{ij}(r_k)$, $g^{\theta\phi}_{ij}(\theta_k^{(1)}, \theta_l^{(2)}, \phi_m)$, and $g^{r\theta}_{ij}(r_k, \theta_l^{(1)}, \theta_m^{(2)})$ are the average values of the radial, angular, and radial-angular correlation functions within bins defined as $[r_k - \Delta r/2, r_k + \Delta r/2]$, $[\theta_k^{(1)} - \Delta\theta/2, \theta_k^{(1)} + \Delta\theta/2] \times [\theta_l^{(2)} - \Delta\theta/2, \theta_l^{(2)} + \Delta\theta/2] \times [\phi_m - \Delta\phi/2, \phi_m + \Delta\phi/2]$, and $[r_k - \Delta r/2, r_k + \Delta r/2] \times [\theta_l^{(1)} - \Delta\theta/2, \theta_l^{(1)} + \Delta\theta/2] \times [\theta_m^{(2)} - \Delta\theta/2, \theta_m^{(2)} + \Delta\theta/2]$, respectively, with $\Delta r$, $\Delta\theta$, and $\Delta\phi$ being the dimensions of the bins. The definitions and method of calculation of the correlation functions from protein crystal data are given in the Appendix.

The purpose of the introduction of the factors $\Delta\cos\theta_k^{(1)}\, \Delta\cos\theta_l^{(2)}$ into the angular and radial-angular terms is to avoid overweighting the regions around $\theta^{(1)}$ or $\theta^{(2)}$ = 0° or 180°, in which the number of counts is very small, which results in poor accuracy there of the angular-distribution functions.

The free energies of contact interaction were calculated from protein crystal data using the quasichemical approximation procedure developed by Miyazawa and Jernigan.9 We chose this approach because it takes into account the fact that competitive interactions between residues of different types occur in real proteins. We chose a radius of 8 Å for the coordination sphere, which is greater than the 6.5-Å radius used by Miyazawa and Jernigan. The reason for this was that not many hydrophilic contacts are present within the 6.5-Å coordination sphere, which results in poorer statistics. Then, we scaled these free energies to be compatible with free energies of transfer of amino-acid side chains from n-octanol to water determined by Fauchère and Pliška60; this will be described in the subsection "Contact Free Energies" of the "Results" section. The corresponding "theoretical" values of the free energies are calculated from:

$$ F_{ij}(X) = -RT \ln\left[ (1/V^{c}_{ij}) \int_{\Omega^{c}_{ij}} g_{ij}(\Delta, \vartheta^{(1)}, \vartheta^{(2)}, \varphi)\, dV \right] \qquad (16) $$

where $\Omega^{c}_{ij} = S(0, R_c) - S(0, r_i^{c} + r_j^{c})$ is the allowed coordination sphere corresponding to pair ij [$S(a, r)$ denoting the region of space bounded by a sphere of radius r centered at point a], $V^{c}_{ij}$ is the volume of $\Omega^{c}_{ij}$, $R_c$ is the maximum radius of the coordination sphere (assumed to be 8 Å in this work), and $r_i^{c}$ and $r_j^{c}$ are the contact radii of the side chains calculated from their volumes (see Table III in ref. 9).

Because the relative weights of the angular, radial-angular, radial, and contact-free-energy terms in eq. (13) were not known a priori, estimation of the parameters by minimization of expression (13) is a generalized least-squares problem.61 In such a case, the weights are usually estimated as the inverses of the squares of the residuals, and the resulting sum of squares is minimized successively with iteratively updated weights, until the calculated and the assumed weights are consistent. Thus, the estimates $\tilde{w}^{r\theta}$, $\tilde{w}^{r}$, $\tilde{w}^{\theta\phi}$, and $\tilde{w}^{F}$ of the weights of the terms in eq. (13) can be expressed


by:

$$ \tilde{w}^{r} = 1/\sigma_r^2 = \Big\{ (1/S_w) \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij}\, (1/n_{r_{ij}}) \sum_{k=1}^{n_{r_{ij}}} [H^{r}_{ij;k}(X) - \hat{H}^{r}_{ij;k}]^2 \Big\}^{-1} $$

$$ \tilde{w}^{\theta\phi} = 1/\sigma_{\theta\phi}^2 = \Big\{ (1/S_w n_\theta^2 n_\phi) \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij} \sum_{k=1}^{n_\theta} \sum_{l=1}^{n_\theta} \sum_{m=1}^{n_\phi} [H^{\theta\phi}_{ij;klm}(X) - \hat{H}^{\theta\phi}_{ij;klm}]^2 \Big\}^{-1} $$

$$ \tilde{w}^{r\theta} = 1/\sigma_{r\theta}^2 = \Big\{ (1/S_w n_\theta^2) \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij}\, (1/n_{r_{ij}}) \sum_{k=1}^{n_{r_{ij}}} \sum_{l=1}^{n_\theta} \sum_{m=1}^{n_\theta} [H^{r\theta}_{ij;klm}(X) - \hat{H}^{r\theta}_{ij;klm}]^2 \Big\}^{-1} $$

$$ \tilde{w}^{F} = 1/\sigma_F^2 = \Big\{ (1/S_w) \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij}\, [F_{ij}(X) - \hat{F}_{ij}]^2 \Big\}^{-1} \qquad (17) $$

where $\sigma^2$ is a variance of the corresponding quantity, $S_w = \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij}$, and the other symbols are defined in eq. (13).

The standard deviations of the parameters were estimated according to the Gauss–Markov formula62:

$$ [\sigma(x_i)]^2 = \frac{\Phi(X^*)}{N - p}\, [J^T(X^*)\, W(X^*)\, J(X^*)]^{-1}_{ii} \qquad (18) $$

where $N = 210\, n_\theta^2 n_\phi + (n_\theta^2 + 1) \sum_{i=1}^{20} \sum_{j=i}^{20} n_{r_{ij}} + 210$ is the total number of terms in eq. (13), p is the total number of parameters, $J_{ij} = \partial d_i / \partial x_j$ is the element of the matrix of first derivatives of the residuals $d_1, d_2, \ldots, d_N$ that occur in the sum of the squares; W is the corresponding matrix of weights, and X* denotes the vector of the parameters at the minimum.

SELECTION OF PROTEIN STRUCTURES

The protein crystal structures were taken from the Brookhaven Protein Data Bank. First, the list of available structures obtained from the PDB server pdb.pdb.bnl.gov (as of June 25, 1994) was scanned, and those with a resolution not exceeding 2 Å and a chain length of at least 100 amino-acid residues were selected. Then, the percentage of sequence homology was calculated for all pairs of sequences using the FASTA program63,64 available on anonymous ftp from uvaarpa.virginia.edu. Then, cluster analysis was carried out with the minimal-tree algorithm,65 taking the values of (100% − percentage homology) as distances between pairs of structures. This grouped the selected proteins into 154 families of homologous structures. From each family, those structures were taken that had the highest resolution or, if the resolution was the same, the longest chain(s). In several cases, however, both criteria were satisfied by more than one structure. In such a case, we took all the structures satisfying the criteria, diminishing their statistical weights when calculating the histograms of pair-correlation functions and contact free energies. The final list contained 195 structures, whose identities are summarized in Table I of the Supplementary Material.

Results and Discussion

DISTRIBUTION AND CORRELATION FUNCTIONS

In all calculations, we assumed that the centers of interactions are in the geometric centers of the side chains, calculated from the coordinates of the nonhydrogen atoms, including C$^\alpha$, as expressed by:

$$ R_i = \frac{1}{NH_i} \sum_{j=1}^{NH_i} r_{ji} \qquad (19) $$

where $R_i$ represents the coordinates of the geometric center of the ith side chain, $r_{ji}$ represents the coordinates of the jth nonhydrogen atom of the ith side chain, and $NH_i$ is the number of nonhydrogen atoms in side chain i. The index i denotes an individual side chain in the data base and not the side-chain type. For glycine the position of the side-chain atom coincides with the position of C$^\alpha$.

When calculating the pair distributions and contact free energies, we excluded disulfide-bonded cystine pairs; however, the nonbonded cysteine pairs were included. The weights of the structures were calculated from the following formula:

$$ w_p = \frac{1}{n_{chain}\, n_{hom}\, res^2} \qquad (20) $$


where $n_{chain}$ is the number of equivalent chains in a protein, $n_{hom}$ is the number of homologous structures, if more than one was taken from a family, and res is the crystallographic resolution. The weights, $w_p$, appear in equations in the Appendix.

The single-body densities of the amino-acid side chains were calculated from eq. (A-12) of the Appendix, and were subsequently used in the evaluation of the factor $T_{ij}$ used for the calculation of reference (i.e., in the absence of any side-chain interactions) radial and radial-angular probability distributions [eqs. (A-7) and (A-9) of the Appendix]. A sample collection of radial pair densities together with the reference and total pair distribution functions is shown for the Leu–Leu pair in Figure 3. The radial distribution (curve C of Fig. 3) qualitatively resembles that calculated theoretically by Gan and Eu66 in their study of model van der Waals polymer chains. As shown, the distribution calculated from single-body density and the Markovian factor (curve B of Fig. 3) approximates quite well the Leu–Leu pair distribution function (curve C of Fig. 3) for distances longer than 10 Å. The correlation function (curve A of Fig. 3) at distances longer than 10 Å is almost constant. Greater deviations occur only at very long distances; this can be explained by the fact that the single-body density is determined with poor accuracy at longer distances.

To calculate the reference angular distribution functions, we averaged the computed angular distribution functions over all pairs of side chains, using the method of Hao et al.67 (see eq. (A-13) and the following text in the Appendix for details). However, the angular pair correlation functions, calculated from eq. (A-5), still exhibited the behavior of the "background" correlation function (solid curve of Fig. 4) for many pairs of residues. By least-squares fitting, we found that the "background" angular pair correlation function can be described by $\epsilon^{(3)}$ of eq. (6). Therefore, we included $\epsilon^{(3)}$ in the angular potentials.
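The geometric center of eq. (19) and the structure weight of eq. (20) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample coordinates are hypothetical and are not taken from the original work.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def geometric_center(atoms: Sequence[Vec3]) -> Vec3:
    """Eq. (19): mean position of the nonhydrogen side-chain atoms (C-alpha included)."""
    n = len(atoms)
    if n == 0:
        raise ValueError("a side chain needs at least one atom (C-alpha for Gly)")
    return tuple(sum(atom[k] for atom in atoms) / n for k in range(3))


def structure_weight(n_chain: int, n_hom: int, resolution: float) -> float:
    """Eq. (20): w_p = 1 / (n_chain * n_hom * res**2)."""
    return 1.0 / (n_chain * n_hom * resolution ** 2)


# Hypothetical two-atom side chain (coordinates in angstroms), for illustration only.
two_atom_side_chain = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
center = geometric_center(two_atom_side_chain)

# Hypothetical structure: two equivalent chains, one family member, 2.0-angstrom resolution.
w_p = structure_weight(n_chain=2, n_hom=1, resolution=2.0)
```

With this weighting, a structure solved at poorer resolution or duplicated by equivalent chains contributes proportionally less to the histograms.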

FIGURE 3. Sample pair-distribution and pair-correlation functions for the Leu–Leu pair averaged over consecutive 0.5-Å shells. (A) Radial pair correlation functions $g^{r}_{ij}$; (B) the reference pair distribution function n(2, 0, r) [denominator in eq. (A-4)]; and (C) the total pair distribution function n(2, r) [numerator in eq. (A-4)]. All graphs were normalized to the maximum values of 1.0.


FIGURE 4. The angular correlation functions $g^{\theta\phi}(\theta^{(1)}, \theta^{(2)}, \phi)$ averaged over all pairs of side-chain types and over the angle $\phi$ for display purposes. The surfaces were normalized so that 1 is the maximum value for both. Solid surface: the function obtained from the PDB averaged over all the pairs of side chains. Dashed surface: the function obtained in simulation studies. The latter were carried out by generating a total of 1000 50-residue energy-minimized chains with random sequence confined to the ellipsoid characteristic of proteins of this size, according to the method of Hao et al.,67 with the united-residue representation of polypeptide chains and the energy function developed in our earlier work38,39; that is, the side-chain interaction potential did not have explicit angular dependence. The simulation study shows that the background angular pair correlation function is not constant, even for a radial side-chain interaction potential.
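The normalization of the angular histograms in eq. (15) can be illustrated with a short Python sketch. It assumes that $\Delta\cos\theta$ denotes the width of a polar-angle bin measured in $\cos\theta$, and it uses a hypothetical grid of 6 × 6 × 6 bins with a flat correlation function; neither the grid nor the function names come from the original work.

```python
import math
from typing import List

Grid3 = List[List[List[float]]]


def delta_cos(theta_center: float, d_theta: float) -> float:
    """Width of one polar-angle bin measured in cos(theta) (assumed meaning of Delta cos)."""
    return math.cos(theta_center - d_theta / 2) - math.cos(theta_center + d_theta / 2)


def angular_histogram(g: Grid3, theta_centers: List[float], d_theta: float) -> Grid3:
    """Weight g[k][l][m] by the two Delta-cos factors and normalize the sum to 1."""
    dc = [delta_cos(t, d_theta) for t in theta_centers]
    weighted = [
        [[value * dc[k] * dc[l] for value in row] for l, row in enumerate(plane)]
        for k, plane in enumerate(g)
    ]
    total = sum(value for plane in weighted for row in plane for value in row)
    return [[[value / total for value in row] for row in plane] for plane in weighted]


# Hypothetical grid: 6 bins in each polar angle and 6 bins in the dihedral angle.
n_theta, n_phi = 6, 6
d_theta = math.pi / n_theta
centers = [(k + 0.5) * d_theta for k in range(n_theta)]
flat_g = [[[1.0] * n_phi for _ in range(n_theta)] for _ in range(n_theta)]
hist = angular_histogram(flat_g, centers, d_theta)
```

Even for a flat correlation function, the bins next to 0° or 180° receive a smaller share than those near 90°, which is the down-weighting of sparsely populated polar regions that the text describes.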

CONTACT FREE ENERGIES

The calculated contact free energies, together with the numbers of contacts, are summarized in Table I.

Even though the set of 195 structures (see Table I of the Supplementary Material) has almost no structure in common with that used by Miyazawa and Jernigan (MJ),9 the computed contact free energies correlate well with those determined by MJ, the correlation equation being:

$$ e_{ij}^{MJ} = 1.580\,(0.028)\, e_{ij}(R_c = 8\ \text{Å}) + 2.135\,(0.093); \quad R = 0.9689 \qquad (21) $$

where R denotes the correlation coefficient, and the numbers in parentheses are the standard deviations of the slope and intercept, respectively. When the radius of the coordination sphere was taken as 6.5 Å (as used by MJ), and the contact free energies were re-evaluated, the correlation equation became:

$$ e_{ij}^{MJ} = 0.844\,(0.026)\, e_{ij}(R_c = 6.5\ \text{Å}) + 0.45\,(0.11); \quad R = 0.9159 \qquad (22) $$

Thus, the slope from eq. (22) is closer to 1.0, as expected.

The correlation coefficient of our contact free energies with the contact free energies derived by Tanaka and Scheraga8 with a smaller data base, with the definition of contact similar to that used in our work except that their distances were measured between the $\alpha$-carbons, is 0.8670. By contrast, the correlation of our contact free energies with those determined by Kolinski et al.14 or Gregoret and Cohen10 is quite weak, the correlation coefficients being 0.6019 and 0.2261, respectively. This is understandable because the reference state in the latter two potentials corresponds to side chains arranged in a hydrophobic core and a hydrophilic exterior (the $U_{phil;phob}$ state according to



TABLE I.
Contact Free Energies (RT Units; Diagonal and Upper Triangle) and Total Number of Counts Within the 8-Å Coordination Sphere (Lower Triangle, and the Last Line for Diagonal Elements) for Pairs of Amino-Acid Residues.

       Cys   Met   Phe   Ile   Leu   Val   Trp   Tyr   Ala   Gly   Thr   Ser   Gln   Asn   Glu   Asp   His   Arg   Lys   Pro

Cys  -4.54 -4.72 -4.94 -4.90 -4.72 -4.53 -4.60 -4.08 -3.93 -3.68 -3.73 -3.56 -3.31 -3.14 -2.97 -3.12 -3.87 -3.02 -2.71 -3.50
Met    198 -4.80 -5.03 -4.95 -4.91 -4.59 -4.87 -4.30 -3.93 -3.40 -3.62 -3.35 -3.24 -3.04 -2.90 -2.80 -3.80 -3.15 -2.58 -3.43
Phe    476   735 -5.22 -5.25 -5.18 -4.89 -5.07 -4.43 -4.04 -3.52 -3.70 -3.46 -3.25 -3.17 -2.89 -2.95 -3.95 -3.26 -2.70 -3.54
Ile    541   817  1920 -5.29 -5.21 -5.01 -4.98 -4.51 -4.18 -3.61 -3.87 -3.58 -3.30 -3.13 -3.11 -3.09 -3.74 -3.42 -2.95 -3.62
Leu    739  1365  3123  3816 -5.03 -4.85 -4.87 -4.32 -4.06 -3.56 -3.61 -3.47 -3.17 -3.09 -2.85 -2.84 -3.69 -3.24 -2.74 -3.55
Val    598   937  2365  3175  4731 -4.64 -4.63 -4.04 -3.93 -3.42 -3.58 -3.35 -3.05 -3.02 -2.84 -2.82 -3.44 -3.00 -2.69 -3.39
Trp    131   240   549   601   924   741 -4.74 -4.20 -3.81 -3.47 -3.44 -3.30 -3.25 -3.23 -3.04 -3.12 -3.94 -3.42 -2.84 -3.60
Tyr    283   464   962  1227  1753  1327   366 -3.69 -3.41 -3.12 -3.16 -2.97 -2.90 -2.82 -2.71 -2.84 -3.42 -3.01 -2.58 -3.23
Ala    679  1105  2071  2889  4762  4279   639  1445 -3.28 -2.91 -2.97 -2.76 -2.67 -2.59 -2.45 -2.52 -2.95 -2.56 -2.28 -2.81
Gly    788   750  1391  1878  3126  2944   561  1273  3750 -2.62 -2.78 -2.61 -2.39 -2.41 -2.12 -2.34 -2.77 -2.43 -2.08 -2.60
Thr    446   612  1220  1644  2140  2225   348   877  2543  2438 -2.81 -2.74 -2.46 -2.43 -2.33 -2.43 -2.97 -2.55 -2.13 -2.68
Ser    550   579  1068  1543  2335  2131   381   927  2379  2344  1737 -2.47 -2.34 -2.36 -2.22 -2.31 -2.76 -2.46 -2.01 -2.57
Gln    251   329   472   644  1056   899   219   475  1319  1146   801   840 -1.94 -2.26 -1.93 -2.09 -2.48 -2.25 -1.84 -2.35
Asn    250   290   659   766  1215  1238   295   631  1598  1566  1093  1172   598 -2.24 -2.14 -2.21 -2.60 -2.19 -1.95 -2.28
Glu    281   392   673  1071  1356  1354   281   679  1957  1447  1241  1407   604   919 -1.60 -1.77 -2.49 -2.61 -2.20 -2.09
Asp    308   343   718  1004  1238  1241   336   813  2096  1910  1410  1389   675  1103   867 -1.85 -2.70 -2.67 -2.22 -2.16
His    218   287   597   556   939   708   238   469   868   893   689   618   289   443   555   721 -3.27 -2.60 -2.02 -2.69
Arg    203   312   561   831  1252   932   269   603  1312  1343   953  1043   538   637  1308  1394   375 -2.03 -1.47 -2.29
Lys    262   394   705  1122  1481  1595   355   782  2077  1758  1324  1385   609   995  1721  1778   483   542 -1.14 -1.95
Pro    305   391   771  1000  1628  1371   354   743  1604  1507  1100  1143   561   763   821   803   434   600   865 -2.53

Diag   260   216   944  1237  3213  2185    97   360  2534  1793   842   971   223   452   421   504   261   270   443   396
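Entries of Table I can be combined into the ideal and excess pair-interaction free energies that the text defines alongside eq. (26). The following Python sketch is an illustration only: it uses a handful of values copied from Table I (RT units), and the function and variable names are hypothetical.

```python
def excess_free_energy(e_ij: float, e_ii: float, e_jj: float) -> float:
    """Excess pair-interaction free energy: e_ij - (e_ii + e_jj) / 2 (RT units)."""
    ideal = 0.5 * (e_ii + e_jj)
    return e_ij - ideal


# Diagonal contact free energies taken from Table I (RT units).
diagonal = {"Ile": -5.29, "Leu": -5.03, "Glu": -1.60, "Lys": -1.14}

# Selected off-diagonal entries taken from Table I (RT units).
pairs = {("Ile", "Leu"): -5.21, ("Leu", "Lys"): -2.74, ("Glu", "Lys"): -2.20}

excess = {
    pair: excess_free_energy(e_ij, diagonal[pair[0]], diagonal[pair[1]])
    for pair, e_ij in pairs.items()
}
# Ile-Leu (both hydrophobic): about -0.05 RT
# Leu-Lys (hydrophobic with hydrophilic): about +0.35 RT
# Glu-Lys (opposite charges): about -0.83 RT
```

These three cases follow the pattern discussed in the text: a near-ideal hydrophobic pair, a positive excess for a mixed pair, and a strongly negative excess for oppositely charged side chains.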


the terminology of Godzik et al.59), rather than to a completely unarranged polypeptide chain (the U state59), which is the reference state in the Tanaka–Scheraga,8 MJ,9 and our approach. In the Kolinski–Skolnick14 and Gregoret–Cohen10 potentials, the nonspecific grouping of side chains into the hydrophobic core and hydrophilic exterior is accounted for by one-body centrosymmetric potentials, whereas in our approach it is encoded in the side-chain pair potentials.

According to Miyazawa and Jernigan,9 the quantities $0.5\, q_i e_i$, where $q_i$ is the coordination number of a residue of type i and $e_i = \sum_{j=1}^{20} N_{ij} e_{ij} / \sum_{j=1}^{20} N_{ij}$ is the average contact free energy of a residue of type i, can be regarded as hydrophobicities of the corresponding types of residues. Therefore, we correlated these quantities with side-chain hydrophobicities determined by Fauchère and Pliška, who measured the partition coefficients of amino acids between n-octanol and water,60 and obtained the following correlation equation:

$$ -RT \times (0.5\, q_i e_i) = 1.60\,(0.15) \times [RT \ln(10)\, \pi_i] - 10.50\,(0.23); \quad R = 0.9278 \qquad (23) $$

where T = 298 K and $\pi_i$ is the contribution of the side chain of type i to the logarithm of the partition coefficient between n-octanol and water, as determined by Fauchère and Pliška.60 The correlation graph is shown in Figure 5.

A similar correlation also holds with the diagonal contact free energies, $e_{ii}$:

$$ -RT e_{ii} = 0.528\,(0.046) \times [RT \ln(10)\, \pi_i] - 1.197\,(0.070); \quad R = 0.9372 \qquad (24) $$

The correlation with other hydrophobicity scales derived on the basis of thermodynamic data, for example, those of Nozaki and Tanford,68 is worse (with R = 0.8019). The correlation is also worse (R = 0.8518) when the contact free energies obtained with $R_c$ = 6.5 Å are used. The latter fact is understandable in view of the smaller number of contacts and therefore poorer statistics, especially for hydrophilic pairs.

The slopes of eqs. (23) and (24) were used to estimate the "true" free energies of contacts implemented in the sum of squares given by eq. (13). As in our earlier work,38,39 we considered the residue contact free energies to be composed of the parts due to side-chain–side-chain and backbone–backbone interaction. To obtain the peptide-group–peptide-group interaction free energy for any amino acid, we subtract from $e_{GlyGly}$ (obtained from the PDB) the contribution of the CH$_2$ group of Gly. We take the latter as $-0.528\, RT \ln(10)\, \pi_{CH_2}$ from eq. (24), with $\pi_{CH_2}$ = 0.41 (Ref. 60), which can be considered as an estimate of the glycine "side-chain–side-chain" contact free energy. Then, we rescale our contact free energies by introduction of the factor 1.60 of eq. (23) for nonproline residues. Further, because Pro has no backbone NH donor group, we have to reduce the corresponding estimate of the peptide-group–peptide-group interaction free energy by a factor39 $f_{Pro}$ or $f_{ProPro}$. Thus, the estimates of experimental contact free energies can be expressed by eq. (25):

$$ \hat{F}_{ij} = \frac{RT}{1.60} \times \begin{cases} e_{ij} - (e_{GlyGly} + 0.528 \ln(10)\, \pi_{CH_2}), & \text{if both } i \text{ and } j \ne \text{Pro} \\ e_{ij} - f_{Pro}\,(e_{GlyGly} + 0.528 \ln(10)\, \pi_{CH_2}), & \text{if only one of } i \text{ or } j = \text{Pro} \\ e_{ij} - f_{ProPro}\,(e_{GlyGly} + 0.528 \ln(10)\, \pi_{CH_2}), & \text{if both } i \text{ and } j = \text{Pro} \end{cases} \qquad (25) $$

Finally, it should be noted that the computed contact free energies are additive to a good approximation, which is reflected in the following correlation equation:

$$ e_{ij} = 1.050\,(0.020)\,(e_{ii} + e_{jj})/2 + 0.072\,(0.068); \quad R = 0.9669 \qquad (26) $$

in which the slope and intercept do not differ significantly from 1 and 0, respectively. The quantity $(e_{ii} + e_{jj})/2$ is called the ideal pair-interaction free energy, while the quantity $e_{ij} - (e_{ii} + e_{jj})/2$ is called the excess pair-interaction free energy.59 Equation (26), together with eq. (24), can serve to estimate the interresidue contact free energies of non-natural amino acids which do not occur in the data base of the structures of known proteins, but for which the water–n-octanol transfer free energies can be measured directly or estimated from QSAR equations. It must be noted, however, that, for quite a number of side-chain pairs, there are several outliers that depart from eq. (26) by more than the standard deviation; this is illustrated in


Figure 6. As shown, the excess free energy is less than the standard deviation from the ideal free energy only when both side chains in a pair are hydrophobic, both are hydrophilic with zero net charge, or both have charges of the same sign. Therefore, eqs. (24) and (26) should, in principle, be applicable only to such pairs of side chains. For pairs composed of one hydrophobic and one hydrophilic side chain the excess free energy is positive, which means that making such a contact is less favorable than predicted by the ideal contact free energy. This is understandable because, for example, for hydrophilic side chains bearing hydrogen-bonding groups, making a contact with a nonpolar side chain, as opposed to making a contact with another hydrogen-bonding group, results in breaking of hydrogen bonds. The excess free energies of the interaction of charged side chains of opposite signs are strongly negative, about −0.7 RT, which can be explained in terms of the formation of salt bridges. Finally, some small negative excess contact free energies occur for pairs composed of side chains with carboxyamide groups and positively charged side chains.

It should also be noted that the Arg–Arg contact free energy is considerably more negative than the contact free energies of the other pairs of residues with equal charges: Lys–Lys, Asp–Asp, and Glu–Glu. This is consistent with the results of the work of Magalhaes et al.,69 in which it was demonstrated that water bridges constitute an important factor stabilizing the spatially close configurations of the charged guanidino groups of the arginine side chains. Because the other charged residues do not possess so many groups capable of forming hydrogen bonds with water, the exceptionally low value of the Arg–Arg contact free energy is understandable.

FIGURE 5. Correlation between side-chain hydrophobicities calculated from contact energies (abscissae) and those determined by Fauchère and Pliška,60 as given by eq. (23).

FIGURE 6. A diagram showing the pairs of side chains with large excess free energies computed from the data in Table I. The one-letter code of amino-acid residues is used. The numbers in each box are the integers of the ratio of the excess contact free energy to the root-mean-square excess contact free energy, $\sqrt{\langle e^2_{excess} \rangle}$ = 0.23 RT [the standard deviation from the ideal contact free energy in eq. (26)].

DETERMINATION OF PARAMETERS AND STATISTICAL EVALUATION OF THE FIT OF POTENTIALS TO EXPERIMENTAL DATA

The sum of the squares defined by eq. (13) was minimized for all five potentials given by eqs. (2)–(8). The angular and radial-angular terms were not considered in determining the parameters of radial-only potentials, namely LJ and LJK [eqs. (3) and (4)]. We used the Marquardt algorithm,70 which is especially designed for minimizing sums of squares. This method requires only the first derivatives of the components of the sums of squares, from which both the gradient and the positive-definite approximation to the Hessian matrix are constructed.

Numerical integration to calculate the average correlation functions and free energies was carried out with step sizes of $\delta\Delta$ = 0.25 Å, $\delta\vartheta = \pi/24$, and $\delta\varphi = \pi/6$, by taking the value of the function in the center of the respective bin as the average


value of the function in the bin. These step sizes were chosen as a compromise between computational efficiency and the error caused by too coarse a grid in the integration. Use of a finer grid resulted in differences in free energies and histogram values of less than 1%. To increase computational efficiency, minimization was first carried out with a coarse grid (i.e., $\delta\Delta = \Delta r$ = 0.5 Å, $\delta\vartheta = \delta\varphi = \pi/6$) and then completed (starting from the computed parameters) using a finer grid ($\delta\Delta$ = 0.125 Å, $\delta\vartheta = \pi/24$, and $\delta\varphi = \pi/6$).

The starting values of $\epsilon^{\circ}$ were the free energies of contacts calculated from eq. (25). The values of $\sigma^{\circ}$ in the LJK, GB, and GBV potentials and the values of $r^{\circ}$ in the LJ potential were initially assigned half the side-chain van der Waals distances calculated by Levitt.2 The initial values of $r^{\circ}_{ij}$ in the GBV and LJK potentials were 1.3 Å, this being the approximate van der Waals radius of the "outer" atoms of the side chains. For the anisotropic parameters $(\sigma_{\parallel}/\sigma_{\perp})^2$ and $\chi'$, one choice was based on the ratio of the long and the geometric mean of the shorter principal axes of the moments of inertia of the side chains calculated by averaging their geometry, and another start was from isotropic potentials. Both gave the same final results.

Based on eqs. (17), the final estimated ratios of the weights of eq. (13) were $w^{r} : w^{\theta\phi} : w^{r\theta} : w^{F}$ = 1:20:20:20 for all models.

Equation (13) contains pair-specific and single-residue-specific terms. We initially made trial runs by assuming that all the parameters are pair-specific; that is, we avoided the relations in eqs. (5), (9), and (10). However, for most of the side-chain pairs, the results were unreasonable, with the standard deviations exceeding the parameter values. Therefore, we decided to use eqs. (5), (9), and (10) to express all the constants except $\epsilon_{ij}$ in terms of single-body constants.

The fit of the functional forms of the potentials considered in this work to experimental data is compared in terms of the F-test71 in Table II for the radial and anisotropic potentials, respectively. In the case of the radial potentials, the LJK model (shifted Lennard–Jones) appears clearly superior to the simple LJ model, the level of significance of introducing the "shift" parameters $r^{\circ}$ being close to 100%. A similar situation occurs for the anisotropic potential where the GBV functional

TABLE II.
Comparison of Fit of Various Radial and Anisotropic Potentials to Experimental Data.

Potential   $\Phi$ (a)   $\Phi^{\theta\phi}$   $\Phi^{r\theta}$   $\Phi^{r}$   $\Phi^{F}$ × 1000   p (b)   $\Delta\Phi$ (c)   F (d)

Radial potentials
LJK          0.39        –         –         0.383     0.233     250     –         –
LJ           1.35        –         –         1.340     0.433     230     0.960     468

Anisotropic potentials
GBV          9.26        0.223     0.220     0.380     1.053     310     –         –
LJK (e)     10.23        0.230     0.261     0.384     0.952     250     0.975     871
GB           9.87        0.222     0.247     0.465     1.031     290     0.615     549
BP           9.75        0.222     0.241     0.462     1.625     290     0.493     –

(a) $\Phi$, $\Phi^{\theta\phi}$, $\Phi^{r\theta}$, $\Phi^{r}$, $\Phi^{F}$ are defined in eq. (13). For the LJ and LJK (radial) potentials $\Phi = w^{r}\Phi^{r} + w^{F}\Phi^{F}$; for the remaining (anisotropic) potentials $\Phi = w^{\theta\phi}\Phi^{\theta\phi} + w^{r\theta}\Phi^{r\theta} + w^{r}\Phi^{r} + w^{F}\Phi^{F}$, where $w^{r}$ = 1 and $w^{\theta\phi} = w^{r\theta} = w^{F}$ = 20 have been assigned based on the ratios of the respective residual variances (see text).
(b) The number of adjustable parameters.
(c) The difference between the fit corresponding to the potential giving an inferior fit to experimental data and that of the potential giving the best fit (i.e., LJK for the radial and GBV for the anisotropic potential).
(d) The F-test value to compare the goodness of fit of the inferior potential with that of the best one is:

$$ F_i = \frac{N - p_i}{p^* - p_i} \cdot \frac{\Phi(X_i) - \Phi(X^*)}{\Phi(X^*)} $$

where N is the number of terms in the expression for $\Phi$; N = 3451 for the radial and 165,487 for the anisotropic potentials, $X^* = (x_1^*, x_2^*, \ldots, x_{p^*}^*)^T$ is the vector of the parameters of the best model, and $X_i = (x_{i;1}, x_{i;2}, \ldots, x_{i;p_i})^T$ is the vector of the parameters of the ith inferior model, $p_i < p^*$ (see ref. 70). With the large values of N taken in this study, the best-fitting potentials are effectively different from the inferior ones at the 100% significance level. Because the model with the BP potential is not nested in the model with the GBV potential, the F-test value is not given in this case.
(e) The whole sum of squares was minimized, but with the radial LJK potential, which can be considered as the GBV potential devoid of the angular terms.
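The arithmetic behind Table II can be retraced with a short Python sketch. It is illustrative only and rests on stated assumptions: the function names are hypothetical; the total number of radial bins (3241) is inferred from N = 3451 minus the 210 free-energy terms; and the bin counts $n_\theta = n_\phi = 6$ are an assumption chosen because they reproduce N = 165,487 when the terms of eq. (13) are counted directly (one term per radial bin, $n_\theta^2$ radial-angular terms per radial bin, $n_\theta^2 n_\phi$ angular terms per pair, and one free-energy term per pair).

```python
def total_phi(phi_r, phi_f, phi_ang=0.0, phi_rad_ang=0.0, w_r=1.0, w_other=20.0):
    """Weighted sum of squares with the weight ratios 1:20:20:20 (footnote a)."""
    return w_r * phi_r + w_other * (phi_ang + phi_rad_ang + phi_f)


def f_ratio(n_terms, p_inferior, p_best, phi_inferior, phi_best):
    """F-test value of footnote d for an inferior model nested in the best one."""
    dof_ratio = (n_terms - p_inferior) / (p_best - p_inferior)
    return dof_ratio * (phi_inferior - phi_best) / phi_best


def count_terms(sum_n_r, n_theta=0, n_phi=0, n_pairs=210):
    """Number of terms in eq. (13); n_theta = 0 gives the radial-only count."""
    radial = sum_n_r
    angular = n_pairs * n_theta ** 2 * n_phi
    radial_angular = sum_n_r * n_theta ** 2
    free_energy = n_pairs
    return radial + angular + radial_angular + free_energy


# Values from Table II (Phi^F is tabulated multiplied by 1000).
phi_gbv = total_phi(0.380, 1.053e-3, phi_ang=0.223, phi_rad_ang=0.220)
phi_ljk = total_phi(0.383, 0.233e-3)

# GB versus GBV, using the tabulated totals and their difference of 0.615.
f_gb = f_ratio(165487, p_inferior=290, p_best=310,
               phi_inferior=9.26 + 0.615, phi_best=9.26)

n_radial = count_terms(3241)                            # inferred bin total
n_anisotropic = count_terms(3241, n_theta=6, n_phi=6)   # assumed bin counts
```

The GB entry of the F column (549) is recovered this way; the other F entries depend on rounding and on the parameter-count difference used, so they are not asserted here.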


form, which allows for free "shifting" of r in eq. (7), appears superior to both the BP and GB forms (however, BP is a model not nested in GBV and, therefore, we did not carry out its statistical comparison in terms of the F-test). Again, the significance level of introducing the new class of parameters $r^{\circ}$, when passing from GB to GBV, is effectively 100%. Of the two potentials with fewer parameters, the BP form gives a better fit to the experimental data. From Table II, it also follows that the LJK potential gives a significantly poorer fit to the data that contain both radial and angular terms, that is, the angular terms are statistically significant.

The lesser adequacy of the GB form, compared with the BP and GBV ones, also follows from the plot of the residuals in the contact free energies shown in Figure 7. For the GB potential, all residuals show an apparent trend: the negative ones occur for negative and positive ones for positive contact free energies. This trend is partially eliminated for the BP potential and remains only for strongly hydrophilic pairs in the case of the GBV potential. The last observation may indicate that none of the models is fully adequate for hydrophilic pairs. On the other hand, such pairs interact weakly and therefore this should not cause great concern.

Sample theoretical and experimental histograms of the radial and angular correlation functions (the latter being averaged for visualizing purposes over the dihedral angle $\phi$) corresponding to the Leu–Leu pair are shown in Figure 8A and 8B.

To test how the parameters of the side-chain interaction potentials depend on the choice of the data base, we evaluated the parameters of the

FIGURE 7. Plots of weighted residuals of the contact energies corresponding to the BP (crosses), GB (squares), and
GBV (diamonds) potentials.


FIGURE 8. Sample calculated (dashed line and dashed surface) and experimental (solid line and solid surface) histograms of radial (A) and angular (B) correlation functions for the Leu–Leu pair. For visualizing, the histograms of the angular correlation functions are averaged over the dihedral angle $\phi$.

GBV potential, using the set of 42 protein structures of Miyazawa and Jernigan9; the GBV potential contains the greatest number of adjustable parameters and should, therefore, be the most sensitive to data-base selection. (In the section "Contact Free Energies," we have already shown that the contact free energies determined from our data base of 195 protein structures are in very good agreement with those determined by Miyazawa and Jernigan.) For the values of $\epsilon^{\circ}$ of eq. (6), which range from −12 to +1.6, the correlation coefficient was 0.8569 and the mean-square difference was 0.3 kcal/mol. For other parameters for which the range is not so extensive, correlation coefficients of approximately 0.8 were obtained. In view of the fact that the two data bases have almost no structure in common and the MJ data base is much smaller than ours, the parameters of the GBV potential determined from the two data bases are reasonably consistent.

DISCUSSION OF COMPUTED PARAMETERS

The computed values of $\epsilon^{\circ}_{ij}$ and the single-body parameters of eq. (6) and their standard deviations for the GBV side-chain interaction potential considered in this work are given in Tables IIIa and IIIb. The parameters for the other four simpler potential functions are included in Tables 2a, 2b to 5a, 5b of the Supplementary Material, which also contains the parameters for all five potentials in machine-readable form.

Except for $\epsilon^{\circ}_{LysLys}$ of the GBV model, and the constants $\chi'$ and $\alpha$, the parameters are well determinable and significantly greater than their standard deviations. For hydrophobic residues, the values of the well-depth anisotropy $\chi'$ are small, although the anisotropy measures of the van der Waals radii $\sigma_{\parallel}/\sigma_{\perp}$ are significantly different from 1.0. Well-depth anisotropies appear significant for neutral and hydrophilic residues.

It is interesting to compare the computed parameters with contact free energies and estimates of the van der Waals radii and anisotropies that can be determined from the geometrical characteristics of the side chains. Such a comparison is presented for the GBV model in Figure 9A–D. As shown, the contact free energies of hydrophobic pairs (for which $\epsilon_{ij} > 0$) correlate quite well with their van der Waals well depths determined by minimizing $\Phi$ of eq. (13). The values of $r^{\circ;L}$, used as the van der Waals distances in our earlier work, correlate with the values of $\sigma^{\circ}$, with the definite exception of aromatic residues and arginine (Fig. 9B). A similar situation occurs when the values corresponding to the LJK potential are taken into account. The correlation is even better (with the exception of Lys) when the values of $\sigma_{\parallel}$ calculated from $\sigma^{\circ}$ and the ratio $\sigma_{\parallel}/\sigma_{\perp}$ are considered (Fig. 9C). On the other hand, there is no correlation between the values of $r^{\circ}$ from our earlier work and the constants $r^{\circ}$ of eq. (7).

There is no correlation between the parameters and the ratio of the long to the short axes of the side chains determined by diagonalizing the average matrices of the moments of inertia determined from the PDB. Thus, estimating these parameters based on the average dimensions of a side chain is incorrect. On the other hand, it is interesting to note that anisotropy parameters correlate with the


TABLE IIIa.
Calculated Values of $\varepsilon^0_{ij}$ (Kilocalories per Mole) for the GBV Potential (Diagonal and Lower Triangle) and Their Standard Deviations (Upper Triangle; Last Line, Labeled SD, Contains the Standard Deviations of the Diagonal Constants).

       Cys    Met    Phe    Ile    Leu    Val    Trp    Tyr    Ala    Gly    Thr    Ser    Gln    Asn    Glu    Asp    His    Arg    Lys    Pro

Cys   1.05   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.03   0.02   0.02   0.02   0.02   0.02   0.02   0.02
Met   1.26   1.45   0.01   0.01   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.03   0.03   0.03   0.02   0.02   0.02   0.03   0.02
Phe   1.19   1.34   1.27   0.01   0.01   0.01   0.02   0.01   0.01   0.01   0.01   0.01   0.02   0.02   0.01   0.01   0.02   0.01   0.01   0.01
Ile   1.30   1.47   1.41   1.58   0.01   0.01   0.02   0.01   0.01   0.01   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.01
Leu   1.25   1.51   1.40   1.59   1.55   0.01   0.02   0.01   0.01   0.01   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.01
Val   1.17   1.38   1.31   1.52   1.50   1.40   0.02   0.01   0.01   0.01   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.01
Trp   0.99   1.17   1.15   1.21   1.18   1.10   0.97   0.02   0.02   0.02   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.02
Tyr   0.92   1.15   1.05   1.22   1.18   1.04   0.87   0.81   0.01   0.01   0.01   0.01   0.02   0.02   0.02   0.02   0.02   0.02   0.02   0.01
Ala   0.98   1.20   0.99   1.24   1.26   1.19   0.77   0.81   1.02   0.01   0.01   0.01   0.03   0.02   0.02   0.02   0.02   0.02   0.02   0.01
Gly   0.98   1.03   0.84   1.06   1.13   1.01   0.71   0.72   0.82   0.56   0.01   0.01   0.02   0.02   0.01   0.02   0.02   0.02   0.00   0.01
Thr   0.80   0.91   0.76   0.98   0.90   0.87   0.56   0.58   0.64   0.55   0.43   0.01   0.02   0.02   0.02   0.01   0.02   0.02   0.00   0.01
Ser   0.77   0.86   0.68   0.90   0.93   0.83   0.51   0.52   0.59   0.47   0.45   0.28   0.02   0.02   0.01   0.02   0.02   0.02   0.00   0.01
Gln   0.81   1.02   0.72   0.95   0.98   0.85   0.59   0.63   0.75   0.33   0.35   0.26  -0.28   0.06   0.03   0.02   0.03   0.04   0.01   0.02
Asn   0.73   0.95   0.70   0.87   1.00   0.89   0.60   0.60   0.77   0.49   0.38   0.38   0.53   0.66   0.01   0.04   0.03   0.05   0.00   0.02
Glu   0.64   0.81   0.53   0.88   0.79   0.72   0.52   0.51   0.47  -0.06   0.20   0.04  -0.23  -0.02  -1.58   0.10   0.03   0.04   0.04   0.02
Asp   0.68   0.64   0.52   0.79   0.68   0.62   0.53   0.57   0.51   0.23   0.29   0.12  -0.12   0.27  -0.93  -0.66   0.03   0.03   0.05   0.02
His   0.91   1.05   0.92   0.94   0.98   0.83   0.82   0.76   0.65   0.56   0.57   0.49   0.38   0.59   0.42   0.62   0.80   0.03   0.00   0.02
Arg   0.58   0.87   0.69   0.96   0.95   0.75   0.66   0.67   0.53   0.38   0.43   0.40   0.36   0.33   1.01   1.00   0.49  -0.02   0.09   0.02
Lys   0.59   0.81   0.55   0.96   0.97   0.85   0.50   0.62   0.68  -0.01   0.00  -0.01  -0.02   0.00   1.30   1.09  -0.01  -0.48 -11.96   0.02
Pro   0.82   0.95   0.81   0.98   1.00   0.92   0.77   0.79   0.74   0.69   0.57   0.58   0.62   0.62   0.42   0.42   0.61   0.53   0.56   0.82

SD    0.03   0.03   0.02   0.01   0.01   0.01   0.03   0.02   0.02   0.02   0.01   0.02   0.07   0.10   0.25   0.09   0.04   0.01    18.   0.02
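
Table IIIa packs two symmetric matrices into one square array: the values on and below the diagonal, the standard deviations above it, with the diagonal standard deviations in a separate last line. The following minimal Python sketch shows one way such a packed array could be unpacked. The function name `unpack_packed_table` and the 3 × 3 example (the Cys, Met, and Phe block of the table) are illustrative assumptions and not part of the original work.

```python
def unpack_packed_table(packed, diag_sd):
    """Split a packed square table into symmetric value and SD matrices.

    packed[i][j] with i >= j holds the value for pair (i, j);
    packed[i][j] with i < j holds the standard deviation for pair (i, j);
    diag_sd[i] holds the standard deviation of the diagonal value (i, i).
    """
    n = len(packed)
    values = [[0.0] * n for _ in range(n)]
    sds = [[0.0] * n for _ in range(n)]
    for i in range(n):
        values[i][i] = packed[i][i]
        sds[i][i] = diag_sd[i]
        for j in range(i):
            # Lower triangle (row i, column j) stores the value.
            values[i][j] = values[j][i] = packed[i][j]
            # Upper triangle (row j, column i) stores the matching SD.
            sds[i][j] = sds[j][i] = packed[j][i]
    return values, sds


# Cys, Met, Phe block of Table IIIa (kcal/mol) and its diagonal SDs.
packed = [
    [1.05, 0.02, 0.02],
    [1.26, 1.45, 0.01],
    [1.19, 1.34, 1.27],
]
diag_sd = [0.03, 0.03, 0.02]
values, sds = unpack_packed_table(packed, diag_sd)
print(values[0][1], sds[0][1])  # Cys-Met value and its standard deviation
```

After unpacking, both matrices can be indexed in either order, so `values[0][1]` and `values[1][0]` refer to the same Cys-Met entry.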


TABLE IIIb.
Calculated Values of Single-Body Parameters of the GBV Potential. (Standard Deviations in Parentheses.) Except for $\sigma^0$ and $r^0$, All Quantities Are Dimensionless.

Residue   $\sigma^0$ (Å)   $r^0$ (Å)   $(\sigma_\parallel/\sigma_\perp)^2$   $\chi'$   $\alpha$

Cys   2.3204 (0.0382)   5.7866 (0.2123)   2.6006 (0.2042)  -0.0025 (0.0155)   0.0299 (0.0070)
Met   2.4984 (0.0237)   3.5449 (0.1299)   4.4303 (0.2018)   0.0968 (0.0108)   0.0878 (0.0058)
Phe   2.2823 (0.0245)   6.3367 (0.1459)   3.9640 (0.1815)   0.0491 (0.0077)   0.0801 (0.0039)
Ile   2.5919 (0.0150)   4.4859 (0.0860)   3.2406 (0.0926)   0.0897 (0.0061)   0.0664 (0.0031)
Leu   2.8905 (0.0098)   3.3121 (0.0548)   2.3636 (0.0406)   0.0749 (0.0047)   0.1108 (0.0027)
Val   2.7251 (0.0125)   3.7770 (0.0629)   2.0347 (0.0514)   0.0770 (0.0062)   0.0679 (0.0032)
Trp   1.6947 (0.0644)   9.2904 (0.3832)   7.5089 (1.0060)   0.0731 (0.0170)   0.0549 (0.0076)
Tyr   2.1346 (0.0271)   4.8607 (0.1529)   5.9976 (0.3578)   0.1177 (0.0134)   0.0438 (0.0064)
Ala   2.4366 (0.0100)   2.1574 (0.0423)   1.8090 (0.0396)   0.0333 (0.0077)   0.1052 (0.0041)
Gly   2.3359 (0.0169)   2.5197 (0.0532)   1.0429 (0.0498)   0.2238 (0.0127)  -0.1277 (0.0073)
Thr   2.6047 (0.0188)   3.0723 (0.0996)   2.2451 (0.0899)   0.0236 (0.0162)  -0.0264 (0.0075)
Ser   2.4471 (0.0203)   2.2432 (0.0770)   1.6795 (0.0749)  -0.0029 (0.0184)  -0.0348 (0.0083)
Gln   2.6269 (0.0229)   1.1813 (0.1189)   2.6172 (0.1455)   0.2960 (0.0305)   0.0505 (0.0165)
Asn   2.6954 (0.0165)   0.7634 (0.0826)   2.0433 (0.0850)   0.2732 (0.0286)  -0.0064 (0.0152)
Glu   2.5933 (0.0191)   1.2819 (0.0874)   2.5707 (0.1327)   0.4904 (0.0332)  -0.0266 (0.0175)
Asp   2.5098 (0.0192)   1.4061 (0.0804)   1.9262 (0.0925)   0.3090 (0.0299)   0.0250 (0.0164)
His   2.3409 (0.0323)   3.3570 (0.1817)   3.6263 (0.2703)   0.1351 (0.0245)   0.0589 (0.0124)
Arg   2.3694 (0.0214)   1.8119 (0.1201)   6.6061 (0.3758)   0.2624 (0.0270)   0.0062 (0.0130)
Lys   2.7249 (0.0161)   0.2712 (0.0913)   8.0078 (0.2948)   0.5790 (0.0364)   0.0115 (0.0161)
Pro   2.7230 (0.0228)   3.3320 (0.1059)   1.7905 (0.0759)  -0.1105 (0.0160)  -0.0190 (0.0065)

dimensions of the side chains; larger side chains are more likely to exhibit more pronounced anisotropy (Fig. 9D).

Finally, it should be noted that, in the case of the LJK and GBV potentials, for many of the side chains, the constants $r^0$ exceed $\sigma^0$ (see Table IIIb and Table 3b of the Supplementary Material). Particularly large $r^0$ values occur for the aromatic residues, which exhibit broad radial distributions. This means that the potential will rarely tend to infinity as side-chain separation approaches zero. This does not seem to be the result of inadequacy of the fitting procedure, because we included the regions in which the radial-correlation function is zero. We have also carried out additional trial runs by assuming lower exponents in eq. (2) than 6 and 12, which results in broadening the potential wells. However, we still obtained $r^0 > \sigma^0$ for most of the side chains. Thus, to use the LJK and GBV potentials in simulations, in these two cases we changed the general form of the potential given by eq. (2) to:

\[
U_{ij} = 4\left[\,|\varepsilon_{ij}|\,x_{ij}^{12} - \varepsilon_{ij}\,x_{ij}^{6}\right] + 4\,\varepsilon^0_{ij}\,h\!\left(r^0_{ij} - \sigma^0_{ij}\right)\left(\frac{\tfrac{1}{2}\sigma^0_{ij}}{r_{ij}}\right)^{12} \tag{27}
\]

where $x_{ij}$ is defined by eqs. (4) and (7) for the LJK and GBV potentials, respectively, and $h(y)$ is the step function of $y$; $h(y) = 0$ for $y \le 0$ and $h(y) = 1$ otherwise.

Thus, the modified expression includes a short-range "repulsive core" potential which prevents the collapse of the side chains. Introduction of this repulsive core does not impair the fit of the potentials to the experimental data.

Conclusions

We have parameterized several functional forms for the potential of mean force of side-chain–side-chain interactions that are based on reasonable site–site interaction potentials used in molecular simulations. The parameters of the potentials have been determined consistently by fitting the energy expressions to the correlation functions and contact free energies obtained from high-resolution protein crystal data. Compared to related work on deriving the mean-field potentials from protein-crystal data (refs. 24-31, 34-36), our approach has two new features: inclusion of anisotropy of the free-energy surface, and explicit use of thermodynamic data to rescale the free energies of contacts [eq. (25)]. The

FIGURE 9. (A) Correlation between the hydrophobicity-scaled contact free energies, $F_{ij}$ (abscissae), and the corresponding van der Waals well depths, $\varepsilon^0_{ij}$, corresponding to the GBV potential (ordinates). The straight line is the least-squares line calculated for the "definitely hydrophobic" pairs with $\varepsilon_{ij} \ge 0$ kcal/mol; its equation is $\varepsilon = -0.865(0.040)\,F + 0.425(0.022)$; $R = -0.8417$. (B) Correlation between the mean radii of side-chain contacts derived by Levitt (ref. 2) (abscissae) with the computed values of $\sigma^0$ of the GBV potential. After eliminating five apparent outliers: Phe, Tyr, Trp, His (aromatic side chains), and Arg, the equation is $\sigma^0 = 0.1552(0.041)\,r^0_L + 1.72(0.23)$; $R = 0.7254$. (C) Correlation between Levitt's mean side-chain contact distances and the values of $\sigma_\parallel$ calculated from the parameters of the GBV potential. After eliminating lysine, the equation is $\sigma_\parallel = 0.89(0.13)\,r^0_L - 0.94(0.75)$; $R = 0.8619$. (D) Correlation between Levitt's mean side-chain contact distances and the values of $\sigma_\parallel/\sigma_\perp$ corresponding to the GBV potential. After eliminating lysine, the equation is $\sigma_\parallel/\sigma_\perp = 0.463(0.072)\,r^0_L - 0.97(0.43)$; $R = 0.8417$.
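
The modified potential of eq. (27) can be sketched numerically as follows. This is only an illustration under stated assumptions: the reduced distance `x` is taken as an input because eqs. (4) and (7) are defined elsewhere in the article, and the repulsive-core term is assumed to have the form $4\varepsilon^0_{ij}\,h(r^0_{ij} - \sigma^0_{ij})\,(\tfrac{1}{2}\sigma^0_{ij}/r_{ij})^{12}$. The function and argument names are hypothetical.

```python
def step(y):
    """Step function h(y): 0 for y <= 0 and 1 otherwise."""
    return 0.0 if y <= 0 else 1.0


def modified_potential(eps, x, eps0, r0, sigma0, r):
    """Side-chain pair energy in the spirit of eq. (27).

    eps    : well depth eps_ij (may be negative)
    x      : reduced distance x_ij, computed elsewhere from eq. (4) or (7)
    eps0   : constant eps0_ij
    r0     : constant r0_ij
    sigma0 : constant sigma0_ij
    r      : side-chain separation r_ij (same length unit as sigma0)
    """
    # General 12-6 form of eq. (2).
    lj_like = 4.0 * (abs(eps) * x**12 - eps * x**6)
    # ASSUMED form of the repulsive core; it is switched on only if r0 > sigma0.
    core = 4.0 * eps0 * step(r0 - sigma0) * (0.5 * sigma0 / r) ** 12
    return lj_like + core


# Core inactive (r0 <= sigma0): only the 12-6 part contributes.
print(modified_potential(eps=1.0, x=1.0, eps0=1.0, r0=2.0, sigma0=3.0, r=4.0))
# Core active (r0 > sigma0): the energy grows steeply as r decreases.
print(modified_potential(eps=1.0, x=1.0, eps0=1.0, r0=3.0, sigma0=2.0, r=0.5))
```

Because the core term scales as $r_{ij}^{-12}$, it is negligible at typical contact distances and dominates only when two side chains nearly overlap, which is consistent with the statement that it does not impair the fit.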


second feature enables us to consider the computed free energies of side-chain interactions as absolute values that can be compared directly with experimental data and/or the results of calculations with all-atom potentials (including hydration).

The choice as to which potential to use in simulations should be based on the balance between the accuracy of the representation of free-energy surface and computational effort. Regarding these two issues, the potentials can be ordered as follows: GBV, BP and GB, LJK, LJ. The GBV potential (that includes angular dependence) most accurately represents the free-energy surface, but involves the greatest computational effort, whereas the LJ potential (radial-only) is the simplest, but least accurate representation of the free-energy surface, and should be used when the computation time is a significant issue.

Acknowledgments

This work was supported by Grant PB 190/T09/96/10 from the Polish State Committee for Scientific Research (KBN) (to A.L. and S.O.), by Grant AG 00322 from the National Institute on Aging (to S.R.), by Grant GM-14312 from the National Institute of General Medical Sciences, by Grant MCB95-13167 from the National Science Foundation (to H.A.S.), and by Grant CA 42500 from the National Cancer Institute (to M.R.P.). Computations were carried out with one processor of the IBM-SP2 computer at the Cornell National Supercomputer Facility, a resource of the Center for Theory and Simulation in Science and Engineering at Cornell University, which is funded by the National Science Foundation, New York State, the IBM Corporation, and members of its Corporate Research Institute, with additional funds from the National Institutes of Health.

Appendix: Definition and Calculation of Side-Chain Pair Correlation Functions from Protein-Crystal Data

Assume that we have a data base of $n_p$ protein structures. Let $n^{(2)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)}, \phi)$ denote the number density of pairs of side chains of types $i$ and $j$ at a distance $r$ and orientation defined by the angles $\theta^{(1)}$, $\theta^{(2)}$, and $\phi$ for protein $p$, all assumed to be at the same temperature (for brevity of notation, we omit the side-chain-pair subscripts $ij$ in the symbols of the variables throughout the Appendix). Because we cannot determine the actual density at a point from experimental data, instead we will consider $n^{(2)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)}, \phi; T)$ defined as the average number density in the bins $b_{klmn} = b(r_k, \theta^{(1)}_l, \theta^{(2)}_m, \phi_n) = [r_k - \Delta r/2,\ r_k + \Delta r/2] \times [\theta^{(1)}_l - \Delta\theta/2,\ \theta^{(1)}_l + \Delta\theta/2] \times [\theta^{(2)}_m - \Delta\theta/2,\ \theta^{(2)}_m + \Delta\theta/2] \times [\phi_n - \Delta\phi/2,\ \phi_n + \Delta\phi/2]$, $k = 1, 2, \ldots, n_r$, $l = 1, 2, \ldots, n_\theta$, $m = 1, 2, \ldots, n_\theta$, $n = 1, 2, \ldots, n_\phi$:

\[
n^{(2)}_{ij;p}\left(r_k, \theta^{(1)}_l, \theta^{(2)}_m, \phi_n\right) = \frac{\text{number of pairs of side chains of types } i \text{ and } j \text{ within } b_{klmn}}{\text{volume of } b_{klmn}} \tag{A-1}
\]

where $r_k = (k - 1/2)\Delta r$, $\theta^{(1)}_l = (l - 1/2)\Delta\theta$, $\theta^{(2)}_m = (m - 1/2)\Delta\theta$, $\phi_n = (n - 1/2)\Delta\phi$, $\Delta r = 0.5$ Å, $\Delta\theta = 30^\circ$, $\Delta\phi = 30^\circ$, $n_r^{ij} = \mathrm{int}(r_{ij}^{\max}/\Delta r)$, $n_\theta = \pi/\Delta\theta = 6$, $n_\phi = 2\pi/\Delta\phi = 12$, where int is the integer part of a number; the values of $r^{\max}$ are defined by eq. (12).

The pair number density $n^{(2)}$ can be decomposed into the pair correlation function for residues of types $i$ and $j$, $g_{ij}(r, \theta^{(1)}, \theta^{(2)}, \phi)$ (assumed to depend only on the types of the side chains and not on the protein in which they reside) and the reference-state pair number density $n^{(2,0)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)}, \phi)$, corresponding to a hypothetical chain with noninteracting side chains. Thus, given the actual and reference pair number densities that can be evaluated from the data base of protein structures, the pair correlation function can be calculated as follows:

\[
g_{ij}(r, \theta^{(1)}, \theta^{(2)}, \phi) = \frac{\sum_{p=1}^{n_p} w_p\, n^{(2)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)}, \phi)}{\sum_{p=1}^{n_p} w_p\, n^{(2,0)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)}, \phi)} \tag{A-2}
\]

where $w_p$ is the statistical weight of the $p$th protein in the sample; the choice of weights is discussed in the Results section [eq. (20)].
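
The weighted ratio of eq. (A-2) and the bin-center rule stated after eq. (A-1) can be illustrated with a short Python sketch for a single bin. The function names and the numerical inputs are hypothetical and are chosen only to show the arithmetic; they are not taken from the data base used in this work.

```python
def bin_center(k, width):
    """Center of the k-th bin (k = 1, 2, ...), e.g. r_k = (k - 1/2) * dr."""
    return (k - 0.5) * width


def pair_correlation(weights, actual, reference):
    """Weighted ratio of eq. (A-2) evaluated for one bin.

    weights[p]   : statistical weight w_p of protein p
    actual[p]    : pair number density of protein p in the bin
    reference[p] : reference-state pair number density of protein p in the bin
    """
    numerator = sum(w * n for w, n in zip(weights, actual))
    denominator = sum(w * n0 for w, n0 in zip(weights, reference))
    if denominator == 0.0:
        raise ValueError("reference density is zero in this bin")
    return numerator / denominator


# Two hypothetical proteins contributing to the same bin.
weights = [1.0, 0.5]
actual = [0.2, 0.4]
reference = [0.1, 0.2]
g = pair_correlation(weights, actual, reference)
print(bin_center(1, 0.5), bin_center(16, 0.5), g)
```

A value of `g` above 1 means that the pair occurs in the bin more often than in the reference state of noninteracting side chains; a value below 1 means that it occurs less often.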


Clearly, $g$ is the average pair correlation function corresponding to bin $b_{klmn}$:

\[
g_{ij}(r, \theta^{(1)}, \theta^{(2)}, \phi) = \frac{1}{\Delta V} \int_{r_-}^{r_+} \int_{\theta^{(1)}_-}^{\theta^{(1)}_+} \int_{\theta^{(2)}_-}^{\theta^{(2)}_+} \int_{\phi_-}^{\phi_+} g_{ij}(\varrho; \vartheta^{(1)}, \vartheta^{(2)}, \varphi)\, dV \tag{A-3}
\]

where:

\[
\begin{aligned}
r_- &= r - \Delta r/2, & r_+ &= r + \Delta r/2, \\
\theta^{(1)}_- &= \theta^{(1)} - \Delta\theta/2, & \theta^{(1)}_+ &= \theta^{(1)} + \Delta\theta/2, \\
\theta^{(2)}_- &= \theta^{(2)} - \Delta\theta/2, & \theta^{(2)}_+ &= \theta^{(2)} + \Delta\theta/2, \\
\phi_- &= \phi - \Delta\phi/2, & \phi_+ &= \phi + \Delta\phi/2, \\
dV &= \varrho^2 \sin\vartheta^{(1)} \sin\vartheta^{(2)}\, d\varrho\, d\vartheta^{(1)}\, d\vartheta^{(2)}\, d\varphi
\end{aligned}
\]

and:

\[
\begin{aligned}
\Delta V &= \tfrac{1}{3}\left(r_+^3 - r_-^3\right)\left(\cos\theta^{(1)}_- - \cos\theta^{(1)}_+\right)\left(\cos\theta^{(2)}_- - \cos\theta^{(2)}_+\right)\Delta\phi \\
&= \tfrac{1}{3}\,\Delta r^3\, \Delta\cos\theta^{(1)}\, \Delta\cos\theta^{(2)}\, \Delta\phi
\end{aligned}
\]

The limited number of available protein structures still makes it impossible to determine the pair correlation functions with reasonable accuracy. This is easily realized because, even the choice of a coarse grid of $\Delta r = 0.5$ Å, $\Delta\theta = \Delta\phi = 30^\circ$ with implementation of the symmetry of the hypersurface in $\phi$ (only the interval $[0^\circ, 180^\circ]$ needs to be considered) yields $16 \times 6 \times 6 \times 6 = 3456$ bins [according to eq. (12) we take a maximum 8-Å coordination sphere for a residue] for which the average correlation functions would have to be determined. Within this coordination sphere, we have at best about 5000 points per residue pair, which would mean an average of about 1.4 counts per bin. Therefore, in the fitting procedure we use the correlation functions averaged over some radial and angular variables, respectively:

\[
g^{r}_{ij}(r) = \frac{1}{8\pi\,\Delta r^3} \int_{r_-}^{r_+} \int_{0}^{\pi} \int_{0}^{\pi} \int_{0}^{2\pi} g_{ij}(\varrho; \vartheta^{(1)}, \vartheta^{(2)}, \varphi)\, dV = \frac{\sum_{p=1}^{n_p} w_p\, n^{(2,r)}_{ij;p}(r)}{\sum_{p=1}^{n_p} w_p\, n^{(0,2,r)}_{ij;p}(r)} \tag{A-4}
\]

\[
g^{\theta\phi}_{ij}(\theta^{(1)}, \theta^{(2)}, \phi) = \frac{3}{r_{\max}^3\, \Delta\cos\theta^{(1)}\, \Delta\cos\theta^{(2)}\, \Delta\phi} \int_{0}^{r_{\max}} \int_{\theta^{(1)}_-}^{\theta^{(1)}_+} \int_{\theta^{(2)}_-}^{\theta^{(2)}_+} \int_{\phi_-}^{\phi_+} g_{ij}(\varrho; \vartheta^{(1)}, \vartheta^{(2)}, \varphi)\, dV = \frac{\sum_{p=1}^{n_p} w_p\, n^{(2,\theta\phi)}_{ij;p}(\theta^{(1)}, \theta^{(2)}, \phi)}{\sum_{p=1}^{n_p} w_p\, n^{(0,2,\theta\phi)}_{ij;p}(\theta^{(1)}, \theta^{(2)}, \phi)} \tag{A-5}
\]

\[
g^{r\theta}_{ij}(r, \theta^{(1)}, \theta^{(2)}) = \frac{1}{2\pi\,\Delta r^3\, \Delta\cos\theta^{(1)}\, \Delta\cos\theta^{(2)}} \int_{r_-}^{r_+} \int_{\theta^{(1)}_-}^{\theta^{(1)}_+} \int_{\theta^{(2)}_-}^{\theta^{(2)}_+} \int_{0}^{2\pi} g_{ij}(\varrho; \vartheta^{(1)}, \vartheta^{(2)}, \varphi)\, dV = \frac{\sum_{p=1}^{n_p} w_p\, n^{(2,r\theta)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)})}{\sum_{p=1}^{n_p} w_p\, n^{(0,2,r\theta)}_{ij;p}(r, \theta^{(1)}, \theta^{(2)})} \tag{A-6}
\]

where $g^{r}$, $g^{\theta\phi}$, and $g^{r\theta}$ denote the correlation functions averaged over all angles ($\theta^{(1)}$, $\theta^{(2)}$, and $\phi$), $r$, and the rotation angle, $\phi$, respectively; we noticed that the dependence of the distribution function on $\phi$ is the weakest and, therefore, chose to average over $\phi$ to obtain a mixed radial and angular correlation function [eq. (A-6)]. Likewise, $n^{(2,r)}(r)$, $n^{(2,\theta\phi)}(\theta^{(1)}, \theta^{(2)}, \phi)$, and $n^{(2,r\theta)}(r, \theta^{(1)}, \theta^{(2)})$ denote the average number densities within $[r_-, r_+]$, $[\theta^{(1)}_-, \theta^{(1)}_+] \times [\theta^{(2)}_-, \theta^{(2)}_+] \times [\phi_-, \phi_+]$, and $[r_-, r_+] \times [\theta^{(1)}_-, \theta^{(1)}_+] \times [\theta^{(2)}_-, \theta^{(2)}_+]$, respectively.

The reference pair distribution functions must still be defined. We assume that they can be decomposed into the radial and angular part, and that the radial part can be expressed as a product of the Markovian factor, $M_{ij;p}(r)$, arising from the fact that the side chains are on a polypeptide chain (ref. 9), a factor accounting for the finite dimensions and nonuniform residue density, $T_{ij;p}(r)$ (ref. 72), of protein molecules, and the "background" angular distribution function, $\Omega_{ij}(\theta^{(1)}, \theta^{(2)}, \phi)$, as given by eqs. (A-7)-(A-9).

\[
n^{(0,2,r)}(r) = \sum_{p=1}^{n_p} w_p\, M_{ij;p}(r)\, T_{ij;p}(r) \tag{A-7}
\]

\[
n^{(0,2,\theta\phi)}(\theta^{(1)}, \theta^{(2)}, \phi) = N_{ij}(r \le R_c)\, \Omega(\theta^{(1)}, \theta^{(2)}, \phi) \tag{A-8}
\]

\[
n^{(0,2,r\theta)}(r, \theta^{(1)}, \theta^{(2)}) = \frac{1}{2\pi}\, M_{ij;p}\, T_{ij;p}(r) \int_{0}^{2\pi} \Omega(\theta^{(1)}, \theta^{(2)}, \varphi)\, d\varphi \tag{A-9}
\]

In eq. (A-8), $N_{ij}(r \le R_c) = \sum_{p=1}^{n_p} w_p\, n_{ij;p}(r \le R_c)$ is the weighted total number of side chains of types $i$ and $j$ in the protein-structure data base, which are separated by at least 10 peptide groups and whose distance is less than the assumed radius, $R_c$, of the coordination sphere of a residue (assumed to be 8 Å in this work).

We assumed the reference function to be independent of side chain and protein; that is, it includes all side chains in all the proteins used.

The components $M_{ij;p}(r)$ and $T_{ij;p}(r)$ of the radial reference functions are defined as follows (refs. 9, 72):

\[
M_{ij;p}(r) = \sum_{k \ge 10} n_{ij;k;p}\, P(r; k) \tag{A-10}
\]

where $k$ is the number of peptide groups separating side chains of types $i$ and $j$, $n_{ij;k;p}$ is the total number of pairs of side chains of types $i$ and $j$ separated by $k$ peptide groups in the data base of protein crystal structures, and $P(r; k)$ is the Markovian probability density that two side chains separated by $k$ peptide groups are at the distance $r$; we assumed the form given by eq. (25) of ref. 9 for $P(r; k)$. From ref. 72:

\[
T_{ij;p}(r) = \frac{1}{4\pi r^2 V_p} \int_{S_p} \int_{S_p \cap S(\mathbf{x}, r)} \rho_{i;p}(\mathbf{x})\, \rho_{j;p}(\mathbf{y})\, d^2 y\, d^3 x \tag{A-11}
\]

where $S_p$ and $V_p$ denote the region of space occupied by the $p$th protein and the volume of this region, respectively, $S(\mathbf{x}, r)$ is the sphere of radius $r$ centered at the point $\mathbf{x}$, and $\rho_i$ and $\rho_j$ are the average single-body densities of residues of types $i$ and $j$; we assume that they depend only on the ratio of the distance from the center of a protein to its radius of gyration. The corresponding average density, used for $\rho_{i;p}$ and $\rho_{j;p}$, is calculated from eq. (A-12).

\[
\rho_i(\xi) = \frac{\sum_{p=1}^{n_p} w_p\, n_{i;p}\left(\xi - \Delta\xi/2 \le r/r^{gy}_p \le \xi + \Delta\xi/2\right)}{\sum_{p=1}^{n_p} w_p\, n_{i;p}} \tag{A-12}
\]

where $\xi$ is the distance from the center of a protein scaled by the radius of gyration, $r^{gy}$, $n_{i;p}(r/r^{gy}_p)$ is the average number density of residues $i$ at $r/r^{gy}_p$, and $n_{i;p}$ is the total number of residues of type $i$ in the $p$th protein. We used a step size of $\Delta\xi = 0.1$.

The angular reference function, $\Omega_{ij}(\theta^{(1)}, \theta^{(2)}, \phi)$, was calculated as the average of the angular correlation functions, $g(\theta^{(1)}, \theta^{(2)}, \phi)$, averaged over side-chain types:

\[
\Omega(\theta^{(1)}, \theta^{(2)}, \phi) = \frac{1}{210} \sum_{i=1}^{20} \sum_{j=i}^{20} g^{\theta\phi}_{ij}(\theta^{(1)}, \theta^{(2)}, \phi) \tag{A-13}
\]

This "background" angular correlation function, $n^{(0,2,\theta\phi)}$, is qualitatively similar to the function that we obtained in a test Monte Carlo simulation of 1000 random-sequence and random-conformation 50-residue polypeptide chains confined to an average volume (ref. 67) characteristic of a 50-residue protein, using the united-residue potential of our earlier work (refs. 38, 39) and a procedure developed by Hao et al. for confined-space simulations (ref. 67). The united-residue force field did not include any side-chain anisotropy (refs. 38, 39). Nevertheless, the average angular correlation function exhibits some anisotropy (Fig. 4). This results from the fact that one end of each side chain is tethered to the backbone and therefore the "free" ends of the side chains can easily approach each other, whereas the tethered ends cannot. The similarity of the "background" angular correlation function obtained from protein crystal data to the function obtained in simulations with a radial potential (shown in Fig. 4) justifies its use as the reference angular correlation function.

References

1. M. Levitt and A. Warshel, Nature, 253, 694 (1975).
2. M. Levitt, J. Mol. Biol., 104, 59 (1976).
3. M. R. Pincus and H. A. Scheraga, J. Phys. Chem., 81, 1579 (1977).
4. P. R. Gerber, Biopolymers, 32, 1003 (1992).
5. A. Wallqvist and M. Ullner, Proteins, 18, 267 (1994).
6. A. Rey and J. Skolnick, Proteins, 16, 8 (1993).
7. A. Rey and J. Skolnick, J. Chem. Phys., 100, 2267 (1994).
8. S. Tanaka and H. A. Scheraga, Macromolecules, 9, 945 (1976).
9. S. Miyazawa and R. L. Jernigan, Macromolecules, 18, 534 (1985).
10. L. M. Gregoret and F. E. Cohen, J. Mol. Biol., 211, 959 (1990).
11. D. G. Covell, Proteins, 14, 409 (1992).
12. J. Skolnick and A. Koliński, Science, 250, 1121 (1990).
13. A. Koliński and J. Skolnick, J. Chem. Phys., 97, 9412 (1992).
14. A. Koliński, A. Godzik, and J. Skolnick, J. Chem. Phys., 98, 7420 (1993).
15. A. Godzik, A. Koliński, and J. Skolnick, J. Comput.-Aid. Mol. Des., 7, 397 (1993).


16. J. Skolnick, A. Koliński, C. L. Brooks, III, A. Godzik, and A. Rey, Cur. Biol., 3, 414 (1993).
17. A. Koliński and J. Skolnick, Proteins, 18, 338 (1994).
18. A. Koliński and J. Skolnick, Proteins, 18, 353 (1994).
19. M. Vieth, A. Koliński, C. L. Brooks, III, and J. Skolnick, J. Mol. Biol., 237, 361 (1994).
20. R. A. Goldstein, Z. A. Luthey-Schulten, and P. G. Wolynes, Proc. Natl. Acad. Sci. USA, 89, 9029 (1992).
21. N. S. Goel and M. Yčas, J. Theor. Biol., 77, 253 (1979).
22. H. Wako and H. A. Scheraga, J. Prot. Chem., 1, 5 (1982).
23. H. Wako and H. A. Scheraga, J. Prot. Chem., 1, 85 (1982).
24. G. M. Crippen and V. N. Viswanadhan, Int. J. Peptide Prot. Res., 24, 279 (1984).
25. G. M. Crippen and V. N. Viswanadhan, Int. J. Peptide Prot. Res., 25, 487 (1985).
26. G. M. Crippen and P. K. Ponnuswamy, J. Comput. Chem., 8, 972 (1987).
27. G. M. Crippen and M. E. Snow, Biopolymers, 29, 1479 (1990).
28. P. Seetharamulu and G. M. Crippen, J. Math. Chem., 6, 91 (1991).
29. V. N. Maiorov and G. M. Crippen, J. Mol. Biol., 227, 876 (1992).
30. V. N. Maiorov and G. M. Crippen, Proteins, 20, 167 (1994).
31. G. M. Crippen and V. N. Maiorov, In Protein Structure Distance Analysis, H. Bohr and S. Brunak, Eds., IOS Press, Amsterdam, 1994, p. 158.
32. C. Wilson and S. Doniach, Proteins, 6, 193 (1989).
33. K. Nishikawa and Y. Matsuo, Prot. Eng., 6, 811 (1993).
34. M. J. Sippl, J. Mol. Biol., 213, 859 (1990).
35. G. Casari and M. J. Sippl, J. Mol. Biol., 224, 725 (1992).
36. M. J. Sippl, J. Comput.-Aid. Mol. Des., 7, 473 (1993).
37. S. Sun, Prot. Sci., 2, 762 (1993).
38. A. Liwo, M. R. Pincus, R. J. Wawak, S. Rackovsky, and H. A. Scheraga, Prot. Sci., 2, 1697 (1993).
39. A. Liwo, M. R. Pincus, R. J. Wawak, S. Rackovsky, and H. A. Scheraga, Prot. Sci., 2, 1715 (1993).
40. K. A. Dill, Biochemistry, 29, 7133 (1990).
41. E. I. Shakhnovich and A. M. Gutin, Proc. Natl. Acad. Sci. USA, 90, 7195 (1993).
42. M.-H. Hao and H. A. Scheraga, J. Phys. Chem., 98, 4940 (1994).
43. Z. Li and H. A. Scheraga, Proc. Natl. Acad. Sci. USA, 84, 6611 (1987).
44. Z. Li and H. A. Scheraga, J. Mol. Struct. (Theochem), 179, 333 (1988).
45. J. Kostrowicki and H. A. Scheraga, J. Phys. Chem., 96, 7442 (1992).
46. K. A. Olszewski, L. Piela, and H. A. Scheraga, J. Phys. Chem., 97, 267 (1993).
47. A. Liwo, M. R. Pincus, R. J. Wawak, S. Rackovsky, S. Ołdziej, and H. A. Scheraga, J. Comput. Chem. (accompanying article).
48. F. A. Momany, R. F. McGuire, A. W. Burgess, and H. A. Scheraga, J. Phys. Chem., 79, 2361 (1975).
49. G. Némethy, M. S. Pottle, and H. A. Scheraga, J. Phys. Chem., 87, 1883 (1983).
50. I. K. Roterman, M. H. Lambert, K. D. Gibson, and H. A. Scheraga, J. Biomol. Struct. Dyn., 7, 421 (1989).
51. Cited in: H. Margenau and N. R. Kestner, Theory of Intermolecular Forces, Pergamon Press, Oxford, p. 107, 1st ed. (1969).
52. B. J. Berne and P. Pechukas, J. Chem. Phys., 56, 4213 (1972).
53. J. G. Gay and B. J. Berne, J. Chem. Phys., 74, 3316 (1981).
54. G. R. Luckhurst, R. A. Stephens, and R. W. Phippen, Liquid Cryst., 8, 451 (1990).
55. A. P. J. Emerson, R. Hashim, and G. R. Luckhurst, Mol. Phys., 76, 241 (1992).
56. Y. N. Vorobjev, Biopolymers, 29, 1503 (1990).
57. Y. N. Vorobjev, Biopolymers, 29, 1519 (1990).
58. H. B. Bürgi and J. D. Dunitz, Acc. Chem. Res., 16, 153 (1983).
59. A. Godzik, A. Koliński, and J. Skolnick, Prot. Sci., 4, 2107 (1995).
60. J.-L. Fauchère and V. Pliška, Eur. J. Med. Chem., 18, 369 (1983).
61. R. J. Carroll and D. Ruppert, Transformation and Weighting in Regression, Chapman and Hall, New York, 1988, p. 13.
62. D. A. Ratkowsky, Handbook of Nonlinear Regression Models, Marcel Dekker, New York, 1990, p. 38.
63. D. J. Lipman and W. R. Pearson, Science, 227, 1435 (1985).
64. W. R. Pearson and D. J. Lipman, Proc. Natl. Acad. Sci. USA, 85, 2444 (1988).
65. H. Späth, Cluster Analysis Algorithms, Halsted Press, New York, 1980, p. 170.
66. H. H. Gan and B. C. Eu, J. Chem. Phys., 100, 5922 (1994).
67. M. H. Hao, S. Rackovsky, A. Liwo, M. R. Pincus, and H. A. Scheraga, Proc. Natl. Acad. Sci. USA, 89, 6614 (1992).
68. Y. Nozaki and C. Tanford, J. Biol. Chem., 246, 2211 (1971).
69. A. Magalhaes, B. Maigret, J. Hoflack, J. N. F. Gomes, and H. A. Scheraga, J. Prot. Chem., 13, 195 (1994).
70. D. W. Marquardt, J. Soc. Indust. Appl. Math., 11, 431 (1963).
71. G. A. F. Seber and C. J. Wild, Nonlinear Regression, Wiley, New York, 1989, p. 228.
72. J. Edelman, Biopolymers, 32, 3 (1992).
