Correspondence to: H. A. Scheraga; e-mail: has5@cornell.edu
This article includes Supplementary Material available from the authors upon request or via the Internet at ftp.wiley.com/public/journals/jcc/suppmat/18/849 or http://www.journals.wiley.com/jcc
Contract grant sponsors: Polish State Committee for Scientific Research, contract grant number: PB 190/T09/96/10; National Institute on Aging, contract grant number: AG 00322; National Institute of General Medical Sciences, contract grant number: GM 14312; National Science Foundation, contract grant number: MCB95-13167; National Cancer Institute, contract grant number: CA 42500
polypeptide chains38,39 and a dipole-path method (based on an optimal hydrogen-bond network) to convert the α-carbon trace to an all-atom backbone.38 This procedure succeeded in predicting the three-dimensional structure of the avian pancreatic polypeptide.

As mentioned previously, there are two ways to explore the conformational space of polypeptide chains with the use of a united-residue potential: the on-lattice and the off-lattice approach. In the first case, the polypeptide chain is superposed on a discrete lattice, and the number of possible conformations is, therefore, finite. In the simplest approach, the interaction potential is reduced to a set of residue–residue contact free energies.8–10 The rationale for such an approach was based on the assumption that side-chain packing is the principal driving force in protein folding; more recent studies, however, have shown that this assumption is probably not true.13 The recent approach developed in Skolnick's group incorporates many different interactions that can be responsible for protein folding: side-chain packing; local interactions; hydrogen bonding; surface energy; and cooperativity in side-chain packing and hydrogen bonding.12–19 The contact and hydrogen-bonding energies depend on the distance and orientation of the interacting sites. The resulting force field was able to locate the near-native structures of a number of test proteins as the lowest energy ones.14–16,18,19 The parameters of the potentials for on-lattice simulations were determined from a statistical analysis of the distributions of interacting sites obtained from the crystal data of known proteins. Because the aforementioned force field expresses most of the energy components as analytical functions of geometry, it can also be used for off-lattice simulations.

For the sake of completeness, we mention here simple lattice models of proteins in which contact free energies and other interaction parameters are assigned arbitrary values (usually three types of contacts are chosen: hydrophobic–hydrophobic; hydrophobic–hydrophilic; and hydrophilic–hydrophilic); however, such models were used to study general statistical–mechanical characteristics of polypeptide chains and the folding process, but have not yet been used for predicting the three-dimensional structures of real proteins.40–42

The united-residue potentials for off-lattice simulations have an even longer history than the on-lattice ones.1–7,24–37,39 They have also been used with considerable success to predict the three-dimensional structure of known proteins.28–31,35,37,38 In contrast to the on-lattice potentials, they are functions of continuous variables. Therefore, the off-lattice approach to protein folding enables local energy minimization to be carried out for generated structures. Thus, many powerful techniques for global-search minimization, such as Monte Carlo with Minimization (MCM),43,44 the diffusion equation method (DEM),45 or the self-consistent mean torsional field (SCMTF)46 method can be applied. This was the rationale for choosing the off-lattice potential in the present work.

The present work was aimed at determining the long-range potential for side-chain interactions. We parameterized several functional forms for the interaction potential that also include angular dependence. This was motivated by the results of a preliminary analysis of the average dimensions of the side chains as calculated from the Protein Data Bank, which show that "geometrical" anisotropy (which can be defined as the ratio of the long axis of a side chain to the geometric average of the two shorter axes) is pronounced in almost all cases. Also, Koliński et al. noticed that the pair distributions of side chains exhibit some anisotropy, and included this effect in their on-lattice statistical potential.14 The short-range part of the potential is presented in the accompanying work.47

Methods

REPRESENTATION OF POLYPEPTIDE CHAINS AND INTERACTION SCHEME

The united-residue model of polypeptide chains adopted in this work is a natural extension of the model developed in our earlier studies.38,39 The chain is represented by a sequence of α-carbons (Cα) linked by virtual bonds with attached united side chains (SC) and united peptide groups (p) located in the middle between the consecutive α-carbons. Only the united peptide groups and united side chains serve as interaction sites, the α-carbons assisting in the definition of the geometry (Fig. 1). As in our previous model,38,39 all the virtual bond lengths (i.e., Cα–Cα and Cα–SC) are fixed; the Cα–Cα distance is taken as 3.8 Å, which corresponds to trans peptide bonds. We allow, however, for variation of the side-chain positions with respect to the backbone ($\alpha_{SC}$ and $\beta_{SC}$), and for the variation of the virtual-bond angles, $\theta$, which were assumed fixed in our earlier approach.38,39
$$U = \sum_{i<j} U_{SC_i SC_j} + \sum_{i \ne j} U_{SC_i p_j} + w_{el} \sum_{i<j-1} U_{p_i p_j} + w_{tor} \sum_i U_{tor}(\gamma_i) + w_{loc} \sum_i \left[ U_b(\theta_i) + U_{rot}(\alpha_{SC_i}, \beta_{SC_i}) \right] + w_{corr} U_{corr} \qquad (1)$$
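The weighted sum of eq. (1) can be illustrated with a short numerical sketch. This is only an illustration, not the authors' implementation: the container `TermWeights`, the function `total_energy`, and the convention of passing each group of already-evaluated pair or site energies as a plain list are all assumptions made here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TermWeights:
    """Hypothetical container for the weights w_el, w_tor, w_loc, w_corr of eq. (1)."""
    el: float
    tor: float
    loc: float
    corr: float


def total_energy(
    u_scsc: Sequence[float],  # U_SCiSCj values, one per pair i < j
    u_scp: Sequence[float],   # U_SCipj values, one per pair i != j
    u_pp: Sequence[float],    # U_pipj values, one per pair i < j - 1
    u_tor: Sequence[float],   # U_tor(gamma_i) values
    u_b: Sequence[float],     # U_b(theta_i) values
    u_rot: Sequence[float],   # U_rot(alpha_SCi, beta_SCi) values
    u_corr: float,            # U_corr, a single number
    w: TermWeights,
) -> float:
    """Combine precomputed energy terms as in eq. (1).

    The SC-SC and SC-p sums enter unweighted; the remaining terms are scaled.
    """
    return (
        sum(u_scsc)
        + sum(u_scp)
        + w.el * sum(u_pp)
        + w.tor * sum(u_tor)
        + w.loc * (sum(u_b) + sum(u_rot))
        + w.corr * u_corr
    )
```

Keeping the weights in one small object mirrors the idea that they form a small set of adjustable parameters that is separate from the parameters of the individual terms.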
atom potential, for example, the local and hydrogen-bonding interactions, and some estimated from protein crystal data. Such a division is motivated by the fact that, if direct averaging is computationally feasible, as in the case of the local and hydrogen-bonding interactions, the resulting potential will always be more accurate than that calculated from experimental distribution functions, whose accuracy is severely limited by the sparse number of protein crystal data. Conversely, obtaining the hydrophobic potential by direct averaging is in most cases not feasible, owing to the large number of degrees of freedom over which averaging must be carried out (i.e., the dihedral angles, $\chi$, for each side chain) and possibly to the necessity of including explicit water molecules in the averaging. Such a combination was implemented in our earlier work.38 The local-interaction and backbone hydrogen-bonding terms were determined by direct averaging of the all-atom ECEPP/2 48,49 potential. This was motivated by the fact that local and hydrogen-bonding interactions are well represented in the ECEPP force field.50 The hydrophobic potential was assumed to have a

tential, $U_{pp}$, we use the energy function developed, and then parameterized through averaging of the all-atom ECEPP/2 potential,48,49 in our earlier studies.38,39 The derivation of local-interaction terms $U_b(\theta)$ and $U_{rot}(\alpha_{SC}, \beta_{SC})$ from protein-crystal data will be described in an accompanying article. In the accompanying work,47 we also describe the procedure for the determination of the relative weights so as to locate the native structures of a set of training proteins as the lowest-energy ones. Therefore, our approach is a combination of all the procedures to determine the aforementioned potential. Use of distribution functions or averaging of all-atom potentials to obtain individual energy terms allows us to collect data from the PDB or from all-atom potential functions with meaningful statistics. The use of flexible weights, which constitute a small subset of adjustable parameters, enables us to scale the individual terms so as to obtain a folding potential. The procedure for weight determination is described in the accompanying article.47

MODELING SIDE-CHAIN INTERACTIONS

The general form of the side-chain interaction ($U_{SC_i SC_j}$) parameterized in this work is given by:
$$\epsilon_{ij} \equiv \epsilon(\omega^{(1)}_{ij}, \omega^{(2)}_{ij}, \omega^{(12)}_{ij}) = \epsilon^{0}_{ij}\,\epsilon^{(1)}_{ij}\,\epsilon^{(2)}_{ij}\,\epsilon^{(3)}_{ij}$$

$$\epsilon^{(1)}_{ij} = \left[ 1 - \chi^{(1)}_{ij}\chi^{(2)}_{ij}\,\omega^{(12)2}_{ij} \right]^{-1/2}$$

$$\epsilon^{(2)}_{ij} = \left[ \,\cdots\; \chi'^{(1)}_{ij}\,\omega^{(1)2}_{ij} + \chi'^{(2)}_{ij}\,\omega^{(2)2}_{ij} \;\cdots\, \right]^{2}$$

$$x_{ij} = \begin{cases} \dfrac{\sigma_{ij}}{r_{ij}} & \text{for the BP potential} \\[2ex] \dfrac{\sigma^{0}_{ij}}{r_{ij} - \sigma_{ij} + \sigma^{0}_{ij}} & \text{for the GB potential} \end{cases} \qquad (7)$$

with:

$$\sigma_{ij} = \sigma^{0}_{ij} \left[ 1 - \frac{\chi^{(1)}_{ij}\,\omega^{(1)2}_{ij} + \chi^{(2)}_{ij}\,\omega^{(2)2}_{ij} - 2\chi^{(1)}_{ij}\chi^{(2)}_{ij}\,\omega^{(1)}_{ij}\omega^{(2)}_{ij}\omega^{(12)}_{ij}}{1 - \chi^{(1)}_{ij}\chi^{(2)}_{ij}\,\omega^{(12)2}_{ij}} \right]^{-1/2} \qquad (8)$$

$$\omega^{(1)}_{ij} = \hat{u}^{(1)}_{ij} \cdot \hat{r}_{ij} = \cos\theta^{(1)}_{ij}$$

$$\omega^{(2)}_{ij} = \hat{u}^{(2)}_{ij} \cdot \hat{r}_{ij} = \cos\theta^{(2)}_{ij}$$

$$\omega^{(12)}_{ij} = \hat{u}^{(1)}_{ij} \cdot \hat{u}^{(2)}_{ij} = \cos\theta^{(1)}_{ij}\cos\theta^{(2)}_{ij} + \sin\theta^{(1)}_{ij}\sin\theta^{(2)}_{ij}\cos\phi_{ij}$$

where $\hat{u}^{(1)}_{ij}$ and $\hat{u}^{(2)}_{ij}$ are unit vectors along the principal axes of the interacting sites (in this work identified with the Cα–SC axes), $\mathbf{r}_{ij}$ is the vector linking the centers of the sites, $\hat{r}_{ij}$ is the corresponding unit vector, $r_{ij}$ is the distance between the side-chain centers (Fig. 2), the constants $\chi^{(1)}_{ij}$ and $\chi^{(2)}_{ij}$ are the anisotropies of the van der Waals radius, and the constants $\chi'^{(1)}_{ij}$ and $\chi'^{(2)}_{ij}$ are the anisotropies of the van der Waals well depth.

The angular dependence of $\epsilon^{(1)}_{ij}$ and $x_{ij}$ arises from the extension of the Gaussian overlap potential to the LJ-type function.53 Additional dependence of the van der Waals well depth on orientation in the form of $\epsilon^{(2)}_{ij}$ has been introduced by GB.53 For the original BP potential, $\epsilon^{(2)}_{ij} = 1$, but we keep its orientational dependence to preserve the same form of the potential. The formulas are generalized in this work to the case of ellipsoids of revolution with different axes (the BP and GB potentials were originally derived for the interaction of identical ellipsoidal bodies). Finally, the function $\epsilon^{(3)}_{ij}$ with the constants $\alpha^{(1)}_{ij}$ and $\alpha^{(2)}_{ij}$ has been introduced in this work to account for the lower symmetry of the angular distribution functions observed in protein crystals than that implied by the three potentials outlined previously. Squaring in the expressions for $\epsilon^{(2)}_{ij}$ and $\epsilon^{(3)}_{ij}$ is done to keep them non-negative.

As in the case of the radial potentials, the constants $\sigma^{0}_{ij}$, $r^{0}_{ij}$, $\chi^{(1)}_{ij}$, $\chi^{(2)}_{ij}$, $\chi'^{(1)}_{ij}$, $\chi'^{(2)}_{ij}$, $\alpha^{(1)}_{ij}$, and $\alpha^{(2)}_{ij}$ can be assumed to be pair-dependent or constrained to be calculated from single-residue constants. In this work, we tried both procedures. For the constants $r^{0}_{ij}$ and $\sigma^{0}_{ij}$, the formulas are given by eq. (5). For the case of different ellipsoids, the anisotropies of the van der Waals distances can be expressed by eq. (9):

$$\chi^{(1)}_{ij} = \frac{\sigma_{i\parallel}^{2} - \sigma_{i\perp}^{2}}{\sigma_{i\parallel}^{2} + \sigma_{j\perp}^{2}}; \qquad \chi^{(2)}_{ij} = \frac{\sigma_{j\parallel}^{2} - \sigma_{j\perp}^{2}}{\sigma_{j\parallel}^{2} + \sigma_{i\perp}^{2}} \qquad (9)$$

Further, for the Gaussian overlap model, it follows52 that $\sigma_{i\perp} = \sigma^{0}_{i}$ and $\sigma_{j\perp} = \sigma^{0}_{j}$. The constants $\sigma_{\perp}$ and $\sigma_{\parallel}$ can be identified with the lengths of the short and long axes of the ellipsoids, respectively. Our variable parameters were $\sigma^{0}$ and the ratio $(\sigma_{\parallel}/\sigma_{\perp})^{2}$; the first one can be considered as a measure of the size, and the second of the anisotropy of a side chain.

The same type of dependence can be assumed for the constants $\chi'$, but then there would be too many parameters to be determined. Therefore, we assumed that the constants $\chi'$ and $\alpha$ depend on single-residue types, namely:

$$\chi'^{(1)}_{ij} = \chi'_{i}, \quad \chi'^{(2)}_{ij} = \chi'_{j}, \quad \alpha^{(1)}_{ij} = \alpha_{i}, \quad \alpha^{(2)}_{ij} = \alpha_{j} \qquad (10)$$

It should be noted that, in the case of isotropic interactions, the GB and BP potentials become the LJ potential, whereas the GBV potential becomes the LJK potential.

PARAMETERIZING SIDE-CHAIN INTERACTION POTENTIALS

Similar to earlier work,24–26 we determine the parameters of the potentials introduced in the preceding section by fitting them to correlation functions and contact free energies calculated from protein-crystal data. In doing this, we make the following two assumptions:

1. The correlation functions obtained by using a sufficiently large number of protein crystal data (each of which corresponds to a system at a free-energy minimum) are sufficiently good approximations to the correlation functions of a hypothetical "stochastic" mixture of nonconnected side chains. This approximation is justified by the observation that, although a crystal structure is at equilibrium as the whole structure, its individual parts can be forced to assume geometries far from locally equilibrated, locally lower energy conformations having, however, higher probability of occurrence in the whole structure.58 For example, the distributions of X–H bond lengths obtained from large data bases of
crystal structures are qualitatively similar to those calculated from potential-energy surfaces of proton transfer.58

2. Interactions between the side chains can be described with sufficient accuracy by using the potential of mean force, $W_{ij}(r_{ij}, \theta^{(1)}_{ij}, \theta^{(2)}_{ij}, \phi_{ij})$, related directly to the corresponding side-chain pair correlation functions, $g_{ij}(r_{ij}, \theta^{(1)}_{ij}, \theta^{(2)}_{ij}, \phi_{ij})$:

$$g_{ij}(r_{ij}, \theta^{(1)}_{ij}, \theta^{(2)}_{ij}, \phi_{ij}) = \exp\left[ -W_{ij}(r_{ij}, \theta^{(1)}_{ij}, \theta^{(2)}_{ij}, \phi_{ij}) / RT \right] \qquad (11)$$

where $R$ and $T$ are the gas constant and the absolute temperature, respectively. According to point 1, the reference state in eq. (11) corresponds to a hypothetical polypeptide chain with noninteracting side chains (the unfolded state according to the classification of Godzik et al.59).

Because we want to exclude the effects of local interactions [since the local interactions are included in the terms $U_b(\theta)$, $U_{tor}(\gamma)$, and $U_{rot}(\alpha_{SC}, \beta_{SC})$ of eq. (1)], we consider only the side chains that are separated by at least ten peptide groups. This also makes it legitimate to disregard the direction of the chain separating the residues; therefore, we assume that $W_{i_k, j_l} = W_{i_l, j_k}$, where $i_k$ denotes a residue

averaged over the angles $\theta^{(1)}_{ij}$, $\theta^{(2)}_{ij}$, and $\phi_{ij}$]. Thus, the potentials of mean force and, in turn, the correlation functions depend parametrically on the constants appearing in eqs. (3)–(8), which can be optimized so that the theoretical correlation functions given by eq. (11) fit best (in the least-squares sense) to the correlation functions determined from protein crystal data.

The limited number of data that we are able to collect from protein crystals prohibits the direct use of the full radial-and-angular correlation function $g_{ij}(r_{ij}, \theta^{(1)}_{ij}, \theta^{(2)}_{ij}, \phi_{ij})$. Therefore, our target function for parameter estimation includes correlation functions that are averaged over some of the variables, and side-chain contact free energies. The side-chain contact free energies are the logarithms of the correlation functions averaged over the coordination sphere of the interacting side chains. To determine the parameters of the potentials, we minimized the weighted sum of the squares $\Phi(X)$ of the differences between histograms of the radial ($H^{r}_{ij;k}$), radial-angular ($H^{r\theta}_{ij;klm}$), and angular ($H^{\theta\phi}_{ij;klm}$) correlation functions, as well as contact free energies ($F_{ij}$), calculated as functions of the parameters, and determined from protein-crystal data, respectively:

$$\Phi(X) = \sum_{i=1}^{20} \sum_{j=1}^{20} w_{ij} \Bigl\{ w^{r} \sum_{k=1}^{n^{r}_{ij}} \left[ H^{r}_{ij;k}(X) - \hat{H}^{r}_{ij;k} \right]^{2} + \cdots \Bigr\}$$
pairs in protein $p$, respectively); $n^{r}_{ij}$ is the number of distance values considered for a pair of side chains of types $i$ and $j$; $n_{\theta}$ and $n_{\phi}$ are the numbers of values of the angles $\theta^{(1)}$ or $\theta^{(2)}$, and $\phi$; $w^{\theta\phi}$, $w^{r}$, $w^{r\theta}$, and $w^{F}$ are weights of the histograms and of the free energy, respectively; $X$ is a shorthand for the parameters of the target potential, and a "hat" (circumflex) over a quantity designates the value determined from the crystal data. For the radial-only potentials, LJ and LJK, the angular and radial-angular components $\Phi^{\theta\phi}(X)$ and $\Phi^{r\theta}(X)$ do not appear in the expression for $\Phi(X)$.

The histograms of the correlation functions $H^{r}_{ij;k}$, $H^{r\theta}_{ij;klm}$, and $H^{\theta\phi}_{ij;klm}$ are defined by:

$$H^{r}_{ij;k} = \frac{g^{r}_{ij}(r_k)}{\sum_{k=1}^{n^{r}_{ij}} g^{r}_{ij}(r_k)}$$

$$H^{\theta\phi}_{ij;klm} = \frac{g^{\theta\phi}_{ij}(\theta^{(1)}_{k}, \theta^{(2)}_{l}, \phi_{m})\, \Delta\cos\theta^{(1)}_{k}\, \Delta\cos\theta^{(2)}_{l}}{\sum_{k=1}^{n_{\theta}} \sum_{l=1}^{n_{\theta}} \sum_{m=1}^{n_{\phi}} g^{\theta\phi}_{ij}(\theta^{(1)}_{k}, \theta^{(2)}_{l}, \phi_{m})\, \Delta\cos\theta^{(1)}_{k}\, \Delta\cos\theta^{(2)}_{l}}$$
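The two histogram definitions above amount to normalizing binned correlation-function values so that each histogram sums to one. A minimal sketch follows, assuming the binned values are already available as nested Python lists; the function names `radial_histogram` and `angular_histogram` are illustrative and do not come from the original work.

```python
from typing import List, Sequence


def radial_histogram(g_r: Sequence[float]) -> List[float]:
    """Normalize radial correlation values g(r_k) so that they sum to 1."""
    total = sum(g_r)
    if total == 0:
        raise ValueError("all bins are empty; the histogram is undefined")
    return [value / total for value in g_r]


def angular_histogram(
    g: Sequence[Sequence[Sequence[float]]],  # g[k][l][m] = g(theta1_k, theta2_l, phi_m)
    dcos1: Sequence[float],                  # bin widths in cos(theta1), one per k
    dcos2: Sequence[float],                  # bin widths in cos(theta2), one per l
) -> List[List[List[float]]]:
    """Weight each angular bin by its cos(theta) widths, then normalize to 1."""
    weighted = [
        [[g[k][l][m] * dcos1[k] * dcos2[l] for m in range(len(g[k][l]))]
         for l in range(len(g[k]))]
        for k in range(len(g))
    ]
    total = sum(v for plane in weighted for row in plane for v in row)
    if total == 0:
        raise ValueError("all bins are empty; the histogram is undefined")
    return [[[v / total for v in row] for row in plane] for plane in weighted]
```

Because both histograms are normalized, the theoretical and crystal-derived versions can be compared bin by bin in the sum of squares regardless of the absolute scale of the correlation functions.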
by:

$$\tilde{w}^{r} = 1/\sigma_{r}^{2} = \Bigl\{ (1/S_w) \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij}\,(1/n^{r}_{ij}) \sum_{k=1}^{n^{r}_{ij}} \left[ H^{r}_{ij;k}(X) - \hat{H}^{r}_{ij;k} \right]^{2} \Bigr\}^{-1}$$

$$\tilde{w}^{\theta\phi} = 1/\sigma_{\theta\phi}^{2} = \Bigl\{ \bigl(1/S_w n_{\theta}^{2} n_{\phi}\bigr) \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij} \sum_{k=1}^{n_{\theta}} \sum_{l=1}^{n_{\theta}} \sum_{m=1}^{n_{\phi}} \left[ H^{\theta\phi}_{ij;klm}(X) - \hat{H}^{\theta\phi}_{ij;klm} \right]^{2} \Bigr\}^{-1}$$

$$\tilde{w}^{r\theta} = 1/\sigma_{r\theta}^{2} = \Bigl\{ \bigl(1/S_w n_{\theta}^{2}\bigr) \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij}\,(1/n^{r}_{ij}) \sum_{k=1}^{n^{r}_{ij}} \sum_{l=1}^{n_{\theta}} \sum_{m=1}^{n_{\theta}} \left[ H^{r\theta}_{ij;klm}(X) - \hat{H}^{r\theta}_{ij;klm} \right]^{2} \Bigr\}^{-1}$$

$$\tilde{w}^{F} = 1/\sigma_{F}^{2} = \Bigl\{ (1/S_w) \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij} \left[ F_{ij}(X) - \hat{F}_{ij} \right]^{2} \Bigr\}^{-1} \qquad (17)$$

where $\sigma^{2}$ is a variance of the corresponding quantity, $S_w = \sum_{i=1}^{20} \sum_{j=i}^{20} w_{ij}$, and the other symbols are defined in eq. (13).

The standard deviations of the parameters were estimated according to the Gauss–Markov formula62:

$$[\sigma(x_i)]^{2} = \frac{\Phi(X^{*})}{N - p} \left[ J^{T}(X^{*})\, W(X^{*})\, J(X^{*}) \right]^{-1}_{ii} \qquad (18)$$

where $N = 210\, n_{\theta}^{2} n_{\phi} + (n_{\theta}^{2} n_{\phi} + 1) \sum_{i=1}^{20} \sum_{j=i}^{20} n^{r}_{ij} + 210$ is the total number of terms in eq. (13), $p$ is the total number of parameters, $J_{ij} = \partial\delta_i / \partial x_j$ is the element of the first derivatives of the residuals $\delta_1, \delta_2, \ldots, \delta_N$ that occur in the sum of the squares; $W$ is the corresponding matrix of weights, and $X^{*}$ denotes the vector of the parameters at the minimum.

SELECTION OF PROTEIN STRUCTURES

The protein crystal structures were taken from the Brookhaven Protein Data Bank. First, the list of available structures obtained from the PDB server pdb.pdb.bnl.gov (as of June 25, 1994) was scanned and those with a resolution not exceeding 2 Å and having a chain length of at least 100 amino-acid residues were selected. Then, the percentage of sequence homology was calculated for all pairs of sequences using the FASTA program63,64 available on anonymous ftp from uvaarpa.virginia.edu. Then, cluster analysis was carried out with the minimal-tree algorithm,65 taking the values of (100% − percentage homology) as distances between pairs of structures. This grouped the selected proteins into 154 families of homologous structures. From each family, those structures were taken that had the highest resolution or, if the resolution was the same, the longest chain(s). In several cases, however, both criteria were satisfied by more than one structure. In such a case, we took all the structures satisfying the criteria, diminishing their statistical weights when calculating the histograms of pair-correlation functions and contact free energies. The final list contained 195 structures, whose identities are summarized in Table I of the Supplementary Material.

Results and Discussion

DISTRIBUTION AND CORRELATION FUNCTIONS

In all calculations, we assumed that the centers of interactions are in the geometric centers of the side chains, calculated from the coordinates of the nonhydrogen atoms, including Cα, as expressed by:

$$\mathbf{R}_i = \frac{1}{NH_i} \sum_{j=1}^{NH_i} \mathbf{r}_{ji} \qquad (19)$$

where $\mathbf{R}_i$ represents the coordinates of the geometric center of the $i$th side chain, $\mathbf{r}_{ji}$ represents the coordinates of the $j$th nonhydrogen atom of the $i$th side chain, and $NH_i$ is the number of nonhydrogen atoms in side-chain $i$. The index $i$ denotes an individual side chain in the data base and not the side-chain type. For glycine the position of the side-chain atom coincides with the position of Cα.

When calculating the pair distributions and contact free energies, we excluded disulfide-bonded cystine pairs; however, the nonbonded cysteine pairs were included. The weights of the structures were calculated from the following formula:

$$w_p = \frac{1}{n_{chain}\, n_{hom}\, res^{2}} \qquad (20)$$
where $n_{chain}$ is the number of equivalent chains in a protein, $n_{hom}$ is the number of homologous structures, if more than one was taken from a family, and $res$ is the crystallographic resolution. The weights, $w_p$, appear in equations in the Appendix.

The single-body densities of the amino-acid side chains were calculated from eq. (A-12) of the Appendix, and were subsequently used in the evaluation of the factor $T_{ij}$ used for the calculation of reference (i.e., in the absence of any side-chain interactions) radial and radial-angular probability distributions [eqs. (A-7) and (A-9) of the Appendix]. A sample collection of radial pair densities together with the reference and total pair distribution functions is shown for the Leu–Leu pair in Figure 3. The radial distribution (curve C of Fig. 3) qualitatively resembles that calculated theoretically by Gan and Eu66 in their study of model van der Waals polymer chains. As shown, the distribution calculated from single-body density and the Markovian factor (curve B of Fig. 3) approximates quite well the Leu–Leu pair distribution function (curve C of Fig. 3) for distances longer than 10 Å. The correlation function (curve A of Fig. 3) at distances longer than 10 Å is almost constant. Greater deviations occur only at very long distances; this can be explained by the fact that the single-body density is determined with poor accuracy at longer distances.

To calculate the reference angular distribution functions, we averaged the computed angular distribution functions over all pairs of side chains, using the method of Hao et al.67 (see eq. (A-13) and the following text in the Appendix for details). However, the angular pair correlation functions, calculated from eq. (A-5), still exhibited the behavior of the "background" correlation function (solid curve of Fig. 4) for many pairs of residues. By least-squares fitting, we found that the "background" angular pair correlation function can be described by $\epsilon^{(3)}$ of eq. (6). Therefore, we included $\epsilon^{(3)}$ in the angular potentials.
FIGURE 3. Sample pair-distribution and pair-correlation functions for the Leu–Leu pair averaged over consecutive 0.5-Å shells. (A) Radial pair correlation functions $g^{r}_{ij}$; (B) the reference pair distribution function n(2, 0, r) [denominator in eq. (A-4)]; and (C) the total pair distribution function n(2, r) [numerator in eq. (A-4)]. All graphs were normalized to the maximum values of 1.0.

FIGURE 4. The angular correlation functions $g^{\theta\phi}(\theta^{(1)}, \theta^{(2)}, \phi)$ averaged over all pairs of side-chain types and over the angle $\phi$ for displaying purpose. The surfaces were normalized so that 1 is the maximum value for both. Solid surface: the function obtained from the PDB averaged over all the pairs of side chains. Dashed surface: the function obtained in simulation studies. The latter were carried out by generating a total of 1000 50-residue energy-minimized chains with random sequence confined to the ellipsoid characteristic of proteins of this size, according to the method of Hao et al.,66 with the united-residue representation of polypeptide chains and the energy function developed in our earlier work38,39; that is, the side-chain interaction potential did not have explicit angular dependence. The simulation study shows that the background angular pair correlation function is not constant, even for a radial side-chain interaction potential.
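The angular variables on which these correlation functions depend enter the anisotropic potentials through the direction cosines $\omega^{(1)}$, $\omega^{(2)}$, $\omega^{(12)}$ and the orientation-dependent distance $\sigma_{ij}$ of eqs. (7)–(9). The following sketch evaluates those quantities numerically. It is a reading of the published formulas and not the authors' code; all function names are hypothetical, and only the BP and GB forms of the scaled distance are covered.

```python
import math
from typing import Tuple


def orientation_cosines(theta1: float, theta2: float, phi: float) -> Tuple[float, float, float]:
    """Return (omega1, omega2, omega12) from the angles theta1, theta2, phi (radians)."""
    w1 = math.cos(theta1)
    w2 = math.cos(theta2)
    w12 = w1 * w2 + math.sin(theta1) * math.sin(theta2) * math.cos(phi)
    return w1, w2, w12


def anisotropy(sig_par_a: float, sig_perp_a: float, sig_perp_b: float) -> float:
    """Eq. (9): chi for site a interacting with site b."""
    return (sig_par_a ** 2 - sig_perp_a ** 2) / (sig_par_a ** 2 + sig_perp_b ** 2)


def sigma_ij(sigma0: float, chi1: float, chi2: float,
             w1: float, w2: float, w12: float) -> float:
    """Eq. (8): orientation-dependent van der Waals distance."""
    numerator = chi1 * w1 ** 2 + chi2 * w2 ** 2 - 2.0 * chi1 * chi2 * w1 * w2 * w12
    denominator = 1.0 - chi1 * chi2 * w12 ** 2
    return sigma0 * (1.0 - numerator / denominator) ** -0.5


def scaled_distance(r: float, sigma0: float, sigma: float, form: str) -> float:
    """Eq. (7): sigma/r for BP, sigma0/(r - sigma + sigma0) for GB."""
    if form == "BP":
        return sigma / r
    if form == "GB":
        return sigma0 / (r - sigma + sigma0)
    raise ValueError("form must be 'BP' or 'GB'")
```

Two limits are useful sanity checks. With both anisotropies equal to zero, $\sigma_{ij}$ reduces to $\sigma^{0}_{ij}$ for every orientation, and the BP and GB forms both reduce to $\sigma^{0}_{ij}/r_{ij}$. For two identical ellipsoids with an axis ratio of 2 arranged end to end (all three cosines equal to 1), $\sigma_{ij}$ becomes $2\sigma^{0}_{ij}$.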
       Cys    Met    Phe    Ile    Leu    Val    Trp    Tyr    Ala    Gly    Thr    Ser    Gln    Asn    Glu    Asp    His    Arg    Lys    Pro
Cys  −4.54  −4.72  −4.94  −4.90  −4.72  −4.53  −4.60  −4.08  −3.93  −3.68  −3.73  −3.56  −3.31  −3.14  −2.97  −3.12  −3.87  −3.02  −2.71  −3.50
Met    198  −4.80  −5.03  −4.95  −4.91  −4.59  −4.87  −4.30  −3.93  −3.40  −3.62  −3.35  −3.24  −3.04  −2.90  −2.80  −3.80  −3.15  −2.58  −3.43
Phe    476    735  −5.22  −5.25  −5.18  −4.89  −5.07  −4.43  −4.04  −3.52  −3.70  −3.46  −3.25  −3.17  −2.89  −2.95  −3.95  −3.26  −2.70  −3.54
Ile    541    817   1920  −5.29  −5.21  −5.01  −4.98  −4.51  −4.18  −3.61  −3.87  −3.58  −3.30  −3.13  −3.11  −3.09  −3.74  −3.42  −2.95  −3.62
Leu    739   1365   3123   3816  −5.03  −4.85  −4.87  −4.32  −4.06  −3.56  −3.61  −3.47  −3.17  −3.09  −2.85  −2.84  −3.69  −3.24  −2.74  −3.55
       260    216    944   1237   3213   2185     97    360   2534   1793    842    971    223    452    421    504    261    270    443    396
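The ideal and excess pair-interaction free energies discussed with eq. (26) can be checked directly on entries of this table. The sketch below assumes that the negative upper-triangle entries are contact free energies $e_{ij}$ in units of $RT$ (the caption is not part of this excerpt); the dictionary layout and function names are hypothetical.

```python
# Three entries copied from the table above (assumed to be e_ij in RT units).
CONTACT_ENERGY = {
    ("Cys", "Cys"): -4.54,
    ("Cys", "Met"): -4.72,
    ("Met", "Met"): -4.80,
}


def ideal_energy(e: dict, a: str, b: str) -> float:
    """Ideal pair-interaction free energy, (e_aa + e_bb) / 2."""
    return 0.5 * (e[(a, a)] + e[(b, b)])


def excess_energy(e: dict, a: str, b: str) -> float:
    """Excess pair-interaction free energy, e_ab - (e_aa + e_bb) / 2."""
    return e[(a, b)] - ideal_energy(e, a, b)


def additive_estimate(e: dict, a: str, b: str) -> float:
    """Estimate e_ab from the diagonal terms with the fitted line of eq. (26)."""
    return 1.050 * ideal_energy(e, a, b) + 0.072


if __name__ == "__main__":
    print("ideal   :", ideal_energy(CONTACT_ENERGY, "Cys", "Met"))
    print("excess  :", excess_energy(CONTACT_ENERGY, "Cys", "Met"))
    print("eq. (26):", additive_estimate(CONTACT_ENERGY, "Cys", "Met"))
```

For the Cys–Met pair the ideal value is about −4.67 and the excess about −0.05, so this pair is close to additive; the regression line of eq. (26) gives roughly −4.83 against the tabulated −4.72.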
PROTEIN-STRUCTURE SIMULATIONS. I
LIWO ET AL.
the terminology of Godzik et al.59), rather than to a completely unarranged polypeptide chain (the U state59), which is the reference state in the Tanaka–Scheraga,8 MJ,9 and our approach. In the Koliński–Skolnick14 and Gregoret–Cohen10 potentials, the nonspecific grouping of side chains into the hydrophobic core and hydrophilic exterior is accounted for by one-body centrosymmetric potentials, whereas in our approach it is encoded in the side-chain pair potentials.

According to Miyazawa and Jernigan,9 the quantities $0.5\,q_i e_i$, where $q_i$ is the coordination number of residue of type $i$ and $e_i = \sum_{j=1}^{20} N_{ij} e_{ij} / \sum_{j=1}^{20} N_{ij}$ is the average contact free energy of residue of type $i$, can be regarded as hydrophobicities of the corresponding types of residues. Therefore, we correlated these quantities with side-chain hydrophobicities determined by Fauchère and Pliška, who measured the partition coefficients of amino acids between n-octanol and water,60 and obtained the following correlation equation:

$$-RT \times (0.5\,q_i e_i) = 1.60\,(0.15) \times [RT \ln(10)\,\pi_i] - 10.50\,(0.23); \quad R = 0.9278 \qquad (23)$$

where $T = 298$ K and $\pi_i$ is the contribution of the side chain of type $i$ to the logarithm of the partition coefficient between n-octanol and water, as determined by Fauchère and Pliška.60 The correlation graph is shown in Figure 5.

A similar correlation also holds with the diagonal contact free energies, $e_{ii}$:

$$-RT e_{ii} = 0.528\,(0.046) \times [RT \ln(10)\,\pi_i] - 1.197\,(0.070); \quad R = 0.9372 \qquad (24)$$

The correlation with other hydrophobicity scales derived on the basis of thermodynamic data, for example, those of Nozaki and Tanford,68 is worse (with $R = 0.8019$). The correlation is also worse ($R = 0.8518$) when the contact free energies obtained with $R_c = 6.5$ Å are used. The latter fact is understandable in view of the smaller number of contacts and therefore poorer statistics, especially for hydrophilic pairs.

The slopes of eqs. (23) and (24) were used to estimate the "true" free energies of contacts implemented in the sum of squares given by eq. (13). As in our earlier work,38,39 we considered the residue contact free energies to be composed of the parts due to side-chain–side-chain and backbone–backbone interaction. To obtain the peptide-group–peptide-group interaction free energy for any amino acid, we subtract from $e_{GlyGly}$ (obtained from the PDB) the contribution of the CH2 group of Gly. We take the latter as $-0.528\,RT \ln(10)\,\pi_{CH_2}$ from eq. (24), with $\pi_{CH_2} = 0.41$ (Ref. 60), which can be considered as an estimate of the glycine "side-chain–side-chain" contact free energy. Then, we rescale our contact free energies by introduction of the factor 1.60 of eq. (23) for nonproline residues. Further, because Pro has no backbone NH donor group, we have to reduce the corresponding estimate of the peptide-group–peptide-group interaction free energy by a factor39 $f_{Pro}$ or $f_{ProPro}$. Thus, the estimates of experimental contact free energies can be expressed by eq. (25):

$$\hat{F}_{ij} = \frac{RT}{1.60} \times \begin{cases} e_{ij} - \left( e_{GlyGly} + 0.528 \ln(10)\,\pi_{CH_2} \right), & \text{if both } i \text{ and } j \ne \text{Pro} \\ e_{ij} - f_{Pro} \left( e_{GlyGly} + 0.528 \ln(10)\,\pi_{CH_2} \right), & \text{if only one of } i \text{ or } j = \text{Pro} \\ e_{ij} - f_{ProPro} \left( e_{GlyGly} + 0.528 \ln(10)\,\pi_{CH_2} \right), & \text{if both } i \text{ and } j = \text{Pro} \end{cases} \qquad (25)$$

Finally, it should be noted that the computed contact free energies are additive to a good approximation, which is reflected in the following correlation equation:

$$e_{ij} = 1.050\,(0.020)\,(e_{ii} + e_{jj})/2 + 0.072\,(0.068); \quad R = 0.9669 \qquad (26)$$

in which the slope and intercept do not differ significantly from 1 and 0, respectively. The quantity $(e_{ii} + e_{jj})/2$ is called the ideal pair-interaction free energy, while the quantity $e_{ij} - (e_{ii} + e_{jj})/2$ is called the excess pair-interaction free energy.59 Equation (26), together with eq. (24), can serve to estimate the interresidue contact free energies of non-natural amino acids which do not occur in the data base of the structures of known proteins, but for which the water–n-octanol transfer free energies can be measured directly or estimated from QSAR equations. It must be noted, however, that, for quite a number of side-chain pairs, there are several outliers that depart from eq. (26) by more than the standard deviation; this is illustrated in
value of the function in the bin. These step sizes were chosen as a compromise between computational efficiency and the error caused by too coarse a grid in the integration. Use of a finer grid resulted in differences in free energies and histogram values less than 1%. To increase computational efficiency, minimization was first carried out with a coarse grid (i.e., $\delta D = \Delta r = 0.5$ Å, $\delta\theta = \delta\phi = \pi/6$) and then completed (starting from the computed parameters) using a finer grid ($\delta D = 0.125$ Å, $\delta\theta = \pi/24$, and $\delta\phi = \pi/6$).

The starting values of $\epsilon^{0}$ were the free energies of contacts calculated from eq. (25). The values of $\sigma^{0}$ in the LJK, GB, and GBV potentials and the values of $r^{0}$ in the LJ potential were initially assigned half the side-chain van der Waals distances calculated by Levitt.2 The initial values of $r^{0}_{ij}$ in the GBV and LJK potentials were 1.3 Å, this being the approximate van der Waals radius of the "outer" atoms of the side chains. For the anisotropic parameters $(\sigma_{\parallel}/\sigma_{\perp})^{2}$ and $\chi'$, one choice was based on the ratio of the long and the geometric mean of the shorter principal axes of the moments of inertia of the side chains calculated by averaging their geometry, and another start was from isotropic potentials. Both gave the same final results.

Based on eqs. (17), the final estimated ratios of the weights of eq. (13) were $w^{r} : w^{\theta\phi} : w^{r\theta} : w^{F} = 1 : 20 : 20 : 20$ for all models.

Equation (13) contains pair-specific and single-residue-specific terms. We initially made trial runs by assuming that all the parameters are pair-specific; that is, we avoided the relations in eqs. (5), (9), and (10). However, for most of the side-chain pairs, the results were unreasonable, with the standard deviations exceeding the parameter values. Therefore, we decided to use eqs. (5), (9), and (10) to express all the constants except $\epsilon_{ij}$ in terms of single-body constants.

The fit of the functional forms of the potentials considered in this work to experimental data is compared in terms of the F-test71 in Table II for the radial and anisotropic potentials, respectively. In the case of the radial potentials, the LJK model (shifted Lennard–Jones) appears clearly superior to the simple LJ model, the level of significance of introducing the "shift" parameters $r^{0}$ being close to 100%. A similar situation occurs for the anisotropic potential where the GBV functional
TABLE II.
Comparison of Fit of Various Radial and Anisotropic Potentials to Experimental Data.
Potential   Φ (a)   Φ^θφ   Φ^rθ   Φ^r   Φ^F × 1000   p (b)   ΔΦ (c)   F (d)
$$F_i = \frac{N - p_i}{p^{*} - p_i}\;\frac{\Phi(X_i) - \Phi(X^{*})}{\Phi(X^{*})}$$

where $N$ is the number of terms in the expression for $\Phi$; $N = 3451$ for the radial and 165,487 for the anisotropic potentials, $X^{*} = (x^{*}_{1}, x^{*}_{2}, \ldots, x^{*}_{p^{*}})^{T}$ is the vector of the parameters of the best model, and $X_i = (x_{i;1}, x_{i;2}, \ldots, x_{i;p_i})^{T}$ is the vector of the parameters of the $i$th inferior model, $p_i < p^{*}$ (see ref. 70). With the large values of $N$ taken in this study, the best-fitting potentials are effectively different from the inferior ones at the 100% significance level. Because the model with the BP potential is not nested in the model with the GBV potential, the F-test value is not given in this case.
(e) The whole sum of squares was minimized, but with the radial LJK potential, which can be considered as the GBV potential devoid of the angular terms.
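The F statistic defined in the table footnote compares a model with fewer parameters against the best-fitting one. A minimal sketch of that formula follows; the function name and argument names are hypothetical, and the numbers in the usage line are made up for illustration and are not values from Table II.

```python
def f_statistic(n_terms: int, p_inferior: int, p_best: int,
                phi_inferior: float, phi_best: float) -> float:
    """F_i = (N - p_i) / (p* - p_i) * (Phi(X_i) - Phi(X*)) / Phi(X*).

    n_terms      -- N, number of terms in the sum of squares
    p_inferior   -- p_i, number of parameters of the inferior model
    p_best       -- p*, number of parameters of the best model (p_i < p*)
    phi_inferior -- Phi(X_i), minimized sum of squares of the inferior model
    phi_best     -- Phi(X*), minimized sum of squares of the best model
    """
    if not p_inferior < p_best:
        raise ValueError("the inferior model must have fewer parameters")
    if phi_best <= 0:
        raise ValueError("Phi(X*) must be positive")
    return (n_terms - p_inferior) / (p_best - p_inferior) * (phi_inferior - phi_best) / phi_best


# Illustrative, made-up inputs: 110 terms, 10 vs. 20 parameters.
example = f_statistic(110, 10, 20, 3.0, 2.0)
```

Because $N$ is several thousand or more in this study while the parameter counts are far smaller, even a modest relative drop in $\Phi$ gives a large $F$, which is why the text describes the significance levels as effectively 100%.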
form, which allows for free "shifting" of $r$ in eq. (7), appears superior to both the BP and GB forms (however, BP is a model not nested in GBV and, therefore, we did not carry out its statistical comparison in terms of the F-test). Again, the significance level of introducing the new class of parameters $r^{0}$, when passing from GB to GBV, is effectively 100%. Of the two potentials with fewer parameters, the BP form gives a better fit to the experimental data. From Table II, it also follows that the LJK potential gives a significantly poorer fit to the data that contain both radial and angular terms, that is, the angular terms are statistically significant.

The lesser adequacy of the GB form, compared with the BP and GBV ones, also follows from the plot of the residuals in the contact free energies shown in Figure 7. For the GB potential, all residuals show an apparent trend: the negative ones occur for negative and positive ones for positive contact free energies. This trend is partially eliminated for the BP potential and remains only for strongly hydrophilic pairs in the case of the GBV potential. The last observation may indicate that none of the models is fully adequate for hydrophilic pairs. On the other hand, such pairs interact weakly and therefore this should not cause great concern.

Sample theoretical and experimental histograms of the radial and angular correlation functions (the latter being averaged for visualizing purposes over the dihedral angle $\phi$) corresponding to the Leu–Leu pair are shown in Figure 8A and 8B.

To test how the parameters of the side-chain interaction potentials depend on the choice of the data base, we evaluated the parameters of the
FIGURE 7. Plots of weighted residuals of the contact energies corresponding to the BP (crosses), GB (squares), and
GBV (diamonds) potentials.
FIGURE 8. Sample calculated (dashed line and dashed surface) and experimental (solid line and solid surfaces)
histograms of radial (A) and angular (B) correlation functions for the Leu-Leu pair. For visualizing, the histograms of the
angular correlation functions are averaged over the dihedral angle φ.
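The averaging over φ used for the angular histograms of Figure 8B can be sketched as follows. This is a minimal illustration, assuming that the histogram is stored as a nested list indexed as hist[l][m][n] over the θ(1), θ(2), and φ bins and that all φ bins have equal width, so that an unweighted mean is appropriate; the function name and the toy numbers are hypothetical.

```python
def average_over_phi(hist):
    """Average hist[l][m][n] over the last (phi) index.

    Returns a two-dimensional list indexed as [l][m].
    """
    return [[sum(phi_bins) / len(phi_bins) for phi_bins in row] for row in hist]


# Toy 1 x 2 x 3 histogram (illustrative numbers only).
toy = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
print(average_over_phi(toy))
```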
GBV potential, using the set of 42 protein structures of Miyazawa and Jernigan⁹; the GBV potential contains the greatest number of adjustable parameters and should, therefore, be the most sensitive to data-base selection. (In the section “Contact Free Energies,” we have already shown that the contact free energies determined from our data base of 195 protein structures are in very good agreement with those determined by Miyazawa and Jernigan.) For the values of ε⁰ of eq. (6), which range from −12 to +1.6, the correlation coefficient was 0.8569 and the mean-square difference was 0.3 kcal/mol. For other parameters for which the range is not so extensive, correlation coefficients of approximately 0.8 were obtained. In view of the fact that the two data bases have almost no structure in common and the MJ data base is much smaller than ours, the parameters of the GBV potential determined from the two data bases are reasonably consistent.

DISCUSSION OF COMPUTED PARAMETERS

The computed values of ε⁰_ij and the single-body parameters of eq. (6) and their standard deviations for the GBV side-chain interaction potential considered in this work are given in Tables IIIa and IIIb. The parameters for the other four simpler potential functions are included in Tables 2a, 2b to 5a, 5b of the Supplementary Material, which also contains the parameters for all five potentials in machine-readable form.

Except for ε⁰_Lys-Lys of the GBV model, and the constants χ′ and α, the parameters are well determinable and significantly greater than their standard deviations. For hydrophobic residues, the values of the well-depth anisotropy χ′ are small, although the anisotropy measures of the van der Waals radii σ∥/σ⊥ are significantly different from 1.0. Well-depth anisotropies appear significant for neutral and hydrophilic residues.

It is interesting to compare the computed parameters with contact free energies and estimates of the van der Waals radii and anisotropies that can be determined from the geometrical characteristics of the side chains. Such comparison is presented for the GBV model in Figure 9A-D. As shown, the contact free energies of hydrophobic pairs (for which ε_ij > 0) correlate quite well with their van der Waals well depths determined by minimizing Φ of eq. (13). The values of r⁰_L, used as the van der Waals distances in our earlier work, correlate with the values of σ⁰, with the definite exception of aromatic residues and arginine (Fig. 9B). A similar situation occurs when the values corresponding to the LJK potential are taken into account. The correlation is even better (with the exception of Lys) when the values of σ∥ calculated from σ⁰ and the ratio σ∥/σ⊥ are considered (Fig. 9C). On the other hand, there is no correlation between the values of r⁰ from our earlier work and the constants r⁰ of eq. (7).

There is no correlation between the parameters and the ratio of the long to the short axes of the side chains determined by diagonalizing the average matrices of the moments of inertia determined from the PDB. Thus, estimating these parameters based on the average dimensions of a side chain is incorrect. On the other hand, it is interesting to note that anisotropy parameters correlate with the
Cys Met Phe Ile Leu Val Trp Tyr Ala Gly Thr Ser Gln Asn Glu Asp His Arg Lys Pro
Cys 1.05 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.02
Met 1.26 1.45 0.01 0.01 0.01 0.01 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.03 0.03 0.02 0.02 0.02 0.03 0.02
Phe 1.19 1.34 1.27 0.01 0.01 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.01 0.01 0.02 0.01 0.01 0.01
Ile 1.30 1.47 1.41 1.58 0.01 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01
Leu 1.25 1.51 1.40 1.59 1.55 0.01 0.02 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01
0.03 0.03 0.02 0.01 0.01 0.01 0.03 0.02 0.02 0.02 0.01 0.02 0.07 0.10 0.25 0.09 0.04 0.01 18. 0.02
TABLE IIIb.
Calculated Values of Single-Body Parameters of the GBV Potential (Standard Deviations in Parentheses). Except for σ⁰ and r⁰, All Quantities Are Dimensionless.
Residue   σ⁰ (Å)            r⁰ (Å)            (σ∥/σ⊥)²          χ′                 α
Cys       2.3204 (0.0382)   5.7866 (0.2123)   2.6006 (0.2042)   −0.0025 (0.0155)   0.0299 (0.0070)
Met       2.4984 (0.0237)   3.5449 (0.1299)   4.4303 (0.2018)   0.0968 (0.0108)    0.0878 (0.0058)
Phe       2.2823 (0.0245)   6.3367 (0.1459)   3.9640 (0.1815)   0.0491 (0.0077)    0.0801 (0.0039)
Ile       2.5919 (0.0150)   4.4859 (0.0860)   3.2406 (0.0926)   0.0897 (0.0061)    0.0664 (0.0031)
Leu       2.8905 (0.0098)   3.3121 (0.0548)   2.3636 (0.0406)   0.0749 (0.0047)    0.1108 (0.0027)
Val       2.7251 (0.0125)   3.7770 (0.0629)   2.0347 (0.0514)   0.0770 (0.0062)    0.0679 (0.0032)
Trp       1.6947 (0.0644)   9.2904 (0.3832)   7.5089 (1.0060)   0.0731 (0.0170)    0.0549 (0.0076)
Tyr       2.1346 (0.0271)   4.8607 (0.1529)   5.9976 (0.3578)   0.1177 (0.0134)    0.0438 (0.0064)
Ala       2.4366 (0.0100)   2.1574 (0.0423)   1.8090 (0.0396)   0.0333 (0.0077)    0.1052 (0.0041)
Gly       2.3359 (0.0169)   2.5197 (0.0532)   1.0429 (0.0498)   0.2238 (0.0127)    −0.1277 (0.0073)
Thr       2.6047 (0.0188)   3.0723 (0.0996)   2.2451 (0.0899)   0.0236 (0.0162)    −0.0264 (0.0075)
Ser       2.4471 (0.0203)   2.2432 (0.0770)   1.6795 (0.0749)   −0.0029 (0.0184)   −0.0348 (0.0083)
Gln       2.6269 (0.0229)   1.1813 (0.1189)   2.6172 (0.1455)   0.2960 (0.0305)    0.0505 (0.0165)
Asn       2.6954 (0.0165)   0.7634 (0.0826)   2.0433 (0.0850)   0.2732 (0.0286)    −0.0064 (0.0152)
Glu       2.5933 (0.0191)   1.2819 (0.0874)   2.5707 (0.1327)   0.4904 (0.0332)    −0.0266 (0.0175)
Asp       2.5098 (0.0192)   1.4061 (0.0804)   1.9262 (0.0925)   0.3090 (0.0299)    0.0250 (0.0164)
His       2.3409 (0.0323)   3.3570 (0.1817)   3.6263 (0.2703)   0.1351 (0.0245)    0.0589 (0.0124)
Arg       2.3694 (0.0214)   1.8119 (0.1201)   6.6061 (0.3758)   0.2624 (0.0270)    0.0062 (0.0130)
Lys       2.7249 (0.0161)   0.2712 (0.0913)   8.0078 (0.2948)   0.5790 (0.0364)    0.0115 (0.0161)
Pro       2.7230 (0.0228)   3.3320 (0.1059)   1.7905 (0.0759)   −0.1105 (0.0160)   −0.0190 (0.0065)
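Two of the observations made about these parameters can be checked mechanically: whether the shift r⁰ exceeds σ⁰ for a residue, and whether a parameter is small compared with its standard deviation. The Python sketch below does this for six rows copied from Table IIIb. The two-standard-deviation threshold and the function names are assumptions made for illustration; the text does not state the criterion used by the authors.

```python
# (value, standard deviation) pairs copied from Table IIIb for six residues.
TABLE_IIIB = {
    "Cys": {"sigma0": (2.3204, 0.0382), "r0": (5.7866, 0.2123), "chi": (-0.0025, 0.0155)},
    "Leu": {"sigma0": (2.8905, 0.0098), "r0": (3.3121, 0.0548), "chi": (0.0749, 0.0047)},
    "Trp": {"sigma0": (1.6947, 0.0644), "r0": (9.2904, 0.3832), "chi": (0.0731, 0.0170)},
    "Ser": {"sigma0": (2.4471, 0.0203), "r0": (2.2432, 0.0770), "chi": (-0.0029, 0.0184)},
    "Asn": {"sigma0": (2.6954, 0.0165), "r0": (0.7634, 0.0826), "chi": (0.2732, 0.0286)},
    "Lys": {"sigma0": (2.7249, 0.0161), "r0": (0.2712, 0.0913), "chi": (0.5790, 0.0364)},
}


def shift_exceeds_radius(table):
    """Residues whose shift r0 is larger than sigma0."""
    return [res for res, p in table.items() if p["r0"][0] > p["sigma0"][0]]


def poorly_determined(table, key, n_sd=2.0):
    """Residues whose parameter `key` lies within n_sd standard deviations of zero.

    The default threshold of two standard deviations is an assumption.
    """
    return [res for res, p in table.items() if abs(p[key][0]) < n_sd * p[key][1]]


print(shift_exceeds_radius(TABLE_IIIB))
print(poorly_determined(TABLE_IIIB, "chi"))
```

For this subset, the hydrophobic and aromatic residues have r⁰ above σ⁰, and the well-depth anisotropy χ′ of Cys and Ser is indistinguishable from zero under the assumed threshold.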
dimensions of the side chains; larger side chains are more likely to exhibit more pronounced anisotropy (Fig. 9D).

Finally, it should be noted that, in the case of the LJK and GBV potentials, for many of the side chains, the constants, r⁰, exceed σ⁰ (see Table IIIb and Table 3b of the Supplementary Material). Particularly large r⁰ values occur for the aromatic residues which exhibit broad radial distributions. This means that the potential will rarely tend to infinity as side-chain separation approaches zero. This does not seem to be the result of inadequacy of the fitting procedure, because we included the regions in which the radial-correlation function is zero. We have also carried out additional trial runs by assuming lower exponents in eq. (2) than 6 and 12, which results in broadening the potential wells. However, we still obtained r⁰ > σ⁰ for most of the side chains. Thus, to use the LJK and GBV potentials in simulations, in these two cases we changed the general form of the potential given by eq. (2) to:

$$U_{ij} = 4\left[\,|\epsilon_{ij}|\,x_{ij}^{12} - \epsilon_{ij}\,x_{ij}^{6}\right] + 4\epsilon^{0}_{ij}\,h\!\left(r^{0}_{ij} - \tfrac{1}{2}\sigma^{0}_{ij}\right)\left(\frac{\sigma^{0}_{ij}}{r_{ij}}\right)^{12} \tag{27}$$

where x_ij is defined by eqs. (4) and (7) for the LJK and GBV potentials, respectively, and h(y) is the step function of y; h(y) = 0 for y ≤ 0 and h(y) = 1 otherwise.

Thus, the modified expression includes a short-range “repulsive core” potential which prevents the collapse of the side chains. Introduction of this repulsive core does not impair the fit of the potentials to the experimental data.

Conclusions

We have parameterized several functional forms for the potential of mean force of side-chain-side-chain interactions that are based on reasonable site-site interaction potentials used in molecular simulations. The parameters of the potentials have been determined consistently by fitting the energy expressions to the correlation functions and contact free energies obtained from high-resolution protein crystal data. Compared to related work on deriving the mean-field potentials from protein-crystal data,²⁴⁻³¹,³⁴⁻³⁶ our approach has two new features: inclusion of anisotropy of the free-energy surface, and explicit use of thermodynamic data to rescale the free energies of contacts [eq. (25)]. The
FIGURE 9. (A) Correlation between the hydrophobicity-scaled contact free energies, F_ij (abscissae), and the corresponding van der Waals well depths, ε⁰_ij, corresponding to the GBV potential (ordinates). The straight line is the least-squares line calculated for the “definitely hydrophobic” pairs with ε_ij ≥ 0 kcal/mol; its equation is ε = −0.865(0.040) F + 0.425(0.022); R = −0.8417. (B) Correlation between the mean radii of side-chain contacts derived by Levitt² (abscissae) with the computed values of σ⁰ of the GBV potential. After eliminating five apparent outliers: Phe, Tyr, Trp, His (aromatic side chains), and Arg, the equation is σ⁰ = 0.1552(0.041) r⁰_L + 1.72(0.23); R = 0.7254. (C) Correlation between Levitt’s mean side-chain contact distances and the values of σ∥ calculated from the parameters of the GBV potential. After eliminating lysine, the equation is σ∥ = 0.89(0.13) r⁰_L − 0.94(0.75); R = 0.8619. (D) Correlation between Levitt’s mean side-chain contact distances and the values of σ∥/σ⊥ corresponding to the GBV potential. After eliminating lysine, the equation is σ∥/σ⊥ = 0.463(0.072) r⁰_L − 0.97(0.43); R = 0.8417.
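The least-squares lines quoted in the caption of Figure 9 have the form slope(error) x + intercept(error), together with a correlation coefficient R. The Python sketch below shows one standard way to obtain such numbers, assuming that the values in parentheses are standard errors of the coefficients from an ordinary unweighted fit; the function name and the test data are hypothetical and are not taken from the paper.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept.

    Returns (slope, se_slope, intercept, se_intercept, r), where the
    standard errors use the residual variance with n - 2 degrees of freedom.
    Assumes x and y are not constant.
    """
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least three (x, y) pairs of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    resid_var = sse / (n - 2)
    se_slope = math.sqrt(resid_var / sxx)
    se_intercept = math.sqrt(resid_var * (1.0 / n + mean_x ** 2 / sxx))
    r = sxy / math.sqrt(sxx * syy)
    return slope, se_slope, intercept, se_intercept, r


# Hypothetical, exactly linear data: y = 2x + 1.
print(linear_fit([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]))
```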
second feature enables us to consider the computed free energies of side-chain interactions as absolute values that can be compared directly with experimental data and/or the results of calculations with all-atom potentials (including hydration).

The choice as to which potential to use in simulations should be based on the balance between the accuracy of the representation of free-energy surface and computational effort. Regarding these two issues, the potentials can be ordered as follows: GBV, BP and GB, LJK, LJ. The GBV potential (that includes angular dependence) most accurately represents the free-energy surface, but involves the greatest computational effort, whereas the LJ potential (radial-only) is the simplest, but least accurate representation of the free-energy surface, and should be used when the computation time is a significant issue.

Acknowledgments

This work was supported by Grant PB 190/T09/96/10 from the Polish State Committee for Scientific Research (KBN) (to A.L. and S.O.), by Grant AG 00322 from the National Institute on Aging (to S.R.), by Grant GM-14312 from the National Institute of General Medical Sciences, by Grant MCB95-13167 from the National Science Foundation (to H.A.S.), and by Grant CA 42500 from the National Cancer Institute (to M.R.P.). Computations were carried out with one processor of the IBM-SP2 computer at the Cornell National Supercomputer Facility, a resource of the Center for Theory and Simulation in Science and Engineering at Cornell University, which is funded by the National Science Foundation, New York State, the IBM Corporation, and members of its Corporate Research Institute, with additional funds from the National Institutes of Health.

Appendix: Definition and Calculation of Side-Chain Pair Correlation Functions from Protein-Crystal Data

Assume that we have a data base of n_p protein structures. Let n^(2)_ij;p(r, θ^(1), θ^(2), φ) denote the number density of pairs of side chains of types i and j at a distance r and orientation defined by the angles θ^(1), θ^(2), and φ for protein p, all assumed to be at the same temperature (for brevity of notation, we omit the side-chain-pair subscripts ij in the symbols of the variables throughout the Appendix). Because we cannot determine the actual density at a point from experimental data, instead we will consider n^(2)_ij;p(r, θ^(1), θ^(2), φ; T) defined as the average number density in the bins b_klmn = b(r_k, θ_l^(1), θ_m^(2), φ_n) = [r_k − Δr/2, r_k + Δr/2] × [θ_l^(1) − Δθ/2, θ_l^(1) + Δθ/2] × [θ_m^(2) − Δθ/2, θ_m^(2) + Δθ/2] × [φ_n − Δφ/2, φ_n + Δφ/2], k = 1, 2, ..., n_r, l = 1, 2, ..., n_θ, m = 1, 2, ..., n_θ, n = 1, 2, ..., n_φ:
where r_k = (k − 1/2)Δr, θ_l^(1) = (l − 1/2)Δθ, θ_m^(2) = (m − 1/2)Δθ, φ_n = (n − 1/2)Δφ, Δr = 0.5 Å, Δθ = 30°, Δφ = 30°, n_r^ij = int(r_ij^max/Δr), n_θ = π/Δθ = 6, n_φ = 2π/Δφ = 12, where int is the integer part of a number; the values of r^max are defined by eq. (12).

The pair number density n^(2) can be decomposed into the pair correlation function for residues of types i and j, g_ij(r, θ^(1), θ^(2), φ) (assumed to depend only on the types of the side chains and not on the protein in which they reside) and the reference-state pair number density n^(2,0)_ij;p(r, θ^(1), θ^(2), φ), corresponding to a hypothetical chain with noninteracting side chains. Thus, given the actual and reference pair number densities that can be evaluated from the data base of protein structures, the pair correlation function can be calculated as follows:

$$g_{ij}(r,\theta^{(1)},\theta^{(2)},\phi) = \frac{\sum_{p=1}^{n_p} w_p\, n^{(2)}_{ij;p}(r,\theta^{(1)},\theta^{(2)},\phi)}{\sum_{p=1}^{n_p} w_p\, n^{(2,0)}_{ij;p}(r,\theta^{(1)},\theta^{(2)},\phi)} \tag{A-2}$$

where w_p is the statistical weight of the pth protein in the sample; the choice of weights is discussed in the Results section [eq. (20)].
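The binning and the weighted ratio of eq. (A-2) can be sketched in a few lines of Python. This is an illustration only: the function names, the list-based data layout, and the numerical inputs are assumptions, angles are taken in degrees, and points lying exactly on the upper boundary of the last bin are not treated specially.

```python
DELTA_R = 0.5       # bin width in r, in angstroms
DELTA_ANGLE = 30.0  # bin width in theta(1), theta(2), and phi, in degrees


def bin_index(r, theta1, theta2, phi):
    """1-based indices (k, l, m, n) of the bin containing a point.

    Bin centers lie at (index - 1/2) * width, as in the text.
    """
    return (
        int(r / DELTA_R) + 1,
        int(theta1 / DELTA_ANGLE) + 1,
        int(theta2 / DELTA_ANGLE) + 1,
        int(phi / DELTA_ANGLE) + 1,
    )


def pair_correlation(weights, actual, reference):
    """Eq. (A-2) for a single bin.

    weights[p] is the statistical weight of protein p; actual[p] and
    reference[p] are its actual and reference pair number densities.
    """
    numerator = sum(w * n for w, n in zip(weights, actual))
    denominator = sum(w * n0 for w, n0 in zip(weights, reference))
    if denominator == 0:
        raise ValueError("reference density is zero in this bin")
    return numerator / denominator


# Hypothetical two-protein example for one bin.
print(bin_index(0.3, 10.0, 40.0, 100.0))
print(pair_correlation([1.0, 0.5], [2.0, 4.0], [1.0, 2.0]))
```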
θ_−^(1) = θ^(1) − Δθ/2, θ_+^(1) = θ^(1) + Δθ/2,
θ_−^(2) = θ^(2) − Δθ/2, θ_+^(2) = θ^(2) + Δθ/2,
φ_− = φ − Δφ/2, φ_+ = φ + Δφ/2,

$$dV = \rho^{2}\,\sin\vartheta^{(1)}\,\sin\vartheta^{(2)}\,d\rho\,d\vartheta^{(1)}\,d\vartheta^{(2)}\,d\varphi$$

and:

$$\Delta V = \tfrac{1}{3}\left(r_+^{3} - r_-^{3}\right)\left(\cos\theta_-^{(1)} - \cos\theta_+^{(1)}\right)\left(\cos\theta_-^{(2)} - \cos\theta_+^{(2)}\right)\Delta\phi = \tfrac{1}{3}\,\Delta r^{3}\,\Delta\cos\theta^{(1)}\,\Delta\cos\theta^{(2)}\,\Delta\phi$$

The limited number of available protein structures still makes it impossible to determine the pair correlation functions with reasonable accuracy. This is easily realized because even the choice of a coarse grid of Δr = 0.5 Å, Δθ = Δφ = 30° with implementation of the symmetry of the hypersurface in φ (only the interval [0°, 180°] needs to be considered) yields 16 × 6 × 6 × 6 = 3456 bins [according to eq. (12) we take a maximum 8-Å coordination sphere for a residue] for which the average correlation functions would have to be determined. Within this coordination sphere, we have at best about 5000 points per residue pair, which would mean an average of about 1.4 counts per bin. Therefore, in the fitting procedure we use the correlation functions averaged over some radial and angular variables, respectively; the function averaged over φ, obtained by integrating g_ij(ρ; ϑ^(1), ϑ^(2), φ) dV over r_− ≤ ρ ≤ r_+, θ_−^(1) ≤ ϑ^(1) ≤ θ_+^(1), θ_−^(2) ≤ ϑ^(2) ≤ θ_+^(2), and 0 ≤ φ ≤ 2π, is:

$$g^{r\theta}_{ij}(r,\theta^{(1)},\theta^{(2)}) = \frac{\sum_{p=1}^{n_p} w_p\, n^{(2,r\theta)}_{ij;p}(r,\theta^{(1)},\theta^{(2)})}{n^{(0,2,r\theta)}(r,\theta^{(1)},\theta^{(2)})} \tag{A-6}$$

where g^r, g^θφ, and g^rθ denote the correlation functions averaged over all angles (θ^(1), θ^(2), and φ), r, and the rotation angle, φ, respectively; we noticed that the dependence of the distribution function on φ is the weakest and, therefore, chose to average over φ to obtain a mixed radial and angular correlation function [eq. (A-6)]. Likewise, n^(2,r)(r), n^(2,θφ)(θ^(1), θ^(2), φ), and n^(2,rθ)(r, θ^(1), θ^(2)) denote the average number densities within [r_−, r_+], [θ_−^(1), θ_+^(1)] × [θ_−^(2), θ_+^(2)] × [φ_−, φ_+], and [r_−, r_+] × [θ_−^(1), θ_+^(1)] × [θ_−^(2), θ_+^(2)], respectively.

The reference pair distribution functions must still be defined. We assume that they can be decomposed into the radial and angular part, and that the radial part can be expressed as a product of the Markovian factor, M_ij;p(r), arising from the fact that the side chains are on a polypeptide chain,⁹ a factor accounting for the finite dimensions and nonuniform residue density, T_ij;p(r),⁷² of protein molecules, and the “background” angular distribution function, Ω_ij(θ^(1), θ^(2), φ), as given by eqs. (A-7)-(A-9).

$$n^{(0,2,r)}(r) = \sum_{p=1}^{n_p} w_p\, M_{ij;p}(r)\, T_{ij;p}(r) \tag{A-7}$$
where N_ij(r ≤ R_c) = Σ_{p=1}^{n_p} w_p n_ij;p(r ≤ R_c) is the weighted total number of side chains of types i and j in the protein-structure data base, which are separated by at least 10 peptide groups and whose distance is less than the assumed radius, R_c, of the coordination sphere of a residue (assumed to be 8 Å in this work).

We assumed the reference function to be independent of side chain and protein; that is, it includes all side chains in all the proteins used.

The components M_ij;p(r) and T_ij;p(r) of the radial reference functions are defined as follows⁹,⁷²:

$$M_{ij;p}(r) = \sum_{k \ge 10} n_{ij;k;p}\, P(r;k) \tag{A-10}$$

where k is the number of peptide groups separating side chains of types i and j, n_ij;k;p is the total number of pairs of side chains of types i and j separated by k peptide groups in the data base of protein crystal structures, and P(r; k) is the Markovian probability density that two side chains separated by k peptide groups are at the distance r; we assumed the form given by eq. (25) of ref. 9 for P(r; k). From ref. 72:

$$T_{ij;p}(r) = \frac{1}{4\pi r^{2} V_p}\int_{S_p}\int_{S_p \cap S(\mathbf{x},r)} \rho_{i;p}(\mathbf{x})\,\rho_{j;p}(\mathbf{y})\,d^{2}y\,d^{3}x \tag{A-11}$$

where S_p and V_p denote the region of space occupied by the pth protein and the volume of this region, respectively, S(x, r) is the sphere of radius r centered at the point x, and ρ_i and ρ_j are the average single-body densities of residues of types i and j; we assume that they depend only on the ratio of the distance from the center of a protein to its radius of gyration. The corresponding average density, used for ρ_i;p and ρ_j;p, is calculated from eq. (A-12).

$$\rho_i(\xi) = \frac{\sum_{p=1}^{n_p} w_p\, n_{i;p}\!\left(\xi - \Delta\xi/2 \le r/r_p^{gy} \le \xi + \Delta\xi/2\right)}{\sum_{p=1}^{n_p} w_p\, n_{i;p}} \tag{A-12}$$

where ξ is the distance from the center of a protein scaled by the radius of gyration, r^gy, n_i;p(r/r_p^gy) is the average number density of residues i at r/r_p^gy, and n_i;p is the total number of residues of type i in the pth protein. We used a step size of Δξ = 0.1.

The angular reference function, Ω_ij(θ^(1), θ^(2), φ), was calculated as the average of the angular correlation functions, g(θ^(1), θ^(2), φ), averaged over side-chain types:

$$\Omega(\theta^{(1)},\theta^{(2)},\phi) = \frac{1}{210}\sum_{i=1}^{20}\sum_{j=i}^{20} g^{\theta\phi}_{ij}(\theta^{(1)},\theta^{(2)},\phi) \tag{A-13}$$

This “background” angular correlation function, n^(0,2,θφ), is qualitatively similar to the function that we obtained in a test Monte Carlo simulation of 1000 random-sequence and random-conformation 50-residue polypeptide chains confined to an average volume⁶⁷ characteristic of a 50-residue protein, using the united-residue potential of our earlier work³⁸,³⁹ and a procedure developed by Hao et al. for confined-space simulations.⁶⁷ The united-residue force field did not include any side-chain anisotropy.³⁸,³⁹ Nevertheless, the average angular correlation function exhibits some anisotropy (Fig. 4). This results from the fact that one end of each side chain is tethered to the backbone and therefore the “free” ends of the side chains can easily approach each other, whereas the tethered ends cannot. The similarity of the “background” angular correlation function obtained from protein crystal data to the function obtained in simulations with a radial potential (shown in Fig. 4) justifies its use as the reference angular correlation function.

References

1. M. Levitt and A. Warshel, Nature, 253, 694 (1975).
2. M. Levitt, J. Mol. Biol., 104, 59 (1976).
3. M. R. Pincus and H. A. Scheraga, J. Phys. Chem., 81, 1579 (1977).
4. P. R. Gerber, Biopolymers, 32, 1003 (1992).
5. A. Wallqvist and M. Ullner, Proteins, 18, 267 (1994).
6. A. Rey and J. Skolnick, Proteins, 16, 8 (1993).
7. A. Rey and J. Skolnick, J. Chem. Phys., 100, 2267 (1994).
8. S. Tanaka and H. A. Scheraga, Macromolecules, 9, 945 (1976).
9. S. Miyazawa and R. L. Jernigan, Macromolecules, 18, 534 (1985).
10. L. M. Gregoret and F. E. Cohen, J. Mol. Biol., 211, 959 (1990).
11. D. G. Covell, Proteins, 14, 409 (1992).
12. J. Skolnick and A. Koliński, Science, 250, 1121 (1990).
13. A. Koliński and J. Skolnick, J. Chem. Phys., 97, 9412 (1992).
14. A. Koliński, A. Godzik, and J. Skolnick, J. Chem. Phys., 98, 7420 (1993).
15. A. Godzik, A. Koliński, and J. Skolnick, J. Comput.-Aid. Mol. Des., 7, 397 (1993).
16. J. Skolnick, A. Koliński, C. L. Brooks, III, A. Godzik, and A. Rey, Cur. Biol., 3, 414 (1993).
17. A. Koliński and J. Skolnick, Proteins, 18, 338 (1994).
18. A. Koliński and J. Skolnick, Proteins, 18, 353 (1994).
19. M. Vieth, A. Koliński, C. L. Brooks, III, and J. Skolnick, J. Mol. Biol., 237, 361 (1994).
20. R. A. Goldstein, Z. A. Luthey-Schulten, and P. G. Wolynes, Proc. Natl. Acad. Sci. USA, 89, 9029 (1992).
21. N. S. Goel and M. Yčas, J. Theor. Biol., 77, 253 (1979).
22. H. Wako and H. A. Scheraga, J. Prot. Chem., 1, 5 (1982).
23. H. Wako and H. A. Scheraga, J. Prot. Chem., 1, 85 (1982).
24. G. M. Crippen and V. N. Viswanadhan, Int. J. Peptide Prot. Res., 24, 279 (1984).
25. G. M. Crippen and V. N. Viswanadhan, Int. J. Peptide Prot. Res., 25, 487 (1985).
26. G. M. Crippen and P. K. Ponnuswamy, J. Comput. Chem., 8, 972 (1987).
27. G. M. Crippen and M. E. Snow, Biopolymers, 29, 1479 (1990).
28. P. Seetharamulu and G. M. Crippen, J. Math. Chem., 6, 91 (1991).
29. V. N. Maiorov and G. M. Crippen, J. Mol. Biol., 227, 876 (1992).
30. V. N. Maiorov and G. M. Crippen, Proteins, 20, 167 (1994).
31. G. M. Crippen and V. N. Maiorov, In Protein Structure Distance Analysis, H. Bohr and S. Brunak, Eds., IOS Press, Amsterdam, 1994, p. 158.
32. C. Wilson and S. Doniach, Proteins, 6, 193 (1989).
33. K. Nishikawa and Y. Matsuo, Prot. Eng., 6, 811 (1993).
34. M. J. Sippl, J. Mol. Biol., 213, 859 (1990).
35. G. Casari and M. J. Sippl, J. Mol. Biol., 224, 725 (1992).
36. M. J. Sippl, J. Comput.-Aid. Mol. Des., 7, 473 (1993).
37. S. Sun, Prot. Sci., 2, 762 (1993).
38. A. Liwo, M. R. Pincus, R. J. Wawak, S. Rackovsky, and H. A. Scheraga, Prot. Sci., 2, 1697 (1993).
39. A. Liwo, M. R. Pincus, R. J. Wawak, S. Rackovsky, and H. A. Scheraga, Prot. Sci., 2, 1715 (1993).
40. K. A. Dill, Biochemistry, 29, 7133 (1990).
41. E. I. Shakhnovich and A. M. Gutin, Proc. Natl. Acad. Sci. USA, 90, 7195 (1993).
42. M.-H. Hao and H. A. Scheraga, J. Phys. Chem., 98, 4940 (1994).
43. Z. Li and H. A. Scheraga, Proc. Natl. Acad. Sci. USA, 84, 6611 (1987).
44. Z. Li and H. A. Scheraga, J. Mol. Struct. (Theochem), 179, 333 (1988).
45. J. Kostrowicki and H. A. Scheraga, J. Phys. Chem., 96, 7442 (1992).
46. K. A. Olszewski, L. Piela, and H. A. Scheraga, J. Phys. Chem., 97, 267 (1993).
47. A. Liwo, M. R. Pincus, R. J. Wawak, S. Rackovsky, S. Ołdziej, and H. A. Scheraga, J. Comput. Chem. (accompanying article).
48. F. A. Momany, R. F. McGuire, A. W. Burgess, and H. A. Scheraga, J. Phys. Chem., 79, 2361 (1975).
49. G. Némethy, M. S. Pottle, and H. A. Scheraga, J. Phys. Chem., 87, 1883 (1983).
50. I. K. Roterman, M. H. Lambert, K. D. Gibson, and H. A. Scheraga, J. Biomol. Struct. Dyn., 7, 421 (1989).
51. Cited in: H. Margenau and N. R. Kestner, Theory of Intermolecular Forces, Pergamon Press, Oxford, p. 107, 1st ed. (1969).
52. B. J. Berne and P. Pechukas, J. Chem. Phys., 56, 4213 (1972).
53. J. G. Gay and B. J. Berne, J. Chem. Phys., 74, 3316 (1981).
54. G. R. Luckhurst, R. A. Stephens, and R. W. Phippen, Liquid Cryst., 8, 451 (1990).
55. A. P. J. Emerson, R. Hashim, and G. R. Luckhurst, Mol. Phys., 76, 241 (1992).
56. Y. N. Vorobjev, Biopolymers, 29, 1503 (1990).
57. Y. N. Vorobjev, Biopolymers, 29, 1519 (1990).
58. H. B. Bürgi and J. D. Dunitz, Acc. Chem. Res., 16, 153 (1983).
59. A. Godzik, A. Koliński, and J. Skolnick, Prot. Sci., 4, 2107 (1995).
60. J.-L. Fauchère and V. Pliška, Eur. J. Med. Chem., 18, 369 (1983).
61. R. J. Carroll and D. Ruppert, Transformation and Weighting in Regression, Chapman and Hall, New York, 1988, p. 13.
62. D. A. Ratkowsky, Handbook of Nonlinear Regression Models, Marcel Dekker, New York, 1990, p. 38.
63. D. J. Lipman and W. R. Pearson, Science, 227, 1435 (1985).
64. W. R. Pearson and D. J. Lipman, Proc. Natl. Acad. Sci. USA, 85, 2444 (1988).
65. H. Späth, Cluster Analysis Algorithms, Halsted Press, New York, 1980, p. 170.
66. H. H. Gan and B. C. Eu, J. Chem. Phys., 100, 5922 (1994).
67. M. H. Hao, S. Rackovsky, A. Liwo, M. R. Pincus, and H. A. Scheraga, Proc. Natl. Acad. Sci. USA, 89, 6614 (1992).
68. Y. Nozaki and C. Tanford, J. Biol. Chem., 246, 2211 (1971).
69. A. Magalhaes, B. Maigret, J. Hoflack, J. N. F. Gomes, and H. A. Scheraga, J. Prot. Chem., 13, 195 (1994).
70. D. W. Marquardt, J. Soc. Indust. Appl. Math., 11, 431 (1963).
71. G. A. F. Seber and C. J. Wild, Nonlinear Regression, Wiley, New York, 1989, p. 228.
72. J. Edelman, Biopolymers, 32, 3 (1992).