
The Foldon Universe: A Survey of Structural Similarity and Self-recognition of Independently Folding Units


Anna R. Panchenko, Zaida Luthey-Schulten, Ronald Cole
and Peter G. Wolynes*
School of Chemical Sciences
University of Illinois, Urbana
IL, 61801, USA
We have identified independently folding units, so-called "foldons",
from non-homologous proteins representing different folds. We applied
simple statistical arguments in order to estimate the size of the foldon
universe required to construct all foldable proteins. Various alignment
procedures yield about 2600 foldons in the natural protein universe, but
this estimate is shown to be rather sensitive to the chosen cut-off value
for structural similarity. We showed that foldon matching-modelling can
reproduce the major part of the main chain of several proteins with a
structural similarity measure Q-score of about 0.4 and an r.m.s. error of
about 5 Å, although the accuracy of structure prediction has been limited
so far by the small size of the foldon data set. The prediction score may
be increased if one uses the set of protein fragments with optimized
sequence-structure relationships, in other words, minimally frustrated
segments. To quantify the degree of frustration of the structures of
foldons from our database, we searched for those foldons which recognize
their own sequence and structure upon threading. As a result we found
that about half of the foldons from our data set recognize themselves as
the best choice upon threading and therefore are individually minimally
frustrated. We showed that there is a close connection between the
Q-score of self-recognition and the relative foldability (Θ) of the
folding units. Foldons having high Q-score and Θ values are expected to
be formed in the early phase of the folding process and be observed as
stable intermediates under appropriate experimental conditions.
© 1997 Academic Press Limited
Keywords: foldon; protein fragment; structure prediction; minimal
frustration; threading
*Corresponding author
Introduction
The complexity of protein sequence and structure and their relationship
would be much easier to understand if proteins could be decomposed
into parts. Independently folded regions of various sizes have been
found experimentally (Wetlaufer, 1981; Murphy et al., 1992; Griko et al.,
1992; Ikura et al., 1993; Kippen et al., 1994; Jennings & Wright, 1993)
as well as investigated by different computational methods (Segawa &
Richards, 1988; Holm & Sander, 1994; Hirst & Brooks, 1995). In our
recent work we used an optimized data-based energy function and the
concept of minimal frustration (Bryngelson & Wolynes, 1987) to define
independently folding units, which we call "foldons" (Panchenko et al.,
1996). According to our algorithm, foldons may be characterized as
contiguous chain segments that maximize the relative foldability
measure, Θ, arising from intra-foldon interactions. The relative
foldability can be expressed as Θ = ΔE/(δE √N) and can be estimated
using the energy landscape analysis of protein folding, which argues
that folding most readily occurs when correctly folded states are highly
stabilized with respect to all alternatively folded states. Here, ΔE is
the energy difference between the correct folded state and the mean of
the distribution over misfolded compact structures; δE is the standard
deviation of the distribution over misfolded states.
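The definition above can be illustrated with a minimal numerical sketch. It assumes that N is the number of residues in the segment, that the gap is counted as positive when the native state lies below the mean misfolded energy, and that the population standard deviation is the intended spread; none of these conventions is spelled out in this section. The function name and the toy energies are hypothetical and are not data from the paper.

```python
import math
import statistics


def relative_foldability(native_energy, misfolded_energies, n_residues):
    """Sketch of Theta = gap / (spread * sqrt(N)) under the assumptions above."""
    mean_misfolded = statistics.fmean(misfolded_energies)
    # Gap between the mean misfolded energy and the native energy
    # (positive when the native state is more stable).
    gap = mean_misfolded - native_energy
    # Standard deviation of the misfolded-state energy distribution.
    spread = statistics.pstdev(misfolded_energies)
    return gap / (spread * math.sqrt(n_residues))


# Toy numbers for illustration only.
decoy_energies = [-10.0, -12.0, -8.0, -11.0, -9.0]
theta = relative_foldability(-20.0, decoy_energies, n_residues=25)
print(round(theta, 3))
```

Dividing by the square root of N is what would make segments of different length comparable in this sketch, since the gap is expected to grow faster with length than the spread.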
Several studies on the classification of structures of proteins and
domains (Orengo et al., 1994; Islam et al., 1995; Sowdhamini et al.,
1996; Murzin et al., 1995; Fischer et al., 1995) have shown that
similarities in structure and topology are most evident at the level of
domains rather than entire proteins, and that conformational motifs
recur with very high frequency. Using potentials of mean force,
Wodak and co-workers have argued that some
J. Mol. Biol. (1997) 272, 95-105
short segments having a particularly high correlation between their
sequence and structure correspond well to the sites that form early
during folding (Rooman et al., 1991, 1992). Preliminary studies of
foldon structures show that some of them are structurally similar to
each other and may thus have the same evolutionary origin. Knowledge of
structural and evolutionary relationships between foldons would be of
great use in understanding the protein folding process and the origin of
protein structural complexity. Here we attempt to determine the number
of different kinds of foldons required to construct all foldable
proteins. To this end we first make structural comparisons of foldons
and extract structurally similar ones from our original data set.
It is important to note that foldon boundaries, unlike those of
structural domains, depend on the sequence of the protein. Therefore,
domains defined by purely structural criteria may differ from the
foldons of a given protein in size and structure.
Another characteristic of foldons, which can be used in classification,
is their relative foldability Θ. Foldons with large Θ should, in
general, exhibit some of the properties of whole proteins such as an
ability to fold independently and to recognize their own sequence and
structure in threading algorithms. In most energy-based alignments the
sequence of an entire protein recognizes its own structure as being the
lowest energy alignment (Hendlich et al., 1990; Bryant & Lawrence, 1993;
Jones et al., 1992; Huang et al., 1995; Koretke et al., 1996; DeBolt &
Skolnick, 1996). This indicates that the sequence of a protein fits its
own structure rather well. The situation in which long-range
interactions do not contradict short-range interactions, i.e. secondary
structure propensities are consistent with the tertiary interactions,
has been called the "consistency principle" (Go, 1983). This, in turn,
has motivated the more general principle of minimal frustration
(Bryngelson & Wolynes, 1987; Bryngelson et al., 1995, and references
cited therein).
Conceptually, the original version of the consistency principle would
seem to require that any division of such a protein into smaller parts
produce subunits with consistent interactions. The energy and structure
of these subunits should be the same in isolation and within the mean
field of the native protein, provided the interaction with the solvent
does not dominate. Experimental data show that some of the protein
fragments do in fact retain their native structure when isolated from
the rest of the protein and can fold reversibly and independently into
the native state (Wetlaufer, 1981; Griko et al., 1992; Ikura et al.,
1993; Kippen et al., 1994). On the other hand, the structure of some
protein fragments is distorted upon isolation, and consequent cleavage
of the polypeptide chain can produce segments whose stable conformations
have low structural similarity to their conformations in the native
state (Wetlaufer, 1981; Gay et al., 1995). This indicates the existence
of frustration of the local-in-sequence interactions within these
protein segments by the interaction with the rest of the protein.
There are different ways to overcome frustration during folding. In many
models of kinetically foldable proteins (Bryngelson & Wolynes, 1987,
1989; Onuchic et al., 1995; Leopold et al., 1992) the driving force for
folding largely resides in tertiary interactions alone, which start to
dominate over local interactions as folding proceeds. In view of this,
the consistency principle required generalization and a more
quantitative formulation, which is provided by the energy landscape
theory. In this theory, interactions that stabilize native
configurations over others lead to a folding funnel, resulting in fast
folding either by a downhill mechanism or by a mechanism involving
activation to a transition state with a relatively low barrier. The
competition between self-organization of folding in the main funnel and
the kinetic trapping in subsidiary funnels, occurring obligatorily at a
local glass temperature, largely determines the overall rate of the
folding process. Folding can be very fast if the folding temperature T_f
is much greater than the glass temperature T_g. Under these conditions
we have "minimal frustration". Minimal frustration quantifies the extent
to which all interactions work together to produce a folded, rather than
a misfolded, structure and implies smoothness of the energy landscape
and relatively high rates of protein folding. Uncertainty about the
configurational entropy of molten globule states makes it difficult to
give a precise value for T_f/T_g, but if entropy is extensive in chain
length, T_f/T_g is a monotonically increasing function of Θ. Thus, the
principle of minimal frustration allows us to estimate quantitatively
how fast a protein segment with partially consistent internal
interactions can fold into its native state. If several segments have
large Θ values, this implies that these non-contiguous segments may
appear as stable intermediates in protein folding.
The overall smoothness of the energy landscape as well as the stability
of proteins are probably the result of a long evolutionary process. If
these factors represented the main selection pressure of evolution and
were strong enough, the protein molecules could have become very stable,
quickly foldable, perfectly unfrustrated systems. However, there are
other selection pressures, namely that foldable proteins should carry
out specific functions and should interact properly with other
components of the cell.
In the first section of this paper we determine the number of different
kinds of foldons required to construct all foldable proteins. To this
end we first make structural comparisons of many foldons and extract
structurally similar ones from our original data set. In the second
section we check whether our foldon data set is large enough to rebuild
the backbones of proteins and evaluate the accuracy of the prediction
using foldon matching modelling. In the last part of this paper we study
how frustrated the structures of foldons from our representative data
set really are. In order to do this we search for foldons which
recognize their own sequence and structure upon threading. This gives us
an opportunity to understand to what extent the "consistency principle"
of local versus non-local interactions is the dominant mode of achieving
minimal frustration used by evolution.
Results
Criterion for foldon structural similarity
It can be seen from the distribution of Q-scores (Figure 1) that the
distributions for sequence-sequence and sequence-structure alignments
overlap each other in the neighborhood of Q ≈ 0.2. Alignment of randomly
permuted sequences of target foldons with template structures yields a
mean Q-score value of 0.15 (±0.09). As shown in Figure 1, starting from
the value Q > 0.3 there is an excess of matches for sequence-structure
alignment compared to sequence-sequence alignment. This indicates that
for Q > 0.3 sequence-structure alignment is able to discriminate
structurally similar foldons with low sequence identity from the random
matches of non-similar foldons with low sequence identity. A precise
cut-off value for the Q-score is difficult to determine. Visual
examination of the structures of foldon pairs with Q ≈ 0.3 shows that
although many structures are very similar, others are not. Thus, Q = 0.3
can be used as a cut-off value only fitfully.
In order to find a more appropriate cut-off value we compared structures
of 25 homologous proteins from the globin family. The maximum of the
Q-score distribution for this data set is positioned at Q = 0.42, and we
used this value as the cut-off for most of our studies. This cut-off
corresponds to r.m.s. deviations of less than 5 Å and is in agreement
with the r.m.s. threshold of ≈5.25 Å obtained for 40-residue-long
α-turn-α motifs (Wintjens et al., 1996). Using this cut-off value most
of the homologous proteins are classified as identical. According to our
similarity criterion two foldons A and B are assumed to be structurally
identical if the Q-score is greater than 0.42 for both alignments:
alignment of sequence A with the structure of foldon B and alignment of
sequence B with the structure of foldon A. Using this criterion we are
able to eliminate the dependence of the Q-score on the number of
residues of the target sequence and avoid matches comprising a small
number of residues. Our algorithm considers only alignments of segments
of similar length as a good match, and foldons with significantly
different lengths are assumed to be different. In most cases the lowest
energy alignment corresponds to the alignment of the entire target
foldon, and only in a few cases may the best alignment leave up to three
residues uncovered at the end of the target foldon.
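The two-way similarity criterion described above can be sketched as follows. This is an illustration only: the Q-scores are taken as given (their definition, equation (3), is not part of this section), and the function names, the dictionary layout and the toy values are hypothetical.

```python
from itertools import combinations

Q_CUTOFF = 0.42  # cut-off value quoted in the text


def are_structurally_identical(q_a_on_b, q_b_on_a, cutoff=Q_CUTOFF):
    """Both alignments (sequence A on structure B, and B on A) must exceed the cut-off."""
    return q_a_on_b > cutoff and q_b_on_a > cutoff


def similar_pairs(q_scores, cutoff=Q_CUTOFF):
    """q_scores[(a, b)] holds the Q-score for sequence a aligned to structure b."""
    names = sorted({name for pair in q_scores for name in pair})
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if are_structurally_identical(q_scores[(a, b)], q_scores[(b, a)], cutoff)
    ]


# Toy Q-scores for three hypothetical foldons.
toy_scores = {
    ("f1", "f2"): 0.55, ("f2", "f1"): 0.47,  # passes in both directions
    ("f1", "f3"): 0.60, ("f3", "f1"): 0.30,  # passes in one direction only
    ("f2", "f3"): 0.10, ("f3", "f2"): 0.12,  # fails in both directions
}
pairs = similar_pairs(toy_scores)
print(pairs)
```

Requiring both directions to pass is what rejects the one-sided match between f1 and f3, which is the kind of match a short segment aligned to part of a longer one would produce.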
Estimated size of the foldon universe
We first address the question of how many structurally different kinds
of foldons are present in our data set. In order to find structurally
similar foldons we made 35,910 (190 × 189) pair-wise foldon comparisons
based on sequence-sequence and sequence-structure alignments with no
gaps allowed (see Table 1). That is to say, the sequence of each foldon
was aligned to both the sequence and the structure of all other foldons
from our data set. As a result of foldon alignments, seven foldon pairs
and no triplets were found at the Q = 0.42 level using either the
original energy functions or self-consistently optimized energy
functions. For comparison, at the level of Q = 0.3 we obtain 35 foldon
doublets and 28 triplets. Two examples of structurally similar foldon
pairs with different Q-scores are presented in Figure 2(a) and (b).
In order to estimate the number of different kinds of foldons in the
underlying set of proteins, we assume that the underlying set of foldons
of size N is randomly mixed and each foldon has the same probability of
being found in any protein. Then, the probability of observing any
foldon doublet in the sub-set would be 1/N. If we select from the
underlying set of foldons a sub-set of size n, then the number of
possible foldon pairs occurring upon pair-wise foldon comparison would
be n(n − 1)/2. The number of matches representing pairs of the same kind
of foldon can be estimated as n(n − 1)/2N, which, for the seven pairs
found among n = 190 foldons at Q = 0.42, yields about 2600 foldons in
the universe. The very generous cut-off Q = 0.3 would yield a smaller
size, N ≈ 200 to 500 foldons.
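The arithmetic behind this estimate can be checked with a short sketch. It simply solves n(n − 1)/2N = (number of matched pairs) for N, using the sample size of 190 foldons and the seven pairs reported at Q = 0.42; the function name is hypothetical.

```python
def foldon_universe_size(sample_size, matched_pairs):
    """Solve n(n - 1) / (2 N) = matched_pairs for N under the uniform-mixing assumption."""
    if matched_pairs <= 0:
        raise ValueError("at least one matched pair is needed for the estimate")
    possible_pairs = sample_size * (sample_size - 1) / 2
    return possible_pairs / matched_pairs


n_strict = foldon_universe_size(190, 7)
print(round(n_strict))  # 2565, i.e. about 2600
```

Counting only the 35 doublets found at Q = 0.3 in the same way gives roughly 513. How the 28 triplets enter the count is not stated in the text, so the quoted range of 200 to 500 is not reproduced here.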
The approximation we used regarding the uniform distribution of protein
folds is rather rough. A more refined argument should take into account
that different families of protein folds are not equally populated
(Orengo et al., 1994). It is worth noting that the size of the foldon
universe obtained in the present work is somewhat smaller but of the
same order as the size of the exon universe (1000 to 7000). This argues
in favor of the idea that exons and foldons were once related, although
in some cases the same foldon may be encoded by different exons as a
result of a convergent evolution process.

Figure 1. Distribution of Q-scores for two types of alignment: genetic
alignment based on the Smith-Waterman alignment algorithm with the
Dayhoff scoring matrix (shaded boxes), and sequence-structure alignment
with the original energy functions (open boxes). Similar results were
obtained using the self-consistently optimized energy function. Inset:
right tail of the same distributions on an enlarged scale. The sequence
of each foldon was aligned to the sequence and structure of all foldons
from the data set except the target foldon.
Is the current foldon data set complete enough
for structure prediction?
Several attempts have been made to use infor-
mation about the structure of protein fragments in
order to guide the conformational search pro-
cedures for complete native structures (Bowie &
Eisenberg, 1994; Srinivasan & Rose, 1995). Predic-
tion by recognition of known structures of protein
fragments based on sequence-structure alignment
algorithms (Bowie et al., 1991; Jones et al., 1992)
looks for short fragments compatible with the seg-
ment of sequence to be folded. A set of fragments
with optimized sequence-structure relationships
would therefore be of great use in solving this pro-
blem.
In order to check whether our foldon data set can be used for protein
structure prediction, we threaded the sequence of several proteins
through the structures of foldons from our data set. We then chose those
foldons which have the highest Q-score with the native structure of the
target protein. As a result about two-thirds of the sequence of most of
the target proteins was covered with overlapping foldons having
Q > 0.42, whereas the remaining part of the sequence remained
structurally uncovered. The final model structure of the target sequence
can be obtained by using the modelling technique implemented in QUANTA
as described by Sali & Blundell (1993). According to this algorithm,
comparative modelling proceeds by satisfaction of spatial restraints on
the unknown sequence. The spatial restraints are obtained by multiple
alignment of the target sequence with overlapping foldons with Q > 0.42
and are defined in terms of a probability density function. Optimization
of the probability density function is implemented in the program
MODELLER. We applied MODELLER to rebuild the structure of phage 434 Cro
protein. We chose this protein since it has very low sequence identity
even with its closest homologs. The model backbone structure for the
major part of the sequence of 434 Cro protein (residues 27 to 65),
obtained by alignment of its sequence with the structures of seven
foldons from our data set, is presented in Figure 3. The model structure
obtained has an r.m.s. error of 4.8 Å with respect to the native
structure, and the Q-score is equal to 0.37. The model structure is not
as accurate as an X-ray structure (the resolution for 434 Cro protein is
2.35 Å) but the r.m.s. is within the range of r.m.s. errors for
structural alignments between homologous proteins. The energy of the
model structure calculated using the original Hamiltonian is comparable
to the energy of the region 27 to 65 of the native structure, being
higher by 15%. The largest r.m.s. value is observed in the region of
residues 42 to 44, which corresponds to the bend/β-turn and is difficult
to rebuild.
Local-global consistency as a route to minimal
frustration and the applicability of principle of
minimal frustration to individual foldons
If the proteins evolved in the direction of maxi-
mizing the relative stability of the native confor-
mations and fast foldability of the protein
sequences, then threading the different sequences
from the sequence space through the target struc-
ture would nd the best matching sequence for the
given structure. To be consistent, the threading
should be based on the energy functions optimized
to maximize the rate of folding. In this sense the
threading procedure allows us to check how fru-
strated proteins and protein fragments are.
Figure 2. Examples of foldons with different Q-values. (a) Foldons of
γ-II crystallin with Q > 0.42 and r.m.s. 2.82 Å on Cα atoms. These
foldons, as well as the corresponding exons, seem to be a result of
duplication in the course of evolution. (b) Foldons from 434 Cro protein
and myoglobin with averaged Q = 0.31 and r.m.s. 3.6 Å.

Sequences of most proteins do in fact represent the best matching
sequences for the native conformations, since these sequences recognize
their native structure as the lowest energy conformation among the large
number of alternative alignments (Bryant & Lawrence, 1993; Goldstein
et al., 1992b; Koretke et al., 1996). In the case of foldons or protein
fragments it is not clear how many of them would find their native
structure and sequence as the lowest energy alignment in the threading
procedure.
To find foldons that recognize their own sequence and structure, the
sequence of each foldon was translated along the scaffolds from 100
different protein structures. The sequences of these proteins were then
threaded through the structure of the target foldon. Altogether, about
5000 alternative structures were generated in both cases. This procedure
was repeated for each of the 190 foldons from the data set. Q-scores
between target and template structures were estimated according to
equation (3) and the lowest energy alignment was taken to be the best
one.
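The gapless threading step described above can be sketched as a sliding-window search. This is not the energy function used in the paper: the scoring callable below is a toy stand-in, the scaffolds are represented as strings of made-up labels, and all names are hypothetical. The sketch only shows how translating a sequence along a scaffold without gaps enumerates the alternative structures and how the lowest-energy placement is selected.

```python
def gapless_placements(sequence_length, scaffold_length):
    """Start offsets at which a sequence fits on a scaffold with no gaps."""
    return range(max(scaffold_length - sequence_length + 1, 0))


def best_gapless_alignment(sequence, scaffolds, energy):
    """Return (energy, scaffold_name, offset) of the lowest-energy gapless placement."""
    best = None
    for name, scaffold in scaffolds.items():
        for offset in gapless_placements(len(sequence), len(scaffold)):
            window = scaffold[offset:offset + len(sequence)]
            score = energy(sequence, window)
            if best is None or score < best[0]:
                best = (score, name, offset)
    return best


def toy_energy(sequence, window):
    # Hypothetical stand-in: -1 for every position where the labels agree.
    return -sum(1 for s, w in zip(sequence, window) if s == w)


toy_scaffolds = {"p1": "AAGHKLAA", "p2": "GHKAAAAA"}
best = best_gapless_alignment("GHKL", toy_scaffolds, toy_energy)
print(best)
```

With 100 scaffolds, a few dozen placements per scaffold would be enough to reach the several thousand alternative structures mentioned in the text.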
A protein molecule can be found in different conformations depending on
the functional state, temperature and pH conditions. We showed that
X-ray structures of sperm whale myoglobin obtained under different
states of the heme iron have a native energy difference not exceeding
10%. If the energy of the native conformation of a target foldon is
lower than or within 10% of the energy of the best alignment of the
target foldon sequence with the template structures, this indicates that
the foldon would recognize its own structure as the best choice. In this
case the corresponding Q-score between the native structure of the
foldon and the structure of the best alignment is equal to 1.0. On the
other hand, if the best alignment of sequences of template foldons with
the structure of the target foldon has a larger energy than the native
state of the target foldon, we believe the target foldon recognizes its
own sequence. A foldon recognizing the sequence or structure of a
structurally similar foldon would have a Q-score very close but not
equal to 1.0. We use Q = 0.9 as an upper bound for self-recognition.
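The self-recognition rules above can be written down as a small sketch. The 10% tolerance and the Q thresholds of 0.9 and 0.5 are the values quoted in the text and in the caption of Figure 4; measuring the tolerance relative to the magnitude of the best threading energy is an assumption, and the function names are hypothetical.

```python
def recognizes_own_structure(native_energy, best_template_energy, tolerance=0.10):
    """Native energy is lower than, or within `tolerance` of, the best threading energy."""
    if native_energy <= best_template_energy:
        return True
    # Assumption: the tolerance is taken relative to the magnitude of the best energy.
    return native_energy - best_template_energy <= tolerance * abs(best_template_energy)


def recognizes_own_sequence(native_energy, best_foreign_sequence_energy):
    """No foreign sequence placed on the foldon structure has a lower energy than the native one."""
    return best_foreign_sequence_energy > native_energy


def foldon_group(q_structure, q_sequence, high=0.9, low=0.5):
    """Group 1, 2 or 3 from the average of the two self-recognition Q-scores."""
    q_average = (q_structure + q_sequence) / 2.0
    if q_average > high:
        return 1
    if q_average >= low:
        return 2
    return 3


print(recognizes_own_structure(-100.0, -105.0))
print(foldon_group(1.0, 0.4))
```

The tolerance reflects the observation that different X-ray structures of the same protein differ in native energy by up to about 10%, so a smaller difference is not treated as a failure of recognition.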
As a result of the current sequence-structure threading procedure, 41%
of all foldons recognize their own structure while 11% of foldons
recognize their own sequence. About 50% of all foldons from the data set
pick up neither their sequence nor their structure as the lowest energy
alignment. Thus, if we limit ourselves to considering only individually
non-frustrated foldons then the size of the existing foldon database
would be reduced by half.
Figure 3. The model structure of region 27 to 65 of 434 Cro protein as a
result of threading the sequence of the protein through the structures
of seven overlapping foldons from six proteins with the following PDB
codes: 2TRX, 1MBA, 3ADK, 3PTN, 5ABP and 5TLN. These foldons have a
Q-score ≥ 0.42 with the native regions of 434 Cro protein. The final
model structure of the target sequence was obtained using the MODELLER
modelling technique implemented in QUANTA. The overall Q-score between
the model structure and the same region of the native protein is equal
to 0.37 and the r.m.s. value is 4.8 Å.

Figure 4. Distributions of the relative foldability (Θ) for the foldons
with different Q values. The distribution of Θ of foldons from the first
group with the average Q > 0.9 is shown with the shaded area. The
distribution corresponding to Q < 0.5 is indicated by the empty box. The
maxima of the distributions are considerably shifted with respect to
each other along the Θ-axis. The distribution of Θ for the foldons from
the second group with 0.5 ≤ Q ≤ 0.9 is centered between the
above-mentioned distributions and is not shown for visual clarity. The
average Q-value is calculated as the average between two Q-scores. The
first Q-score is estimated between the native structure of the target
foldon and the structure of the best alignment of the sequence of the
target foldon with alternative structures. The second value of the
Q-score is obtained as a result of alignment of the different unique
sub-sequences with the structure of the target foldon.

Table 1. List of the proteins used along with their foldon junctions

PDB file Protein name                                       Length  Foldon junctions
2ovo     Ovomucoid third domain                             56      56c
4ptiz    Trypsin inhibitor                                  58      58b
2cro(a)  434 Cro protein                                    65      25c 65c
2ci2(a)  Chymotrypsin inhibitor 2                           65      65c
1ubq     Ubiquitin                                          76      31c 76c
3fxc     Ferredoxin                                         98      37c 98c
5pcy     Plastocyanin                                       99      34c 99b
1wrp(a)  Trp repressor                                      102     102b
1cyc     Ferrocytochrome (C)                                103     34 55 76 103
256b     Cytochrome B562, chain A                           106     34c 70 106c
2ssi(a)  Subtilisin inhibitor                               107     28 46 107c
1rei     Bence-Jones protein, chain A                       107     22 46c 73 107
2trx     Thioredoxin, chain A                               108     25c 79c 108
5cpv     Parvalbumin B                                      108     35c 59c 108
1hmq     Methemerythrin, chain A                            113     28c 55 113c
1bp2     Phospholipase A2                                   123     25 46 123
1rbb     Ribonuclease B, chain A                            124     32 54 73 100 124
2lyz     Lysozyme                                           129     28c 55c 103b 129
2aza     Azurin, chain A                                    129     31c 112c 129
1snc(a)  Staphylococcal nuclease                            135     22 52 88 135
3fxn     Flavodoxin                                         138     97b 138b
1mba     Myoglobin                                          146     28 64c 91 146
2sod     Superoxide dismutase, chain B                      151     28c 82c 151c
2i1b     Interleukin                                        153     43 100c 121c 153
4tnc(a)  Troponin C                                         160     22c 58c 160b
4gcr     γII-Crystallin                                     174     37b 82b 124c 148c 174
2ltn     Lectin, chain A                                    181     46 85 109 181c
8dfr     Dihydrofolate reductase                            186     28 121b 163 186
3adk     Adenylate kinase                                   194     37b 76c 115c 136 172c 194c
2act(a)  Actinidin                                          218     127b 196b 218
3ptn     Trypsin                                            223     106b 142 163c 178 223
3cna     Concanavalin A                                     237     85c 142 175 193 237
1tim     Triose-phosphate isomerase, chain A                247     22 61 88 124 163c 202 247c
3hla     Human class I histocompatibility antigen, chain A  270     37 61 82 187b 217 247 270
2prk     Proteinase K                                       279     31 100c 133 163 199 226 259c 279
5abp     L-Arabinose binding protein                        306     52c 106c 130c 172c 199 229c 306c
5cpa     Carboxypeptidase A                                 307     22c 55c 112c 166c 193 226 271 307
8atc     Aspartate carbamoyltransferase, chain A            310     70 115 157 184 211 250 277 310
3tln     Thermolysin                                        316     55 88 127 235 265 292 316
1pfk     Phosphofructokinase, chain A                       320     22 52 91 121c 139 166c 202 235 256 319
1cms     Chymosin B                                         323     40c 88c 169b 214 247c 268 307 323
2liv     Leu/Ile/Val binding protein                        344     49c 82c 115 169c 199 259c 344c
8adh     Apo-liver alcohol dehydrogenase                    374     28 82 106 142 169 193 214 241c 265 352 374c

Right terminus boundary of each foldon is indicated as an amino acid
number of the C terminus. Subsequent foldon starts at the next amino
acid and proceeds till the terminus.
a Proteins whose X-ray structures lack the residues at the beginning or
at the end of the polypeptide chain. Effective length of the part of the
protein with the determined structure is given. Foldon junctions were
calculated with respect to the beginning and end of X-ray determined
protein structures.
b Boundaries of foldons from the first group which recognize their own
sequence and structure.
c Boundaries of foldons which recognize only their structure.

We found that all foldons can be clustered into three groups with
respect to average Q-scores and foldability Θ (Figure 4). The first
group contains foldons exhibiting some characteristics of the whole
protein, such as the ability to fold rapidly and to recognize their own
sequence and structure. These least frustrated foldons contain largely
long-range interactions and describe turns and super-secondary
structures. This group is characterized by high Θ values and by Q-scores
larger than 0.9. Boundaries of foldons from the first group are
indicated by an asterisk in Table 1. As can be seen from Figure 5, the
energy gap between the native and first excited state for most foldons
from this group is rather large, indicating their low sequence and
structure degeneracy. Figure 5 shows the energy distribution of the
alternative structures for the third lysozyme foldon. The sequence of
this foldon fits its own structure rather well since there is a big gap
between the native energy and the first excited state. However, we found
that the sequence of the fragment 52 to 99 from α-lactalbumin fits the
structure of this foldon even better than the structure of lysozyme.
These two proteins are very similar in structure and the corresponding
Q-score is very close to 1. Foldons from the second group recognize
either their own sequence or structure, and some foldons from this group
pick up the sequence/structure of rather similar foldons as the best
choice. A third group comprises highly frustrated foldons with low Θ
values, which recognize neither their sequence nor their structure
(Figure 4). It is interesting to note that the Q-score can serve as a
measure of the fidelity with which foldons recognize themselves and is
correlated with their relative foldability Θ. In other words, sequences
which can recognize themselves in a threading algorithm have a rather
smooth energy landscape with a deep funnel leading to the ground state.
Discussion
The key notion examined here is that proteins can be decomposed into
autonomously folding units which can evolve independently and serve as
building blocks in the construction of protein structures. In order to
estimate the number of structurally different foldons in the protein
data base, we performed sequence-sequence and sequence-structure
alignments between foldons, and obtained about 2600 foldons in the
representative data set. The size of the representative data set is
rather sensitive to the cut-off value for structural similarity. Values
of the Q-score needed for assigning structural identity cannot be
precisely determined, and Q = 0.42 may be considered as a lower limit
for the cut-off value, indicating that the size of the foldon universe
would probably exceed 2600 foldons. On the other hand, according to
Chothia (1994), only a small fraction of existing proteins is
represented in the Protein Data Bank. Assuming there is no strong bias
in the data banks and the number of proteins with duplicated foldons is
not very large, one can obtain approximately 1000 different protein
families. Multiplying this number by the average number of different
foldons per protein (4.4 ± 2.5) we obtain 4400 different foldon classes.
These estimates suggest that our present foldon data set, comprising 183
types of structurally different foldons, represents only a small
fraction of the underlying foldon universe. In this sense many families
of foldons are not yet represented in our data set. Classifications of
α-turn-α motifs also show that even for fragments more than 25 residues
long, analysis yields very weakly populated families (Wintjens et al.,
1996).
We have shown that the backbones of the major
parts of several proteins can be reproduced with
an r.m.s. error of about 5 A

using foldon matching


modelling. The remaining parts of the backbone
can be rebuilt using a conventional homology
modelling technique. It was demonstrated earlier
that using a set of short fragments (3 to 15 resi-
dues) one can rebuild the backbone of some pro-
teins with a comparable accuracy to the foldon
matching modelling (Rooman et al., 1991; Simon
et al., 1991; Levitt, 1992; Unger & Sussman, 1993;
Srinivasan & Rose, 1995). However, short segments
are stabilized mostly by local interactions and the
recognition signals are statistically insufficient for
the prediction of protein structures from the amino
acid sequences. Long-range interactions in the
native structure can be largely destabilizing and
distort the structures determined solely by the
local propensities. Therefore, for the purpose of
tertiary structure prediction one should use a set of
fragments containing both local and long-range
interactions. Accuracy of protein structure prediction
by foldon matching has been limited so far by
the small size of foldon data sets and would be
considerably improved if the procedure of foldon
identification were more effective. The method
of prediction by foldon recognition would inevitably
face the problem of modelling the interface
between foldons, which comprises mostly long-range
interactions. Our preliminary results indicate that
interactions between some of the foldons are still
rather strong and several foldons should often be
considered as a unit. For example, the energy of
interaction between the third and fourth foldons of
sea hare myoglobin, which comprise most of
helices E, F, G and H, is large compared to the pair-wise
interactions between other foldons of myoglobin,
and these two foldons can serve as a single
folding unit in the folding process.

Figure 5. Energy distributions of alternative structures
of the lysozyme foldon (56 to 103) from group I.
Alternative structures were generated in two different
ways. In the first case (continuous line), energies were
calculated by threading the sequence of the target foldon
through the structures of 100 proteins from
different classes. In the second case (broken line),
sequences of different proteins were assigned to the
structure of the foldon. Altogether 5200 alternative
models were generated. The energy of the native structure
of the lysozyme foldon (56 to 103) is indicated by
the arrow. The large energy gap observed here is typical
for the foldons from the first group.

We must also point out that predicting the kin-
etic order of association of foldons may well be
complicated on the basis of the present energetic
considerations alone. The present algorithm takes
into account directly the discrimination problem be-
tween relatively compact structures, but it does not
include semi-compact or extended structures. The
proximity to the collapse transition certainly plays a
role in the fast kinetics (Abkevich et al., 1994; Socci
& Onuchic, 1994; Klimov & Thirumalai, 1996).
Analysis of thermodynamic and kinetic proper-
ties of the foldons demonstrates that division of
proteins into segments, even in an optimal way,
does not always produce stable non-frustrated
units. We found that only for flavodoxin, characterized
by a high packing density, both foldons
mostly contain consistent interactions and are by
themselves minimally frustrated. Many other pro-
teins contain only frustrated foldons, even if the
entire protein recognizes itself in the threading al-
gorithm. At the same time it has been shown that
mean force potentials compiled from the data base
of known structures are not able to discriminate
between the native and alternative states for pro-
teins with large prosthetic groups and for lipophi-
lic proteins (Hendlich et al., 1990; Huang et al.,
1995). Moreover, lattice calculations show that sequences
designed to fit a given compact structure
are much better optimized for their target struc-
tures than a protein sequence is optimized for its
native conformation (Hinds & Levitt, 1996).
From our analysis we also see that the native en-
ergies of foldons scale roughly with size. The linear
correlation coefficient is equal to 0.7, which reflects
the significant scatter of the data points at large N
(Figure 6). It is not clear that actual free energies
should scale linearly with length even for whole
proteins. If all the interactions were stabilizing, one-body
terms would scale linearly with size and the
pair-wise interactions would scale linearly with the
number of contacts. If proteins or foldons contain a
large fraction of destabilizing interactions, the cor-
relation between the actual energies and the num-
ber of residues would not be very strong. On the
other hand, this may indicate that empirical energy
functions and threading procedures need to be
improved. We are currently developing algorithms
to improve energy functions via explicit inclusion
of excluded volume and packing density terms.
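The correlation analysis described above can be illustrated with a short script. This is a minimal sketch only: the data are synthetic, the function name `pearson_r` is introduced here, and the sign convention (more negative energy for longer foldons) together with the noise model are assumptions chosen to mimic scatter that grows with N; none of the numbers are taken from our data set.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: 190 foldon lengths and native energies whose scatter
# grows with length (synthetic values, for illustration only).
random.seed(0)
lengths = [random.randint(20, 120) for _ in range(190)]
energies = [-0.5 * n + random.gauss(0.0, 0.15 * n) for n in lengths]

print(f"r = {pearson_r(lengths, energies):.2f}")
```

With this sign convention the coefficient is negative, so it is its magnitude that would be compared with the value quoted in the text.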
Methods
Protein data set and backbone representation
The proteins used in our work represent a wide
range of different structural and functional classes
and their coordinates are obtained from the
Brookhaven Protein Data Bank (Bernstein et al.,
1977). Protein names together with the foldon junctions
are listed in Table 1. This table contains 43
proteins determined with a resolution better than
2.6 Å and 190 foldon junctions. Homologous proteins
and minor modifications of the same protein
are not considered and the final protein data base
comprises proteins with r.m.s. error of more than
5 Å and sequence similarity exceeding 30%.
We used a reduced backbone representation,
including Cα atoms of the main chain and Cβ atoms of
the side-chains. The secondary structure assignments
were computed via the DSSP procedure
(Kabsch & Sander, 1983). Solvent accessibility was
calculated using the DMS algorithm from the
MIDAS package, where the molecular surface was defined
according to the criterion described by
Richards (1977).
Optimized energy functions
The empirical Hamiltonian used in the threading
procedure of the sequence-structure alignment is
consistent with the energy function used for foldon
definition and is comprised of profile and tertiary
contact terms (Goldstein et al., 1992b). The profile
term is a pseudo-one-body energy function:
$h_p = \sum_i \gamma_p(A_i, C_i)$, where $\gamma_p$ is defined by the identity
of the residue, $A_i$, and its context, $C_i$, in the
protein, i.e. surface accessibility (inside or outside)
and secondary structure (helix, sheet, turn, or coil).

Figure 6. The energy of the native state plotted against
length for each of the 190 foldons from the data set.

A profile Hamiltonian does not describe many
of the specific two-body interactions that give
protein structures their specificity. These are
treated by a contact Hamiltonian:
$h_c = \sum_{i<j} \sum_{k=1}^{2} \gamma_c(A_i, A_j, k)\, u(r_c^k - r_{ij})$,
where $u$ is a step function with
cut-off distance $r_c^k$ ($r_c^1 = 5$ Å, $r_c^2 = 12$ Å). Contact distances
are estimated between Cβ atoms, while for
amino acid Gly coordinates of Cα atoms have been
used. The parameters γ are optimized in such a
way as to maximize the ratio ΔE/δE.
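As an illustration of how the profile and contact terms combine, the sketch below evaluates both for a toy chain. It is not the optimized Hamiltonian: the function names, the dictionary layout of the parameters, the dependence of the contact parameters on the distance class k, and all numerical values are assumptions made for this example. Coordinates are taken to be the Cβ positions (Cα for Gly) already extracted from the structure.

```python
import math

CUTOFFS = (5.0, 12.0)  # r_c^1 and r_c^2 in angstroms, as quoted in the text

def step(x):
    """Step function u(x): 1 for x > 0, otherwise 0."""
    return 1.0 if x > 0 else 0.0

def profile_energy(seq, contexts, gamma_p):
    """One-body term: sum over residues of gamma_p(A_i, C_i)."""
    return sum(gamma_p[(a, c)] for a, c in zip(seq, contexts))

def contact_energy(seq, coords, gamma_c):
    """Two-body term: sum over pairs i < j and distance classes k."""
    total = 0.0
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            r_ij = math.dist(coords[i], coords[j])
            for k, r_c in enumerate(CUTOFFS, start=1):
                total += gamma_c[(seq[i], seq[j], k)] * step(r_c - r_ij)
    return total

# Toy three-residue chain with hypothetical parameters.
seq = "GAV"
contexts = ["coil/outside", "helix/inside", "helix/inside"]
coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
gamma_p = {(a, c): -0.1 for a in seq for c in set(contexts)}
gamma_c = {(a, b, k): -1.0 / k for a in seq for b in seq for k in (1, 2)}

h_p = profile_energy(seq, contexts, gamma_p)
h_c = contact_energy(seq, coords, gamma_c)
print(f"h_p = {h_p:.2f}, h_c = {h_c:.2f}, total = {h_p + h_c:.2f}")
```

Because the step functions are nested, a pair closer than 5 Å contributes to both distance classes in this sketch.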
We used two different energy functions which
differed from each other by the type of optimiz-
ation. The original energy function was optimized
to maximize the stability gap in units of the stan-
dard deviation of the molten globule distribution.
Misfolded conformations were generated by
threading the target sequence through the confor-
mations of structurally different proteins
(Goldstein et al., 1992b). Self-consistently optimized
energy functions took into account the partial
order of misfolded states (Koretke et al., 1996). In
this case the stability gap represents the energy
difference between the folded conformation and
thermally occupied minima in the ensemble of
the molten globule conformations. Partially folded
structures were generated by alignment of
trial sequences against known structures. The
Hamiltonian with self-consistently optimized γs
also included a term based on statistics of back-
bone hydrogen bonds.
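The quantity targeted by the original optimization, the stability gap measured in units of the standard deviation of the misfolded-state energies, can be sketched as follows. The function name and the decoy energies are hypothetical, and taking the mean of the decoy energies as the reference level is an assumption that corresponds to the original scheme rather than the self-consistent one, which refers to thermally occupied minima instead.

```python
import statistics

def stability_gap_in_sigma(native_energy, decoy_energies):
    """Gap between the mean decoy energy and the native energy,
    divided by the standard deviation of the decoy energies."""
    mean_decoy = statistics.fmean(decoy_energies)
    sigma = statistics.pstdev(decoy_energies)
    return (mean_decoy - native_energy) / sigma

# Hypothetical energies of misfolded (threaded) conformations and of the native state.
decoys = [-10.0, -12.0, -8.0, -11.0, -9.0]
gap = stability_gap_in_sigma(-20.0, decoys)
print(f"stability gap = {gap:.2f} standard deviations")
```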
Alignment procedures
To find the best match between sequences, we
applied the standard sequence homology technique
based on the Smith-Waterman alignment algorithm
with a Dayhoff scoring matrix (Waterman et al.,
1976). Sequence-structure alignments were based
on the empirical Hamiltonian and implemented in
the mean-field fashion (Goldstein et al., 1993). We
do not allow for gaps in the sequences and struc-
tures of foldons and assume that insertions and
deletions occur between independently folding
units. In order to identify the sub-sequence compa-
tible with a given structure, we searched for the
lowest energy alignment using the Needleman-
Wunsch algorithm (Needleman & Wunsch, 1970)
with an iterative matrix method of calculation. The
elements of the scoring matrix $H_{ii'}$ represent the
energy contribution of amino acid $A_i$ of target
sequence A embedded into location $i'$ of template
structure B, the so-called ``frozen approximation''
(Godzik et al., 1992).
Using the scoring matrix H:
$$H_{ii'} = \gamma(A_i, C^B_{i'}) + \sum_{k' \neq i'} \sum_{j=1}^{2} \gamma(A_i, B_{k'})\, u(r_c^j - r_{i'k'}) \qquad (1)$$
we obtain an initial alignment which maps the residue
$k$ of sequence A to the residue $k'$ of sequence B
($P(k) = k'$). Then, the updated score representing
the energy contribution of residue $A_i$ in a new
environment can be written as:
$$H_{ii'} = \gamma(A_i, C^B_{i'}) + \sum_{k' \neq i'} \sum_{j=1}^{2} \gamma(A_i, A_k)\, u(r_c^j - r_{i'k'}) \qquad (2)$$
The procedure of alignment and calculation of subsequent
scoring matrices was repeated and converged
within five iterations in most cases.
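The iteration between equations (1) and (2) can be sketched in a few functions. This is an illustrative outline under stated assumptions, not our implementation: the function names, the parameter dictionaries `g_prof` and `g_pair`, the uniform gap penalty and the treatment of unaligned template sites are all hypothetical choices, and residues and contexts are represented by plain strings.

```python
def n_shells(r, cutoffs=(5.0, 12.0)):
    """Sum over j of u(r_c^j - r): number of cut-off shells containing r."""
    return sum(1 for rc in cutoffs if r < rc)

def scoring_matrix(seq_a, seq_b, ctx_b, dist_b, g_prof, g_pair, mapping=None):
    """H[i][i'] for target residue A_i placed at template site i'.

    mapping=None gives eq. (1): partners are the template residues B_k'.
    Otherwise eq. (2): partners are the target residues aligned to k'.
    """
    inverse = {} if mapping is None else {kp: k for k, kp in mapping.items()}
    H = []
    for a in seq_a:
        row = []
        for ip in range(len(seq_b)):
            e = g_prof[(a, ctx_b[ip])]
            for kp in range(len(seq_b)):
                if kp == ip:
                    continue
                if mapping is None:
                    partner = seq_b[kp]
                elif kp in inverse:
                    partner = seq_a[inverse[kp]]
                else:
                    continue  # assumption: unaligned sites contribute nothing
                e += g_pair[(a, partner)] * n_shells(dist_b[ip][kp])
            row.append(e)
        H.append(row)
    return H

def align(H, gap=1.0):
    """Needleman-Wunsch global alignment minimizing energy; returns {i: i'}."""
    n, m = len(H), len(H[0])
    D = [[gap * (i + j) if i == 0 or j == 0 else 0.0 for j in range(m + 1)]
         for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + H[i - 1][j - 1],
                          D[i - 1][j] + gap, D[i][j - 1] + gap)
    mapping, i, j = {}, n, m
    while i > 0 and j > 0:  # trace back from the corner
        if D[i][j] == D[i - 1][j - 1] + H[i - 1][j - 1]:
            mapping[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return mapping

def iterative_alignment(seq_a, seq_b, ctx_b, dist_b, g_prof, g_pair, max_iter=5):
    """Alternate scoring-matrix updates and alignments until the mapping is stable."""
    mapping = None
    for _ in range(max_iter):
        H = scoring_matrix(seq_a, seq_b, ctx_b, dist_b, g_prof, g_pair, mapping)
        new_mapping = align(H)
        if new_mapping == mapping:
            break
        mapping = new_mapping
    return mapping
```

The first pass uses the template's own residues as interaction partners (the frozen approximation); later passes replace them with the target residues assigned by the previous alignment, and the loop stops once the mapping no longer changes or after `max_iter` passes.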
Criteria of structural similarity
The Q-score, which measures contact pattern
similarity, was used to determine the degree of
similarity between the structures of target and tem-
plate foldons (Goldstein et al., 1992a). The Q-score
is calculated using a Gaussian function of the inter-residue
distance centered at zero with standard
deviation of $|j - i|^{0.15}$ Å and normalized by the
number of non-bonded contacts:
$$Q_{AB} = \left[\frac{(N_A - 1)(N_A - 2)}{2}\right]^{-1} \sum_{i \neq j} \exp\left[-\frac{(r^A_{ij} - r^B_{i'j'})^2}{2\,|j - i|^{0.3}}\right] \qquad (3)$$
Here, $r^A_{ij}$ and $r^B_{i'j'}$ are the distances between Cβ atoms of
residues $i$ and $j$ of protein A and aligned residues $i'$ and $j'$
of protein B, respectively; $N_A$ is the length of the
target sequence.
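A direct transcription of the Q-score into code may help to make the normalization concrete. In this sketch the two structures are assumed to be already aligned residue by residue and supplied as Cβ distance matrices; the function name is introduced here, and the sum is taken over non-bonded pairs with j > i + 1, an assumption chosen so that the stated normalization gives Q = 1 for identical structures.

```python
import math

def q_score(dist_a, dist_b):
    """Q-score between two aligned structures from their distance matrices.

    dist_a[i][j] and dist_b[i][j] are distances (in angstroms) for the same
    aligned residue pair in structures A and B.
    """
    n = len(dist_a)
    total = 0.0
    for i in range(n):
        for j in range(i + 2, n):            # non-bonded pairs only
            sigma_sq = abs(j - i) ** 0.3     # (|j - i|^0.15)^2
            diff = dist_a[i][j] - dist_b[i][j]
            total += math.exp(-diff * diff / (2.0 * sigma_sq))
    return total / ((n - 1) * (n - 2) / 2.0)

# Toy example: an extended four-residue chain compared with itself and
# with a uniformly stretched copy (hypothetical geometry).
d = [[abs(i - j) * 3.8 for j in range(4)] for i in range(4)]
d_stretched = [[x * 1.1 for x in row] for row in d]
print(q_score(d, d), q_score(d, d_stretched))
```

The sequence-separation dependence of the width makes the score more tolerant of distance deviations between residues that are far apart along the chain.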
Determination of the foldon junctions
Foldon boundaries were determined according
to the following criterion (Panchenko et al., 1996):
the polypeptide chain was cleaved after some residue
$j$ and the average Θ-value of the N-terminal
(from the first residue to residue $j$) and the C-terminal
(from residue $j$ to the last residue) segments was
computed using $\Theta_j = (\Theta_{N,j} + \Theta_{j,C})/2$. The cleavage
point was then moved along the chain, and the
position of the first local maximum of Θ located
the boundary of the first foldon. The cleavage procedure
was repeated, with each subsequent foldon
beginning where the previous one was assumed to
end. We believe that the data-based energy functions
used here are close enough to actual free energies
that the calculated foldons are good
first approximations to the physico-chemical folding
units.
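The cleavage scan described above can be outlined as follows. This is a sketch under stated assumptions: `theta(start, end)` is a caller-supplied, hypothetical function returning the foldability of the segment from `start` up to but not including `end`, and `min_len` is an assumed minimum segment length that does not come from our procedure.

```python
def first_local_maximum(values):
    """Index of the first interior local maximum of a list, or None."""
    for k in range(1, len(values) - 1):
        if values[k - 1] < values[k] >= values[k + 1]:
            return k
    return None

def foldon_junctions(n_res, theta, min_len=10):
    """Return cleavage positions that split residues 0..n_res-1 into foldons.

    For each trial cleavage point j, the foldabilities of the N-terminal
    segment [start, j) and the C-terminal segment [j, n_res) are averaged;
    the first local maximum of this average is taken as the junction, and
    the scan restarts from there.
    """
    junctions, start = [], 0
    while n_res - start >= 2 * min_len:
        cuts = list(range(start + min_len, n_res - min_len + 1))
        avg = [(theta(start, j) + theta(j, n_res)) / 2.0 for j in cuts]
        k = first_local_maximum(avg)
        if k is None:
            break
        junctions.append(cuts[k])
        start = cuts[k]
    return junctions

# Toy foldability that favours 30-residue segments (purely illustrative).
def toy_theta(start, end):
    return -abs((end - start) - 30)

print(foldon_junctions(60, toy_theta))
```

In practice `theta` would wrap a threading calculation for each trial segment, which is the expensive step; the scan itself is linear in the number of cleavage points per foldon.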
Acknowledgements
A.P. thanks A. Akmal for the careful reading of the
manuscript. Our work was supported by National Insti-
tutes of Health grant PHS R01 GM44557.
References
Abkevich, V. I., Gutin, A. M. & Shakhnovich, E. I.
(1994). Specific nucleus as the transition state for
protein folding: an evidence from the lattice model.
Biochemistry, 33, 10026–10036.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer,
E. F., Jr, Brice, M. D., Rodgers, J. R., Kennard, O.,
Shimanouchi, T. & Tasumi, M. (1977). The Protein
Data Bank: a computer-based archival file for
macromolecular structures. J. Mol. Biol. 112, 535–542.
Bowie, J. U. & Eisenberg, D. (1994). An evolutionary
approach to folding small α-helical proteins that
uses sequence information and an empirical guiding
fitness function. Proc. Natl Acad. Sci. USA, 91, 4436–4440.
Bowie, J. U., Luthy, R. & Eisenberg, D. (1991). A method
to identify protein sequences that fold into a known
three-dimensional structure. Science, 253, 164–170.
Bryant, S. H. & Lawrence, C. E. (1993). An empirical
energy function for threading protein sequence
through the folding motif. Proteins: Struct. Funct.
Genet. 16, 92–112.
Bryngelson, J. D. & Wolynes, P. G. (1987). Spin glasses
and the statistical mechanics of protein folding.
Proc. Natl Acad. Sci. USA, 84, 7524–7528.
Bryngelson, J. D. & Wolynes, P. G. (1989). Intermediates
and barrier crossing in a random energy model
(with applications to protein folding). J. Phys. Chem.
93 (19), 6902–6915.
Bryngelson, J. D., Onuchic, J. N., Socci, N. D. &
Wolynes, P. G. (1995). Funnels, pathways and the
energy landscape of protein folding: a synthesis.
Proteins: Struct. Funct. Genet. 21, 167–195.
Chothia, C. (1994). Protein families in the metazoan
genome. Development, S, 27–33.
DeBolt, S. E. & Skolnick, J. (1996). Evaluation of atomic
level mean force potentials via inverse folding and
inverse refinement of protein structures: atomic burial
position and pair-wise non-bonded interactions.
Protein Eng. 9, 637–655.
Fischer, D., Tsai, C.-J., Nussinov, R. & Wolfson, H.
(1995). A 3-D sequence-independent representation
of the protein data bank. Protein Eng. 8, 981–997.
Gay, G. P., Ruiz-Sanz, J., Neira, J. L., Itzhaki, L. S. &
Fersht, A. R. (1995). Folding of a nascent polypeptide
chain in vitro: cooperative formation of structure
in a protein module. Proc. Natl Acad. Sci. USA,
92, 3683–3686.
Go, N. (1983). Theoretical studies of protein folding.
Annu. Rev. Biophys. Bioeng. 12, 183–210.
Godzik, A., Skolnick, J. & Kolinski, A. (1992). Topology
fingerprint approach to the inverse protein folding
problem. J. Mol. Biol. 227, 227–238.
Goldstein, R. A., Luthey-Schulten, Z. A. & Wolynes,
P. G. (1992a). Optimal protein-folding codes from
spin-glass theory. Proc. Natl Acad. Sci. USA, 89,
4918–4922.
Goldstein, R. A., Luthey-Schulten, Z. A. & Wolynes,
P. G. (1992b). Protein tertiary structure recognition
using optimized Hamiltonians with local
interactions. Proc. Natl Acad. Sci. USA, 89, 9029–9033.
Goldstein, R. A., Luthey-Schulten, Z. A. & Wolynes,
P. G. (1993). Protein tertiary structure recognition
using optimized associative memory Hamiltonians.
In Proc. 26th Annual Hawaii International Conference
on System Sciences (Mudge, T. N., Milutinovic, V. &
Hunter, L., eds), vol. 1, pp. 699–707, IEEE Computer
Society Press.
Griko, Y. V., Rogov, V. V. & Privalov, P. L. (1992).
Domains in λ cro repressor: a calorimetric study.
Biochemistry, 31, 12701–12705.
Hendlich, M., Lackner, P., Weitchkus, S., Floeckner, H.,
Froschauer, R., Gottsbacher, K., Casari, G. & Sippl,
M. J. (1990). Identification of native protein folds
amongst a large number of incorrect models: the
calculation of low energy conformations from
potentials of mean force. J. Mol. Biol. 216, 167–180.
Hinds, D. A. & Levitt, M. (1996). From structure to
sequence and back again. J. Mol. Biol. 258, 201–209.
Hirst, J. D. & Brooks, C. L. (1995). Molecular dynamics
simulations of isolated helices of myoglobin.
Biochemistry, 34, 7614–7621.
Holm, L. & Sander, C. (1994). Parser for protein folding
units. Proteins: Struct. Funct. Genet. 19, 256–268.
Huang, E. S., Subbiah, S. & Levitt, M. (1995). Recognizing
native folds by the arrangement of hydrophobic
and polar residues. J. Mol. Biol. 252, 709–720.
Ikura, T., Go, N., Kohda, D., Inagaki, F., Yanagawa, H.,
Kawabata, M., Kawabata, S., Iwanage, S., Noguti,
T. & Go, M. (1993). Secondary structural features of
modules m2 and m3 of barnase in solution by NMR
experiment and distance geometry calculation.
Proteins: Struct. Funct. Genet. 16, 341–356.
Islam, S. A., Luo, J. & Sternberg, M. J. E. (1995). Identification
and analysis of domains in proteins. Protein
Eng. 8, 513–525.
Jennings, P. A. & Wright, P. E. (1993). Formation of a
molten globule intermediate early in the kinetic
folding pathway of apomyoglobin. Science, 262,
892–896.
Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). A
new approach to protein fold recognition. Nature,
358, 86–89.
Kabsch, W. & Sander, C. (1983). Dictionary of protein
secondary structure: pattern recognition of hydrogen-bonded
and geometrical features. Biopolymers,
22, 2577–2637.
Kippen, A. D., Sancho, J. & Fersht, A. R. (1994). Folding
of barnase in parts. Biochemistry, 33, 3778–3786.
Klimov, D. K. & Thirumalai, D. (1996). Criterion that
determines the foldability of proteins. Phys. Rev.
Letters, 76, 4070–4073.
Koretke, K. K., Luthey-Schulten, Z. & Wolynes, P. G.
(1996). Self-consistently optimized statistical mechanical
energy functions for sequence-structure
alignment. Protein Sci. 5, 1043–1059.
Leopold, P. E., Montal, M. & Onuchic, J. N. (1992). Protein
folding funnels: a kinetic approach to the
sequence-structure relationship. Proc. Natl Acad. Sci.
USA, 89 (18), 8721–8725.
Levitt, M. (1992). Accurate modeling of protein conformation
by automatic segment matching. J. Mol. Biol.
226, 507–533.
Murphy, K. P., Bhakuni, V., Xie, D. & Freire, E. (1992).
Molecular basis of co-operativity in protein folding:
(III) structural identification of cooperative folding
units and folding intermediates. J. Mol. Biol. 227,
293–306.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C.
(1995). SCOP: a structural classification of proteins
database for the investigation of sequences and
structures. J. Mol. Biol. 247, 536–540.
Needleman, S. B. & Wunsch, C. D. (1970). A general
method applicable to the search for similarities in
the amino acid sequence of two proteins. J. Mol.
Biol. 48, 443–453.
Onuchic, J. N., Wolynes, P. G., Luthey-Schulten, Z. &
Socci, N. D. (1995). Toward an outline of the topography
of a realistic protein-folding funnel. Proc.
Natl Acad. Sci. USA, 92, 3626–3630.
Orengo, C. A., Jones, D. T. & Thornton, J. M. (1994).
Protein super-families and domain super-folds.
Nature, 372, 631–634.
Panchenko, A. R., Luthey-Schulten, Z. & Wolynes, P. G.
(1996). Foldons, protein structural modules, and
exons. Proc. Natl Acad. Sci. USA, 93, 2008–2013.
Richards, F. M. (1977). Areas, volumes, packing and protein
structure. Annu. Rev. Biophys. Bioeng. 6, 151–176.
Rooman, M. J., Kocher, J.-P. A. & Wodak, S. J. (1991).
Prediction of protein backbone conformation based
on seven structure assignments: influence of local
interactions. J. Mol. Biol. 221, 961–979.
Rooman, M. J., Kocher, J.-P. A. & Wodak, S. J. (1992).
Extracting information on folding from the amino
acid sequence: accurate predictions for protein
regions with preferred conformation in the absence
of tertiary interactions. Biochemistry, 31, 10226–10238.
Sali, A. & Blundell, T. L. (1993). Comparative protein
modelling by satisfaction of spatial restraints. J. Mol.
Biol. 234, 779–815.
Segawa, S.-I. & Richards, R. M. (1988). Identifications of
regions of potential flexibility in protein structures:
folding units and correlations with intron positions.
Biopolymers, 27, 23–40.
Simon, I., Glasser, L. & Scheraga, H. A. (1991). Calculation
of protein conformation as an assembly of
stable overlapping segments: application to bovine
pancreatic trypsin inhibitor. Proc. Natl Acad. Sci.
USA, 88, 3661–3665.
Socci, N. D. & Onuchic, J. N. (1994). Folding kinetics of
protein-like hetero-polymers. J. Chem. Phys. 101 (2),
1519–1528.
Sowdhamini, R., Rufino, S. D. & Blundell, T. L. (1996).
A database of globular protein structural domains:
clustering of representative family members into
similar folds. Folding Design, 1, 209–220.
Srinivasan, R. & Rose, G. D. (1995). LINUS: a hierarchic
procedure to predict the fold of a protein. Proteins:
Struct. Funct. Genet. 22, 81–99.
Unger, R. & Sussman, J. L. (1993). The importance of
short structural motifs in protein structure analysis.
J. Comp. Aid. Mol. Des. 7, 457–472.
Waterman, M. S., Smith, T. F. & Beyer, W. A. (1976).
Some biological sequence metrics. Advan. Maths. 20,
367–397.
Wetlaufer, D. B. (1981). Folding of protein fragments.
Advan. Protein Chem. 34, 61–92.
Wintjens, R. T., Rooman, M. J. & Wodak, S. J. (1996).
Automatic classification and analysis of α α-turn
motifs in proteins. J. Mol. Biol. 255, 235–253.
Edited by F. E. Cohen
(Received 18 February 1997; received in revised form 3 June 1997; accepted 3 June 1997)