
proteins

STRUCTURE • FUNCTION • BIOINFORMATICS

Benchmarking template selection and model quality assessment for high-resolution comparative modeling
M. I. Sadowski and D. T. Jones*
Bioinformatics Unit, Department of Computer Science, University College London, London WC1E 6BT, United Kingdom

ABSTRACT

Comparative modeling is presently the most accurate method of protein structure prediction. Previous experiments have shown the selection of the correct template to be of paramount importance to the quality of the final model. We have derived a set of 732 targets for which a choice of ten or more templates exists with 30–80% sequence identity and used this set to compare a number of possible methods for template selection: BLAST, PSI-BLAST, profile-profile alignment, HHpred HMM-HMM comparison, global sequence alignment, and the use of a model quality assessment program (MQAP). In addition, we have investigated the question of whether any structurally defined subset of the sequence could be used to predict template quality better than overall sequence similarity. We find that template selection by BLAST is sufficient in 75% of cases but that there are examples in which improvement (global RMSD 0.5 Å or more) could be made. No significant improvement is found for any of the more sophisticated sequence-based methods of template selection at high sequence identities. A subset of 118 targets extending to the lowest levels of sequence similarity was examined and the HHpred and MQAP methods were found to improve ranking when available templates had 35–40% maximum sequence identity. Structurally defined subsets in general are found to be less discriminative than overall sequence similarity, with the coil residue subset performing equivalently to sequence similarity. Finally, we demonstrate that if models are built and model quality is assessed in combination with the sequence-template sequence similarity, an extra 7% of best models can be found.

Proteins 2007; 69:476–485. © 2007 Wiley-Liss, Inc.

Key words: protein structure prediction; homology modeling; bioinformatics; profile-profile alignment; high-resolution modeling; MQAP.

Grant sponsor: Biosapiens Network of Excellence (funded by the European Commission within its FP6 Programme, under the thematic area Life sciences, Genomics and Biotechnology for Health); Grant number: LSHG-CT-2003-503265.
*Correspondence to: D.T. Jones, Bioinformatics Unit, Department of Computer Science, University College London, London WC1E 6BT, United Kingdom. E-mail: d.jones@cs.ucl.ac.uk
Received 10 October 2006; Revised 23 February 2007; Accepted 20 March 2007
Published online 10 July 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21531

INTRODUCTION

Homology modeling remains the only practical method of predicting protein structure with high accuracy. Despite many years of intense research effort into the prediction of structures from established physical principles, the best way to predict the structure of a protein is still to search for a protein of known structure, which is likely to adopt the same fold.1

The basis of this method is the observation by Chothia and Lesk2 that closely homologous proteins almost invariably share the same overall structure, and that the degree of core structural similarity can be quantified as a function of the identity between their sequences. Subsequent work has refined our understanding of this relationship3–5 to account for improved methods for assessing sequence similarity; however, the overall functional form remains the same.

Generating a template-based model is generally separated into the following steps: template identification, target-template alignment, model generation, and model refinement.6,7 Experience of blind modeling tasks in the CASP experiments8–13 has shown that template selection is still the most crucial step to the production of a high quality model and that only in rare cases can current methods produce models exceeding the template-target similarity.12,13

It follows that the two most crucial steps for a high-quality final model are the selection of the best possible template and generation of the correct alignment of the target sequence to the template structure. The question of generating a correct sequence-structure alignment for modeling is a well-studied one, and many techniques have been proposed for this problem.14–19 Conversely, the question of how to select the best template from a set of alternatives has not previously received much attention.

The reason that this question arises is that although the general form of the sequence-structure relationship is well understood and follows the results of Chothia and Lesk,2 there is a high variance in the structural similarity (Fig. 1). Once there is a choice of possible


templates with similar sequence identities for a particular target, it may be difficult to reliably choose the most similar.

Figure 1
The relationship between sequence identity and global RMSD. 57,760 domain pairs from the dataset derived for the study (see Methods) are plotted. Outlying values have been trimmed as in Wilson et al.20 to reveal the overall relationship.

Since the accumulation of structural information will continue to provide a greater number of alternatives for use when modeling, particularly in cases where the protein is of medical interest or biological significance, the importance of this question will increase over the next few years.

Recommendations for the template selection step vary slightly between authors: some recommend using BLAST.7 Others recommend using PSI-BLAST regardless, and this is the default approach in one recent homology modeling program, MOLIDE.21 Marti-Renom et al.6 suggest generating preliminary models and assessing their quality. It is not clear which strategy is preferable, or if more sophisticated methods such as profile-profile alignments or selecting a subset of the sequence may be useful. The success of profile-profile approaches in recent automated structure prediction exercises such as LiveBench22–25 may be seen by some to suggest this.

To investigate this question, we derived a set of cases for the rigorous assessment of methods for selecting templates. The specific problem that we address is the question of selecting the best possible template when modeling a single protein domain (which may or may not derive from a multidomain structure) in cases where there are ten or more potential templates available for homology modeling (homologues with 30–80% sequence identity) in a dataset derived from the CATH database. We examine the performance of standard methods (BLAST, PSI-BLAST) on the problem and explore the effectiveness of more sophisticated approaches: profile-profile alignments, HMM-HMM comparison, and assessing preliminary models with model quality assessment programs (MQAPs). Finally, we investigate which of a number of structurally-defined subregions of the alignment are most informative for predicting template quality.

MATERIALS AND METHODS

Benchmark set

The first step was to derive a benchmark set of protein structures with ten or more structurally characterized homologs within 30–80% sequence identity. The starting point for the set was the CATH database (release 2.6.0; ref. 26). We derived pairwise sequence identities for all CATH families based on structural alignments generated with TM-align.27 We then filtered the set of families to exclude all pairs with sequence identities outside the 30–80% range, membrane protein structures and crystal structures with >2.5-Å resolution.

From the filtered list of CATH family members, we then selected all domains for which ten or more templates remained. This resulted in a list of 732 target domains, forming the target list of structures for the benchmark set. Each domain was paired with its list of ten or more templates, for a total of 958 domains in the set (accounting for the large number of cases in which domains appear as templates for different targets, or both as templates and targets) with 59,897 total pairs in the set.

Structural similarities were derived for models generated using MODELLER with no energy minimization options as described below in the section on MQAP assessments. We used a sequence-dependent global RMSD measure to assess results.

To compare our results with those for more distant templates, we took a subset of 118 targets (at most 5 from each of the 24 CATH families) and created template lists without the lower sequence similarity limit used above. An additional 4675 models were generated from 1024 extra templates for these 118 targets.

Template selection methods

BLAST/PSI-BLAST searches

Sequence similarity searches using BLAST and PSI-BLAST28 were conducted using all 732 target domain sequences as queries. To ensure a ranking for the full list of templates was generated, we derived a BLAST sequence database containing only the 958 target and template domains using sequences derived from the ATOM records of the PDB files.

BLAST searches were run against the database of template sequences directly. PSI-BLAST was used to generate checkpoint files with the UniProt database (version 6.2; www.uniprot.org) over a given number of iterations,25 following which the template sequence database was searched with the resulting profile.
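The construction of the benchmark set described under "Benchmark set" can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the `pairs` tuples, the domain identifiers, and the function name `select_targets` are hypothetical, and the membrane-protein and resolution filters are assumed to have been applied beforehand.

```python
from collections import defaultdict

def select_targets(pairs, lo=30.0, hi=80.0, min_templates=10):
    """Return {target: [templates]} for targets with enough templates.

    pairs: iterable of (target, template, percent_identity) tuples,
    assumed to come from structural alignments within one CATH family.
    """
    templates = defaultdict(list)
    for target, template, identity in pairs:
        # Keep only pairs inside the 30-80% sequence identity window.
        if target != template and lo <= identity <= hi:
            templates[target].append(template)
    # Keep only targets that still have ten or more templates.
    return {t: sorted(ts) for t, ts in templates.items()
            if len(ts) >= min_templates}

# Toy data (hypothetical domain names): d0 keeps 10 templates, d1 only 2.
pairs = [("d0", f"t{i}", 35.0 + 4 * i) for i in range(10)]
pairs += [("d0", "t_far", 22.0), ("d0", "t_near", 95.0)]
pairs += [("d1", "t0", 50.0), ("d1", "t1", 60.0)]
benchmark = select_targets(pairs)
print(sorted(benchmark), len(benchmark["d0"]))
```

In this toy input, the templates at 22% and 95% identity fall outside the window, and `d1` is dropped because it has fewer than ten templates.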


Global sequence/profile alignments

Domain sequences were also aligned directly by global Smith-Waterman alignment29 using the BLOSUM 50 matrix30 and gap penalties of (12, 1). PSI-BLAST profiles generated in the step above (at four iterations) were also used in this process, either for the template or for the target sequence. Gap penalties of (12, 1) were again used for these alignments.

Profile/profile alignments

We generated profile-profile alignments using the Iteration 4 PSI-BLAST profiles and a global Smith-Waterman alignment algorithm with affine gap penalties. We were interested in optimizing the profile-profile alignments and therefore examined three scoring functions, each with 66 gap-penalty pairs as described below. To make the task tractable in reasonable time, we explored the parameter space using a 10% random subset of the 732 targets.

The three score functions investigated were the Pearson correlation coefficient (CC), dot product (DP), and Euclidean distance measures, using the same methodology as Marti-Renom et al.31 In each case, the vectors and score functions were normalized to produce scores in the range 0–1000. We also included the pseudocount correction to account for the amino acid background frequencies suggested by Ohlson et al.32 Gap opening penalties were investigated in increments of 50 from 100 to 600 inclusive; gap extension penalties were investigated in increments of 10 from 0 to 50 inclusive.

HHpred

Four iterations of PSI-BLAST were run as detailed above for all target and template sequences including those in the extended low sequence identity set. The HHpred software (v 1.2.0; ref. 33) was used to extract multiple sequence alignments from the output files and generate hidden Markov model (HMM) profiles. All target sequences were used to scan a custom database consisting of the target and template HMMs. The HMMs were uncalibrated and default parameters were used except to make the results output as inclusive as possible. Template rankings were generated using the raw scores generated by the software.

MODCHECK

Another possible strategy for template selection would be to use a model quality assessment program (MQAP) for ranking models. The potential advantage is that this allows the suitability of three-dimensional contacts to be assessed directly by the method. We used our own MQAP, MODCHECK,34 to rank models of the target structures. We used the program MODELLER35 with the model routine and no energy minimization to generate full-atom models using structural alignments generated by TM-align.

Subset selection and scoring

We examined the conservation of structurally-determined subsets of the template sequence as a possible predictor of template-target similarity. We selected subsets using either secondary-structure state, solvent accessibility or proximity to atoms in HETATM records as criteria. The rationale for the latter method was as a straightforward method of deriving potentially functional residues using only structural information. To augment this, we also incorporated information from SITE records in the PDB files.

Accessibility criteria were based on the calculated values from the DSSP program.36 These were converted to percentage accessibilities taking accessible surface areas for G-G-X-G-G pentapeptides used by mGenTHREADER37 (where X is the amino acid of interest) as 100%. These were selected in bands of 10%.

Secondary structures were converted to a three-state representation from the DSSP output by mapping EABP to strands, H to helices, and all others to coils. We examined results for selecting helices, strands, and coils separately and also for selecting both helices and strands.

Putatively functional residues were selected using distance cutoffs between any HETATM (except SO4 and H2O) and specific atoms in the amino acid residues. We investigated successively more stringent cutoffs based on distances to three atom types: distance to the α-carbon, distance to the β-carbon, or distance to any side chain atom. These three methods were tested with cutoffs of 5, 10, 15 Å for Cα distances; 5, 10 Å for Cβ distances; and 5, 8 Å for side chain distances. For glycine, the latter two distances were treated as α-carbon distances.

Conservation scoring

Sequence similarity measures were derived using the BLOSUM 62 matrix30 with the C_valdar conservation scoring38 to normalize residue similarity scores to the interval [0,1]. Sequence and subset similarities were taken as the arithmetic means of conservation scores for all positions in the selected set. Positions containing gaps were ignored. Where necessary (e.g., when comparing to the TM-score, which incorporates a scaling factor related to the length of the target sequence), the conservation score was scaled by multiplication with the proportion of target sequence coverage.

In all cases, we used the structural alignments generated with TM-align to avoid the problem of misalignment by sequence-based methods. Although the structural alignment between two proteins is not necessarily unique or definitive, the structures in question are very


similar and we observed that TM-align generated very similar alignments and RMSD values to SAP for this set.

Environmental factors

To follow up the suggestion that biophysical factors relating to the crystallization of the protein may be important, we assessed whether differing pH, temperature, quaternary state, resolution, and space group might have an effect on the choice of template. We derived the resolution, pH, temperature, and space group information from the REMARK 2, 200, 200, and 290 records of the original PDB files, respectively. Quaternary structure information was derived from the PQS database.39

These factors were then combined with sequence identity values to determine whether any beneficial effects on the ranking were obtained. For quaternary structure and space group information, we determined if the values were identical, different, or missing. Values of 1, 0, and 0.5 were assigned in each case respectively. Temperature and pH values were assessed as the difference between the target and template values. Where one or both values were missing a value of zero was assigned. Resolution information for the template structure was used to mark each pair. In each case, we added or subtracted (as appropriate) the given value times a weighting factor to the sequence identity score. Weights were varied over a large range (3–4 orders of magnitude) to determine whether any value would effect an improvement.

High resolution model quality assessment

To try to bring in as much structural information as possible to the quality assessment of final models, we combined the following broad range of component features to form a new MQAP called high resolution model quality assessment (MODCHECK-HD):

1. Original MODCHECK scores: Scores on sequence-to-structure compatibility are produced using the MODCHECK program.34 This program makes use of pair potentials of mean force.
2. Hydrogen bonding: The HBPLUS program40 was used to find all hydrogen bonds in a given model involving both side chain and main chain groups. A score function was used based on a directional Morse potential to preferentially score residues, which have ideal geometry.41 Hydrogen bonds involving side chains were summed separately, as were hydrogen bonds within helices and turns (sequence separation 6 or less).
3. Solvation: The solvation potential proposed by Lazaridis and Karplus42 was used to evaluate detailed atomic solvation for each model.
4. Main chain and side chain torsion angles: The main chain and side chain torsion angles were compared to those found in highly resolved crystal structures. Rather than using fixed bins, probabilities of particular main chain and side chain conformations were obtained by calculating the mean absolute difference across all corresponding angles between the target residue and all of the residues of the same type in the high resolution set. A threshold of 10° was found to be optimal. Any residue in the data set with torsion angles within this threshold was counted as a positive observation. For side chains, both the main chain and side chain torsion angles are compared to ensure that the rotamer probabilities are conditional on the main chain conformation. The logarithms of the observed relative frequencies were used to provide additive energy-like terms.
5. van der Waals interactions: We calculate a standard Lennard-Jones potential for all nonbonded atoms in the model. Attractive and repulsive terms are summed separately and the potential is softened at close atom separations as proposed by Kuhlman et al.43
6. Stereochemistry: Finally, we take the summary scores generated by Procheck44 which evaluate the overall stereochemical quality of the model.

The above methods produced a total of 30 features. To calculate a single overall score, the 30 terms were combined linearly and assigned variable weights, which were optimized using a nongradient-based pattern search optimization algorithm over the decoy set of Tsai et al.45 The target function in this case was the SSE (sum of squared error) between the weighted sums of terms and the RMSDs between the decoy structure and the experimentally observed structure. Since this method involves assessment of hydrogen bonding quality, we also assessed models with additional side chain optimization using SCWRL.46

RESULTS

Benchmark set

The composition of the benchmark set of 732 protein domains ordered by CATH code is shown in Table I. As a consequence of the procedure used for constructing the dataset, we found that a little over half of the dataset were structures from the immunoglobulin superfamily (IG; CATH code 2.60.40.10), many of which provided more than one hundred possible templates for the chosen target. To exclude the possibility that this might bias the set, we report results including and excluding these sequences.

Assessing model quality

The RMSD score between two different protein chains can be hard to define unambiguously. Initial analyses


were performed using RMSD values from SAP and TM-align and more sophisticated measures such as the GDT-TS and TM scores. We found the GDT-TS and TM-scores to be insufficiently sensitive for such close levels of similarity. However, both SAP47 and TM-align48 generate RMSD values for a subset of residues, which can vary considerably between different members of a family when there are reasonably sized structurally divergent regions. We therefore used MODELLER to derive full models from structural alignments generated with TM-align and generated backbone RMSDs using a sequence-dependent superposition, since this was found to provide more stable estimates of target-template similarity. In addition, this has the advantage that MQAP-based scores are more meaningful and addresses the question of template selection in the light of which produces the best model, which may not bear a simple relation to the target-template RMSD.

Table I
Composition of the Dataset

CATH code       N  Brief name
1.10.10.60      3  Arc-repressor, homeodomain-like
1.10.220.10    23  Annexin V
1.10.238.10    26  EF-Hand
1.10.490.10    33  Globin
1.10.490.20     1  Phycocyanin
1.10.510.10    18  Phosphotransferase
1.10.530.10    13  Lysozyme
1.10.760.10    19  Cytochrome C
1.20.90.10     34  Phospholipase A2
2.10.60.10      5  CD59
2.30.30.40     11  SH3
2.40.70.10     24  Cathepsin D, subunit A, domain 1
2.40.128.20    11  Lipocalin
2.60.40.10    377  Immunoglobulin
2.60.40.420     4  Cupredoxin
2.60.120.200   17  Jellyroll (Agglutinin)
3.10.20.30      1  Ubiquitin-like
3.10.100.10    17  Mannose-binding protein A, subunit A
3.20.20.80     11  TIM Barrel (Glycosidase)
3.30.200.20    17  Phosphorylase Kinase; domain 1
3.30.500.10     3  Murine class I major histocompatibility complex, H2-DB, subunit A, domain 1
3.40.50.300    13  Rossmann fold (P-loop NTPases)
3.40.50.720    15  NAD(P)-binding Rossmann-like domain
3.40.390.10     5  Collagenase (catalytic domain)
3.40.710.10    11  DD-peptidase/β-lactamase superfamily
3.90.70.10     19  Cathepsin B domain A Cysteine proteinase
3.90.110.10     1  L-2-hydroxyisocaproate dehydrogenase, subunit A, domain 2

The CATH codes and descriptions of the fold group (and homologous superfamily where necessary) are shown with the number of targets (N) in the dataset.

Performance of standard methods

The results of the BLAST, PSI-BLAST, profile-profile, global sequence-sequence alignment and MQAP based methods of template selection as judged by the global RMSD are shown as a histogram in Figure 2. It is clear from this figure that BLAST is very effective at choosing a good template, finding a template within 0.25-Å RMSD of the best possible for 58% of the target set with immunoglobulins included or 61% with immunoglobulins excluded (Fig. 3), 75% at 0.5 Å or better in both cases. It is interesting to compare this with the alteration in core RMSD generated for MODELLER models (Fig. 4).

Figure 2
Performance of standard methods on template selection. The results of applying twelve methods to selecting template structures for the 732-member benchmark set are shown. For each method, the proportion of the target set for which the top-ranked template selected is within a given RMSD value of the best possible for the set is reported. Identical PSI-BLAST iterations are not shown. Columns are arranged left to right in decreasing order of the value in the first bin. Methods shown are: BLAST, PSI-BLAST (pb_1-5); global alignment of template and target sequences (b50global), target profile versus template sequence (profseq) and target sequence versus template profile (seqprof); profile-profile alignments with three scoring functions (Euclidean distance: pp_ed, correlation coefficient: pp_cc, dot product: pp_dp), HMM-HMM comparison with HHpred (hhpred) and assessing models generated with MODELLER using the MQAPs MODCHECK (modchk) or the new MQAP (modchk-HD). The optimized combination of the new MQAP and BLAST bit scores is plotted as mchd-blast. Also shown are the results of always choosing the second or third best possible template (labelled secbest, thrdbest, respectively), to illustrate upper bounds on performance, and selecting a template at random (random) as a negative control.

Figure 3
Template selection performance without immunoglobulin sequences. The plot is the same as Figure 2 for the dataset with immunoglobulin sequences removed.
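The quantity plotted in Figures 2 and 3, the proportion of targets for which the top-ranked template lies within a given RMSD of the best available one, can be written down directly. The sketch below assumes a hypothetical data layout (per-target dictionaries of method scores and model RMSDs, with higher scores ranked first); it illustrates the metric and is not the authors' evaluation code.

```python
def fraction_within(scores, rmsds, threshold):
    """Fraction of targets whose top-scoring template is within
    `threshold` (in Angstroms) of the best template's RMSD.

    scores: {target: {template: score}}, higher score = ranked first.
    rmsds:  {target: {template: model-to-target RMSD}}.
    """
    hits = 0
    for target, by_template in scores.items():
        top = max(by_template, key=by_template.get)  # the method's choice
        best = min(rmsds[target].values())           # best achievable RMSD
        if rmsds[target][top] - best <= threshold:
            hits += 1
    return hits / len(scores)

# Toy example with two targets (all values invented for illustration).
scores = {"A": {"t1": 200.0, "t2": 180.0}, "B": {"t1": 90.0, "t2": 120.0}}
rmsds = {"A": {"t1": 1.5, "t2": 1.0}, "B": {"t1": 1.25, "t2": 1.5}}
for cutoff in (0.25, 0.5):
    print(cutoff, fraction_within(scores, rmsds, cutoff))
```

Evaluating the same function at several thresholds gives the bins of the histogram; applying it to rankings from different methods gives the columns.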

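As a quick check on the parameter search described in the Methods for the profile-profile alignments, the gap-penalty grid can be enumerated explicitly. The stated ranges give 11 opening and 6 extension penalties, that is, the 66 pairs quoted in the text. The variable names here are illustrative only.

```python
from itertools import product

# Gap opening: 100 to 600 inclusive in steps of 50.
gap_open = list(range(100, 601, 50))
# Gap extension: 0 to 50 inclusive in steps of 10.
gap_extend = list(range(0, 51, 10))
# Every (opening, extension) combination examined per scoring function.
grid = list(product(gap_open, gap_extend))

print(len(gap_open), len(gap_extend), len(grid))
# The (100, 20) setting reported for the dot-product score lies on this grid.
print((100, 20) in grid)
```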

Figure 4
Comparison of template-target RMSDs with model-target RMSDs. Models were generated using MODELLER with no energy minimization (see Methods). RMSD values were assessed using TM-align.27

We experimented with several possible sets of BLAST parameters including different gap penalties, use (or not) of full Smith-Waterman alignments and the BLOSUM 80 matrix to account for the short evolutionary distance. The ordering of the structures in the results did not change under any of these circumstances, as a consequence of the high level of identity between the templates and the targets.

PSI-BLAST does not offer any significant improvement over BLAST, finding a target within 0.5 Å of the best possible for 76% of the target set. This result was also robust to changes in PSI-BLAST parameters including the above and also the profile inclusion threshold. Changes were observed with very extreme changes in the profile exclusion threshold; however, they always represented a substantial degradation of performance (data not shown).

Global alignment of target profiles with template sequences resulted in the same performance as PSI-BLAST, 76% of targets within 0.5 Å of the best possible template. Using target sequences with template profiles resulted in slightly worse performance, 74% of the target set coming within 0.5 Å of the best possible model.

We found that using profile-profile alignments provides a very modest improvement in final model quality, although the results were very sensitive to the choice of scoring function and gap-penalties. We found that the dot-product score function provided the best performance, identifying a target within 0.5 Å of the best possible for 77% of the target set with gap-penalties of (100, 20). The correlation coefficient score function gave performance roughly equal to BLAST and the Euclidean distance-based score function significantly lower performance at 67%. The HHpred software results for the high sequence identity set were found to be slightly inferior to the results generated by MODCHECK, identifying a target within 0.5 Å of the best possible for 58% of the target set.

The MQAP MODCHECK, which uses classical pair-potentials to assess model quality, performed poorly overall compared with the sequence-based methods, correctly identifying a model within 0.5 Å of the best possible for only 56% of the target set. We therefore investigated the use of a novel MQAP, MODCHECK-HD, using a wide range of structure-based calculations including hydrogen bonding and van der Waals interactions (see Methods). The new MQAP method was substantially better than MODCHECK at this task, finding models within 0.5 Å of the best possible for 74% of the target set, only 1% worse than BLAST. Assessment of models further optimized with SCWRL did not affect performance (data not shown). To exploit the potential of both methods, a hybrid method for predicting model-target RMSDs using a weighted combination of the MODCHECK-HD values and the BLAST bit scores was developed. Optimizing the weight of the MODCHECK-HD score to 60 resulted in a moderate improvement in performance, shifting 52 targets (7% of the set) to within 0.5 Å of the best possible RMSD relative to BLAST alone.

Combining environmental factors (pH, equivalent quaternary structure, temperature, space group) did not produce any improvement in template identification by sequence identity with any weighting. Increasing the weighting of these environmental factors served only to decrease the performance relative to BLAST alone (data not shown).

At which point do more sophisticated methods become useful?

Since profile-profile and HMM-comparison based methods are generally known to improve fold recognition performance,22–25,31–33 we were interested in establishing whether there was a point beyond which BLAST template rankings were inferior to those derived by the more sophisticated profile comparison methods.

To investigate this, we extended a subset of 118 of the template lists (at most five targets from each of the 24 CATH families in the original data set) to include potential templates below 30% sequence identity. For these, we generated results using BLAST, PSI-BLAST (four iterations), MODCHECK, the three profile-profile methods, HHpred, and model quality scores with sequence-dependent superpositions using MODELLER models. Models were gradually excluded from the dataset based on sequence identities derived from TM-align structural alignments in steps of 5% from 80% down to 5%. The numbers of models available at each point were as follows: 7478, 7458, 7410, 7326, 7153, 6997, 6704, 6338, 5937, 5517, 5052, 4675, 4021, 3385, 2246, and 595.
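The progressive exclusion of close templates can be pictured as a sweep over identity cut-offs. The sketch below is a hedged illustration with invented identities: the function name and the choice of keeping models at or below each cut-off are assumptions, not details taken from the paper. Stepping from 80% down to 5% in 5% steps yields sixteen cut-offs, which matches the sixteen model counts listed above.

```python
def models_per_cutoff(identities, start=80, stop=5, step=5):
    """Count models whose template identity is <= each cut-off.

    identities: percent sequence identities, one per model.
    Returns a list of (cutoff, count) from `start` down to `stop`.
    """
    cutoffs = range(start, stop - 1, -step)
    return [(c, sum(1 for i in identities if i <= c)) for c in cutoffs]

# Invented identities, for illustration only.
identities = [78.0, 64.5, 52.0, 41.0, 33.0, 28.5, 17.0, 9.0, 4.0]
sweep = models_per_cutoff(identities)
print(len(sweep))            # number of cut-offs in the sweep
print(sweep[0], sweep[-1])   # first and last (cutoff, count) pairs
```

Because each cut-off only removes models, the counts can never increase along the sweep, which is the pattern seen in the list of model numbers.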

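The hybrid MQAP/BLAST ranking described above can be sketched as a weighted linear combination. The text gives the optimized weight (60) but not the exact functional form, so the formula below is an assumption: the MQAP output is treated as an RMSD-like value where lower is better, and it is subtracted, after weighting, from the BLAST bit score. All names and numbers are hypothetical.

```python
def hybrid_rank(bit_scores, mqap_scores, weight=60.0):
    """Rank templates by bit score minus weight * MQAP value (assumed form).

    bit_scores:  {template: BLAST bit score}, higher is better.
    mqap_scores: {template: predicted RMSD-like value}, lower is better.
    Returns template names, best first.
    """
    combined = {t: bit_scores[t] - weight * mqap_scores[t] for t in bit_scores}
    return sorted(combined, key=combined.get, reverse=True)

# Invented values: BLAST alone prefers t1, the combination prefers t2.
bits = {"t1": 210.0, "t2": 200.0, "t3": 150.0}
mqap = {"t1": 1.5, "t2": 1.0, "t3": 1.25}
print(sorted(bits, key=bits.get, reverse=True)[0])
print(hybrid_rank(bits, mqap))
```

Setting `weight=0.0` recovers the plain BLAST ordering, which makes it easy to see which targets change their top-ranked template as the MQAP term is given more influence.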

Figure 5
Comparison of BLAST and HHpred performance for template sets with increasing evolutionary distance. Cumulative proportion of the 118-member set of targets with low sequence identity templates (see text) correctly identified with d-RMSD within 1.5 Å is plotted against the maximum sequence identity of any template on the x-axis.

Both MODCHECK and HHpred were found to improve on the ranking with the crossover taking place between 35 and 40% sequence identity. A comparison of HHpred and BLAST is shown in Figure 5. The proportion of targets for which d-RMSD is below 1.5 Å is plotted. The location of the crossing point was not found to alter with this threshold; hence, the value of 1.5 Å was chosen as a point with a large dynamic range for sake of clarity. Data for below 15% sequence identity are not plotted as results become unreliable at this point as a consequence of a reduction in data set size (number of targets with three or more possible templates drops below 70).

The other profile-profile methods all monotonically decrease in performance with sequence identity threshold (data not shown), which we believe is a consequence of having optimized the methods for higher sequence-similarity ranges (gap penalties, use of global alignment, etc.).

Subset measures

We also investigated whether assessing the conservation of structurally-defined subsets of the sequence might assist template selection. Subsets were selected based either on solvent accessibility, secondary structure, or proximity to nonprotein atoms (except water or sulfate ions). The last set was defined to approximate the use of functional information on the basis that close homologs with conserved function should be more structurally similar than close homologs with unconserved function.

The results of the subset selection methods are shown in Figures 6–8. In comparison with using the overall sequence conservation none of the subsets offers any improvement, although the coil subset shows almost the same level of performance. None of the accessibility-defined subsets performed as well as overall sequence similarity, with performance noticeably decreasing with increasing accessibility.

Figure 6
Conservation of accessibility defined sequence subsets as predictors of template similarity. The axes are as in Figure 2. Residues were divided into bins by solvent accessibility in bands of 10%. The conservation of each bin was calculated using a conservation metric based on the BLOSUM 62 matrix. Structural alignments were used to minimize alignment error. For comparison, the performance of the whole sequence is also plotted.

Figure 7
Template selection performance of subsets defined by secondary structure. The plot is the same as Figure 3 for subsets defined by secondary-structural state (as identified by DSSP). The performance of full sequence similarity is plotted for reference.

Functionally-defined subsets are difficult to assess since many of the templates do not have any ligands bound. This changes the properties of the subset so results are not strictly comparable to those above, and reduces the set of targets to 227. As a guideline we can assess the likelihood of improvement by comparing the results with very large thresholds (i.e. α-carbon within 15 Å) with those for more stringent cutoffs. Since we are selecting


successively smaller proportions of the sequence we expect that performance
will degrade when fewer residues are selected (as happens with random
selections). This is what we observe with the function-defined subsets. It
therefore seems that if conservation of function does imply greater
structural similarity this method does not reliably capture it.

Figure 8
Template selection performance of subsets defined by proximity to nonprotein
constituents of PDB files. The plot is the same as Figure 3 for subsets
defined by proximity to nonwater and nonsulfate-ion HETATMs. Residues are
included in the subset if a specific atom type is within a given radius of
any atom of the hetgroup. Subsets are identified by the atom type and the
radius in Angstroms. Atom types were: α-carbon (αr), β-carbon (βr) or any
side chain atom (scr). The performance of overall sequence similarity was
derived separately for this set and is plotted for reference. The target set
used in this instance does not contain immunoglobulin structures.

DISCUSSION

A benchmark set for detailed homology modeling

Our results demonstrate that much work is needed to produce homology models
for proteins of unknown structure that are correct in fine details even at
high levels of sequence similarity. Aside from the obvious problems of loop
modeling and finding structures for N- and C-termini not covered by
templates, the alterations introduced into the core structure when filling
gaps almost invariably worsen the model quality. Our dataset, along with our
analysis of results, provides a good benchmark set for developing and testing
methods for detailed homology modeling. The dataset, together with raw data,
is available for download from our website
(http://bioinf.cs.ucl.ac.uk/biosapiens/modeling/).

Template selection is not always trivial

The main conclusion of the study is simply that BLAST is quite adequate for
template selection in most cases, but still fails to find the correct
template in a significant number of cases. In general, we find that BLAST
identifies the best template according to global RMSD or TM-score only in a
minority of cases (25-40%) but it finds a reasonable template (within 0.5 Å
global RMSD or 0.05 TM-score units) in the majority (75-80%). If we require
only a template within 1 Å of the best possible, we find that all the methods
tested perform similarly, succeeding for 90% of the target set.
Thus, although it is usually sufficient to take the top BLAST hit as a
template for homology modeling, there are clearly cases in which a more
sophisticated approach is necessary to find the best starting template. Our
results suggest that some of the more sophisticated approaches tested can
offer an improvement in many of these cases; however, at present, it is not
clear exactly how best to combine them so as to maximize correct selection in
every situation. We intend to pursue this question further in the future.
As expected, we found that the ability of the more sophisticated HMM-HMM
comparison method HHpred to rank very similar targets was inferior to that of
simpler approaches (BLAST, PSI-BLAST) but that once template-target
similarity drops below 40% identity, the HHpred method ranks the templates
more accurately. The MODCHECK MQAP utility, similarly optimized on benchmarks
using more divergent models, shows the same behavior. This indicates that a
combined strategy for selecting templates should be adopted with the
crossover point set at roughly 40% ID.
Since this is the first full study of template selection per se, we cannot
compare this result directly with any previous results; however,
Contreras-Moreira et al.49 provided some cursory data, using 226 SCOP
families, indicating that the best template could be found by BLAST 75% of
the time. Although it is not entirely clear how the best template was defined
in this case, we have obtained a similar result, with the additional finding
that more sophisticated methods do not improve the performance.
Some authors6,50 suggest that the environment of the structure (pH,
cocrystallization with similar ligands, multidomain context, etc.) is an
important consideration in template selection. We have investigated these
claims using pH, space group, temperature, resolution, and quaternary
structure, and concluded that these factors do not provide a significant
degree of extra information for determining the structure of the single
domain. Since the data required is not always available, these results must
be treated with a little caution; however, the general conclusion seems
reasonably clear. Obviously, there are cases in which these factors may be
important for a particular model and the question of cocrystallization is
very difficult to assess on such a large dataset; however, the question of
what precisely determines the difference between structures at this level
remains open.
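
The success criterion used in this section (a selected template counts as acceptable if it lies within a fixed tolerance of the best available template) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the evaluation code used in the study: the function name `selection_success_rate`, the input layout, and the RMSD values are all hypothetical.

```python
def selection_success_rate(targets, tolerance=0.5):
    """Fraction of targets whose selected template is within `tolerance`
    (Å of RMSD) of the best candidate template for that target.

    `targets` is a list of (selected_rmsd, candidate_rmsds) pairs, where
    candidate_rmsds holds the RMSD of every available template, including
    the selected one. This layout is a hypothetical one for illustration.
    """
    if not targets:
        raise ValueError("no targets supplied")
    hits = 0
    for selected_rmsd, candidate_rmsds in targets:
        best_rmsd = min(candidate_rmsds)  # lower RMSD = closer to the target
        if selected_rmsd - best_rmsd <= tolerance:
            hits += 1
    return hits / len(targets)


# Made-up RMSD values (Å) for three targets, for illustration only.
example = [
    (1.0, [1.0, 2.0, 3.1]),  # the best template was selected
    (1.4, [1.0, 1.4, 2.2]),  # 0.4 Å worse than the best
    (1.8, [1.0, 1.8, 2.5]),  # 0.8 Å worse than the best
]
strict = selection_success_rate(example, tolerance=0.5)
loose = selection_success_rate(example, tolerance=1.0)
```

Relaxing the tolerance can only raise the success rate, which is consistent with the pattern reported above when moving from the 0.5 Å criterion to the 1 Å criterion.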

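
The combined strategy suggested above, with a crossover at roughly 40% identity, might look like the following sketch. It is an assumption-laden illustration rather than a published procedure: the `Candidate` fields, the `select_template` function, and the rule that the regime is decided by the closest available template are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    template_id: str
    percent_identity: float  # target-template sequence identity, 0-100
    blast_bits: float        # BLAST bit score (higher is better)
    profile_score: float     # an HHpred-style score (higher is better)


def select_template(candidates, crossover=40.0):
    """Rank by BLAST bit score above the crossover, by profile score below."""
    if not candidates:
        raise ValueError("no candidate templates")
    # Assumption: the regime is decided by the closest available template.
    closest = max(c.percent_identity for c in candidates)
    if closest >= crossover:
        return max(candidates, key=lambda c: c.blast_bits)
    return max(candidates, key=lambda c: c.profile_score)


# Hypothetical candidates; identifiers and scores are invented.
close_set = [
    Candidate("tmplA", 72.0, 310.0, 95.0),
    Candidate("tmplB", 65.0, 280.0, 99.0),
]
remote_set = [
    Candidate("tmplC", 33.0, 120.0, 80.0),
    Candidate("tmplD", 28.0, 125.0, 91.0),
]
picked_close = select_template(close_set).template_id
picked_remote = select_template(remote_set).template_id
```

A hard switch is the simplest reading of the crossover; a weighted blend of the two scores near 40% identity would be an equally plausible design, and the text leaves the best way of combining methods open.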

Conservation of structurally-defined subsets is no more informative than
conservation of the whole sequence

Since different parts of the sequence are known to be of different importance
in defining the structure of a protein (e.g. by φ-value analysis), we are
interested in the idea that identifying these residues and assessing their
conservation may be a useful way to improve recognition of similar structures
at all levels of similarity.
In this article, we have examined one simple way in which such a subset might
be defined, using structural criteria that are straightforward to derive.
Although we did not find that any of the subsets performed better than the
overall similarity of the sequence, we have found that the set of coil
residues performs equally well despite ignoring 30% of the sequence on
average.
It seems also that some of the most buried (and a few of the most exposed)
residues may also be useful in this definition; however, the broad selection
strategy that we have used has not permitted selection of useful subsets so
far. We are presently investigating a number of other strategies for
identifying such residues.

Optimal choice of score function for profile-profile alignment relates to the
measure of success

When using profile-profile-based scores, we observed that the choice of score
function and gap penalties was very important for good performance and that
different functions performed better for different measures of template
similarity. In general, it is necessary to balance the overall RMSD with the
coverage of the target, in terms of the number of equivalent residues.
We found that the dot product and correlation coefficient score functions
were both better for predicting the coverage of the target by the template
structures and the TM-score, which is strongly correlated with the coverage
of the target. Conversely, the Euclidean distance score function was very
poor at predicting coverage values but performed well at predicting RMSDs.
When assessing final model RMSDs, the dot-product score function was the most
successful, indicating that it achieves a good balance between predicting
coverage and RMSD.
Previous studies of the choice of score function for profile-profile
alignment31,32,51,52 have focused on optimizing for detection of remote
homology and generating correct alignments. Our results indicate that their
results do not directly transfer to the question of correctly ranking
homologs for structural similarity.

CONCLUSION

Although the general nature of sequence-structure relationships follows the
form originally observed by Chothia and Lesk,2 the detailed structure of the
relationship is not sufficiently clear to ensure selection of the best
available template for generating a homology model when several choices are
available. We have shown that this general relationship allows the prediction
of 75% of our target set to within 0.5 Å of the best possible (90% to within
1 Å) using BLAST alone. Using more sophisticated methods such as PSI-BLAST
and profile-profile methods did not provide a significant improvement,
possibly as a result of dilution of the basic signal of sequence identity.
Although this is a high level of accuracy, it is still not sufficient to
avoid making noticeable errors when homology modeling predictions are made
automatically for entire proteomes.
We also combined target-template sequence similarity measures (BLAST bit
scores) with a highly detailed set of structure-based model quality criteria
(MODCHECK-HD). These results did show a useful improvement over BLAST for the
set of highly similar templates, adding an extra 52 cases (an extra 7%),
which could be accurately modeled compared to BLAST alone. The downside in
this case is that the models have to be built before the model quality can be
assessed, and so both alignment and modeling errors will get in the way of
finding the best templates.
Finally, we identified the point at which profile-profile methods outperform
single-sequence-based methods in ranking templates by structural similarity.
We found that below 40% sequence identity, it is preferable to use MODCHECK
or HHpred or similar scores in selecting targets, whereas above this level
the results of simpler methods such as BLAST should be preferred. This result
should be useful in the future development of hybrid servers, which aim to
perform with high accuracy across the full range of template-based modeling
tasks.

ACKNOWLEDGMENTS

We thank Dr. Kevin Bryson for commenting on an earlier version of the
manuscript and the anonymous reviewer for useful suggestions.

REFERENCES

1. Floudas CA, Fung HK, McAllister SR, Monnigmann M, Rajgaria R. Advances in protein structure prediction and de novo protein design: a review. Chem Eng Sci 2006;61:966-988.
2. Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J 1986;5:823-826.
3. Sander C, Schneider R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 1991;9:56-68.
4. Rost B. Twilight zone of protein sequence alignments. Protein Eng 1999;12:85-94.
5. Wood TC, Pearson WR. Evolution of protein sequences and structures. J Mol Biol 1999;291:977-995.
6. Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Ann Rev Biophys Biomol Struct 2000;29:291-325.


7. Krieger E, Nabuurs SB, Vriend G. Homology modeling. In: Bourne PE, Weissig H, editors. Structural Bioinformatics. New York: Wiley-Liss; 2003. pp 507-521.
8. Mosimann S, Meleshko R, James MNG. A critical assessment of comparative molecular modeling of tertiary structures of proteins. Proteins 1995;23:301-317.
9. Martin ACR, MacArthur MW, Thornton JM. Assessment of comparative modeling in CASP2. Proteins 1997;S1:14-28.
10. Jones TA, Kleywegt GJ. CASP3 comparative modeling evaluation. Proteins 1999;S3:30-46.
11. Tramontano A, Leplae R, Morea V. Analysis and assessment of comparative modeling predictions in CASP4. Proteins 2001;S5:22-38.
12. Tramontano A, Morea V. Assessment of homology-based predictions in CASP5. Proteins 2003;53:352-368.
13. Tress M, Ezkurdia I, Grana O, Lopez G, Valencia A. Assessment of predictions submitted for the CASP6 comparative modelling category. Proteins 2005;S7:27-45.
14. Alexandrov NN, Luethy R. Alignment algorithm for homology modeling and threading. Protein Sci 1998;7:254-258.
15. Sauder JM, Arthur JW, Dunbrack RL. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 2000;40:6-22.
16. Prasad JC, Comeau SR, Vajda S, Camacho CJ. Consensus alignment for reliable framework prediction in homology modeling. Bioinformatics 2003;19:1682-1691.
17. Zachariah MA, Crooks GE, Holbrook SR, Brenner SE. A generalized affine gap model significantly improves protein sequence alignment accuracy. Proteins 2005;58:329-338.
18. Birzele F, Gewehr JE, Zimmer R. QUASAR: scoring and ranking of sequence-structure alignments. Bioinformatics 2005;21:4425-4426.
19. Madhusudhan MS, Marti-Renom MA, Sanchez R, Sali A. Variable gap penalty for protein sequence-structure alignment. Protein Eng Des Sel 2006;19:129-133.
20. Wilson CA, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 2000;297:233-249.
21. Canutescu AA, Dunbrack RL. MolIDE (Molecular integrated development environment): a homology modeling framework you can click with. Bioinformatics 2005;21:2914-2916.
22. Bujnicki JM, Elofsson A, Fischer D, Rychlewski L. LiveBench-1: continuous benchmarking of protein structure prediction servers. Protein Sci 2001;10:352-361.
23. Bujnicki JM, Elofsson A, Fischer D, Rychlewski L. LiveBench-2: large-scale automated evaluation of protein structure prediction servers. Proteins 2001;S5:184-191.
24. Rychlewski L, Fischer D, Elofsson A. LiveBench-6: large-scale automated evaluation of protein structure prediction servers. Proteins 2003;53(Suppl 6):542-547.
25. Rychlewski L, Fischer D. LiveBench-8: the large-scale, continuous assessment of automated protein structure prediction. Protein Sci 2005;14:240-245.
26. Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett C, Marsden R, Grant A, Lee D, Akpor A, Maibaum M, Harrison A, Dallman T, Reeves G, Diboun I, Addou S, Lise S, Johnston C, Sillero A, Thornton J, Orengo C. The CATH domain structure database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res 2005;33:D247-D251.
27. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005;33:2302-2309.
28. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389-3402.
29. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 1981;147:195-197.
30. Henikoff S, Henikoff JG. Amino-acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992;89:10915-10919.
31. Marti-Renom MA, Madhusudhan MS, Sali A. Alignment of protein sequences by their profiles. Protein Sci 2004;13:1071-1087.
32. Ohlson T, Wallner B, Elofsson A. Profile-profile methods provide improved fold-recognition: a study of different profile-profile alignment methods. Proteins 2004;57:188-197.
33. Soding J. Protein homology detection by HMM-HMM comparison. Bioinformatics 2005;21:951-960.
34. Pettitt CS, McGuffin LJ, Jones DT. Improving sequence-based fold recognition by using 3D model quality assessment. Bioinformatics 2005;21:3509-3515.
35. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993;234:779-815.
36. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983;22:2577-2637.
37. Valdar WSJ. Scoring residue conservation. Proteins 2002;48:227-241.
38. McGuffin LJ, Jones DT. Improvement of the GenTHREADER method for fold recognition. Bioinformatics 2003;19:874-881.
39. Brooksbank C, Camon E, Harris MA, Magrane M, Martin MJ, Mulder N, O'Donovan C, Parkinson H, Tuli MA, Apweiler R, Birney E, Brazma A, Henrick K, Lopez R, Stoesser G, Stoehr P, Cameron G. The European Bioinformatics Institute's data resources. Nucleic Acids Res 2003;31:43-50.
40. McDonald IK, Thornton JM. Satisfying hydrogen bonding potential in proteins. J Mol Biol 1994;238:777-793.
41. Jones DT, Bryson K, Coleman A, McGuffin LJ, Sadowski MI, Sodhi JS, Ward JJ. Prediction of novel and analogous folds using fragment assembly and fold recognition. Proteins 2005;S7:143-151.
42. Lazaridis T, Karplus M. Effective energy function for proteins in solution. Proteins 1999;35:133-152.
43. Kuhlman B, Baker D. Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci USA 2000;97:10383-10388.
44. Laskowski RA, MacArthur MW, Moss DS, Thornton JM. PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Crystallogr 1993;26:283-291.
45. Tsai J, Bonneau R, Morozov AV, Kuhlman B, Rohl CA, Baker D. An improved protein decoy set for testing energy functions for protein structure prediction. Proteins 2003;53:76-87.
46. Canutescu AA, Shelenkov AA, Dunbrack RL. A graph theory algorithm for protein side-chain prediction. Protein Sci 2003;12:2001-2014.
47. Taylor WR. Protein structure comparison using iterated double dynamic programming. Protein Sci 1999;8:654-665.
48. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins 2004;57:702-710.
49. Contreras-Moreira B, Fitzjohn PW, Bates PA. In silico protein recombination: enhancing template and sequence alignment selection for comparative protein modelling. J Mol Biol 2003;328:593-608.
50. Fiser A. Protein structure modeling in the proteomics era. Expert Rev Proteomics 2004;1:97-110.
51. Wang GL, Dunbrack RL. Scoring profile-to-profile sequence alignments. Protein Sci 2004;13:1612-1626.
52. Edgar RC, Sjolander K. A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 2004;20:1301-1308.
