ABSTRACT
Comparative modeling is presently the most accurate method of protein structure prediction. Previous experiments have shown the selection of the correct template to be of paramount importance to the quality of the final model. We have derived a set of 732 targets for which a choice of ten or more templates exists with 30–80% sequence identity and used this set to compare a number of possible methods for template selection: BLAST, PSI-BLAST, profile-profile alignment, HHpred HMM-HMM comparison, global sequence alignment, and the use of a model quality assessment program (MQAP). In addition, we have investigated the question of whether any structurally defined subset of the sequence could be used to predict template quality better than overall sequence similarity. We find that template selection by BLAST is sufficient in 75% of cases but that there are examples in which improvement (global RMSD 0.5 Å or more) could be made. No significant improvement is found for any of the more sophisticated sequence-based methods of template selection at high sequence identities. A subset of 118 targets extending to the lowest levels of sequence similarity was examined, and the HHpred and MQAP methods were found to improve ranking when available templates had 35–40% maximum sequence identity. Structurally defined subsets in general are found to be less discriminative than overall sequence similarity, with the coil residue subset performing equivalently to sequence similarity. Finally, we demonstrate that if models are built and model quality is assessed in combination with the target-template sequence similarity, an extra 7% of best models can be found.
INTRODUCTION
Homology modeling remains the only practical method of predicting protein structure with high accuracy. Despite many years of intense research effort into the prediction of structures from established physical principles, the best way to predict the structure of a protein is still to search for a protein of known structure that is likely to adopt the same fold.[1]
The basis of this method is the observation by Chothia and Lesk [2] that closely homologous proteins almost invariably share the same overall structure, and that the degree of core structural similarity can be quantified as a function of the identity between their sequences. Subsequent work has refined our understanding of this relationship [3–5] to account for improved methods for assessing sequence similarity; however, the overall functional form remains the same.
Generating a template-based model is generally separated into the following steps: template identification, target-template alignment, model generation, and model refinement.[6,7] Experience of blind modeling tasks in the CASP experiments [8–13] has shown that template selection is still the most crucial step in the production of a high-quality model and that only in rare cases can current methods produce models exceeding the template-target similarity.[12,13]
It follows that the two most crucial steps for a high-quality final model are the selection of the best possible template and generation of the correct alignment of the target sequence to the template structure. The question of generating a correct sequence-structure alignment for modeling is a well-studied one, and many techniques have been proposed for this problem.[14–19] Conversely, the question of how to select the best template from a set of alternatives has not previously received much attention.
The reason that this question arises is that although the general form of the sequence-structure relationship is well understood and follows the results of Chothia and Lesk,[2] there is a high variance in the structural similarity (Fig. 1). Once there is a choice of possible
Key words: protein structure prediction; homology modeling; bioinformatics; profile-profile alignment; high-resolution modeling; MQAP.
Grant sponsor: Biosapiens Network of Excellence (funded by the European Commission within its FP6 Programme, under the thematic area Life sciences, Genomics and Biotechnology for Health); Grant number: LSHG-CT-2003-503265.
*Correspondence to: D.T. Jones, Bioinformatics Unit, Department of Computer Science, University College London, London WC1E 6BT, United Kingdom. E-mail: d.jones@cs.ucl.ac.uk
Received 10 October 2006; Revised 23 February 2007; Accepted 20 March 2007
Published online 10 July 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.21531
Proteins 2007; 69:476–485. © 2007 Wiley-Liss, Inc.
Benchmark set
similar and we observed that TMAlign generated very similar alignments and RMSD values to SAP for this set.
Environmental factors
To follow up the suggestion that biophysical factors relating to the crystallization of the protein may be important, we assessed whether differing pH, temperature, quaternary state, resolution, and space group might have an effect on the choice of template. We derived the resolution, pH, temperature, and space group information from the REMARK 2, 200, 200, and 290 records of the original PDB files, respectively. Quaternary structure information was derived from the PQS database.[39]
These factors were then combined with sequence identity values to determine whether any beneficial effects on the ranking were obtained. For quaternary structure and space group information, we determined if the values were identical, different, or missing. Values of 1, 0, and 0.5 were assigned in each case respectively. Temperature and pH values were assessed as the difference between the target and template values. Where one or both values were missing, a value of zero was assigned. Resolution information for the template structure was used to mark each pair. In each case, we added or subtracted (as appropriate) the given value times a weighting factor to the sequence identity score. Weights were varied over a large range (3–4 orders of magnitude) to determine whether any value would effect an improvement.
High resolution model quality assessment
To try to bring in as much structural information as possible to the quality assessment of final models, we combined the following broad range of component features to form a new MQAP called high resolution model quality assessment (MODCHECK-HD):
those found in highly resolved crystal structures. Rather than using fixed bins, probabilities of particular main chain and side chain conformations were obtained by calculating the mean absolute difference across all corresponding angles between the target residue and all of the residues of the same type in the high resolution set. A threshold of 10° was found to be optimal. Any residue in the data set with torsion angles within this threshold was counted as a positive observation. For side chains, both the main chain and side chain torsion angles are compared to ensure that the rotamer probabilities are conditional on the main chain conformation. The logarithms of the observed relative frequencies were used to provide additive energy-like terms.
5. van der Waals interactions: We calculate a standard Lennard-Jones potential for all nonbonded atoms in the model. Attractive and repulsive terms are summed separately and the potential is softened at close atom separations as proposed by Kuhlman et al.[43]
6. Stereochemistry: Finally, we take the summary scores generated by Procheck,[44] which evaluate the overall stereochemical quality of the model.
The above methods produced a total of 30 features. To calculate a single overall score, the 30 terms were combined linearly and assigned variable weights, which were optimized using a nongradient-based pattern search optimization algorithm over the decoy set of Tsai et al.[45] The target function in this case was the SSE (sum of squared error) between the weighted sums of terms and the RMSDs between the decoy structure and the experimentally observed structure. Since this method involves assessment of hydrogen bonding quality, we also assessed models with additional side chain optimization using SCWRL.[46]
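The weight-fitting step for MODCHECK-HD (a linear combination of feature terms whose weights are tuned by a derivative-free pattern search to minimize the SSE against decoy RMSDs) can be illustrated with a minimal sketch. It is not the authors' implementation. The compass-search variant, the function names, the step schedule, and the three-feature toy data are all assumptions made for illustration; the paper uses 30 features and the decoy set of Tsai et al.

```python
from typing import Callable, List, Sequence, Tuple


def weighted_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear combination of the feature terms for one model."""
    return sum(w * f for w, f in zip(weights, features))


def sse(weights: Sequence[float],
        feature_rows: Sequence[Sequence[float]],
        rmsds: Sequence[float]) -> float:
    """Sum of squared errors between weighted sums and observed RMSDs."""
    return sum((weighted_score(weights, row) - r) ** 2
               for row, r in zip(feature_rows, rmsds))


def pattern_search(objective: Callable[[List[float]], float],
                   start: Sequence[float],
                   step: float = 1.0,
                   shrink: float = 0.5,
                   tol: float = 1e-6,
                   max_iter: int = 10000) -> Tuple[List[float], float]:
    """Derivative-free compass search (one simple kind of pattern search).

    Probes +step and -step along each axis in turn, accepts the first
    improving move on that axis, and shrinks the step when a full sweep
    over all axes finds no improvement.
    """
    best = list(start)
    best_val = objective(best)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = best[:]
                trial[i] += delta
                val = objective(trial)
                if val < best_val:
                    best, best_val = trial, val
                    improved = True
                    break
        if not improved:
            step *= shrink
    return best, best_val


# Hypothetical toy data: five "decoys" described by three feature terms each.
feature_rows = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [2.0, 1.0, 0.0],
]
true_weights = [2.0, 1.0, 0.5]  # assumed, used only to synthesize the RMSDs
rmsds = [weighted_score(true_weights, row) for row in feature_rows]

fitted, err = pattern_search(lambda w: sse(w, feature_rows, rmsds),
                             start=[0.0, 0.0, 0.0])
print(fitted, err)
```

Because the objective here is a convex quadratic in the weights, this simple search recovers the synthetic weights. With real, correlated feature terms the result may depend on the starting point and step schedule.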
Table I
Composition of the Dataset
Figure 5
Comparison of BLAST and HHpred performance for template sets with increasing evolutionary distance. Cumulative proportion of the 118-member set of targets with low sequence identity templates (see text) correctly identified with d-RMSD within 1.5 Å is plotted against the maximum sequence identity of any template on the x-axis.
Figure 6
Conservation of accessibility defined sequence subsets as predictors of template similarity. The axes are as in Figure 2. Residues were divided into bins by solvent accessibility in bands of 10%. The conservation of each bin was calculated using a conservation metric based on the BLOSUM 62 matrix. Structural alignments were used to minimize alignment error. For comparison, the performance of the whole sequence is also plotted.
Subset measures
7. Krieger E, Nabuurs SB, Vriend G. Homology modeling. In: Bourne PE, Weissig H, editors. Structural Bioinformatics. New York: Wiley-Liss; 2003. pp 507–521.
8. Mosimann S, Meleshko R, James MNG. A critical assessment of comparative molecular modeling of tertiary structures of proteins. Proteins 1995;23:301–317.
9. Martin ACR, MacArthur MW, Thornton JM. Assessment of comparative modeling in CASP2. Proteins 1997;S1:14–28.
10. Jones TA, Kleywegt GJ. CASP3 comparative modeling evaluation. Proteins 1999;S3:30–46.
11. Tramontano A, Leplae R, Morea V. Analysis and assessment of comparative modeling predictions in CASP4. Proteins 2001;S5:22–38.
12. Tramontano A, Morea V. Assessment of homology-based predictions in CASP5. Proteins 2003;53:352–368.
13. Tress M, Ezkurdia I, Grana O, Lopez G, Valencia A. Assessment of predictions submitted for the CASP6 comparative modelling category. Proteins 2005;S7:27–45.
14. Alexandrov NN, Luethy R. Alignment algorithm for homology modeling and threading. Protein Sci 1998;7:254–258.
15. Sauder JM, Arthur JW, Dunbrack RL. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 2000;40:6–22.
16. Prasad JC, Comeau SR, Vajda S, Camacho CJ. Consensus alignment for reliable framework prediction in homology modeling. Bioinformatics 2003;19:1682–1691.
17. Zachariah MA, Crooks GE, Holbrook SR, Brenner SE. A generalized affine gap model significantly improves protein sequence alignment accuracy. Proteins 2005;58:329–338.
18. Birzele F, Gewehr JE, Zimmer R. QUASAR: scoring and ranking of sequence-structure alignments. Bioinformatics 2005;21:4425–4426.
19. Madhusudhan MS, Marti-Renom MA, Sanchez R, Sali A. Variable gap penalty for protein sequence-structure alignment. Protein Eng Des Sel 2006;19:129–133.
20. Wilson CA, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 2000;297:233–249.
21. Canutescu AA, Dunbrack RL. MolIDE (Molecular integrated development environment): a homology modeling framework you can click with. Bioinformatics 2005;21:2914–2916.
22. Bujnicki JM, Elofsson A, Fischer D, Rychlewski L. LiveBench-1: continuous benchmarking of protein structure prediction servers. Protein Sci 2001;10:352–361.
23. Bujnicki JM, Elofsson A, Fischer D, Rychlewski L. LiveBench-2: large-scale automated evaluation of protein structure prediction servers. Proteins 2001;S5:184–191.
24. Rychlewski L, Fischer D, Elofsson A. LiveBench-6: large-scale automated evaluation of protein structure prediction servers. Proteins 2003;53(Suppl 6):542–547.
25. Rychlewski L, Fischer D. LiveBench-8: the large-scale, continuous assessment of automated protein structure prediction. Protein Sci 2005;14:240–245.
26. Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett C, Marsden R, Grant A, Lee D, Akpor A, Maibaum M, Harrison A, Dallman T, Reeves G, Diboun I, Addou S, Lise S, Johnston C, Sillero A, Thornton J, Orengo C. The CATH domain structure database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res 2005;33:D247–D251.
27. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 2005;33:2302–2309.
28. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–3402.
29. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 1981;147:195–197.
30. Henikoff S, Henikoff JG. Amino-acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992;89:10915–10919.
31. Marti-Renom MA, Madhusudhan MS, Sali A. Alignment of protein sequences by their profiles. Protein Sci 2004;13:1071–1087.
32. Ohlson T, Wallner B, Elofsson A. Profile-profile methods provide improved fold-recognition: a study of different profile-profile alignment methods. Proteins 2004;57:188–197.
33. Soding J. Protein homology detection by HMM-HMM comparison. Bioinformatics 2005;21:951–960.
34. Pettitt CS, McGuffin LJ, Jones DT. Improving sequence-based fold recognition by using 3D model quality assessment. Bioinformatics 2005;21:3509–3515.
35. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993;234:779–815.
36. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983;22:2577–2637.
37. Valdar WSJ. Scoring residue conservation. Proteins 2002;48:227–241.
38. McGuffin LJ, Jones DT. Improvement of the GenTHREADER method for fold recognition. Bioinformatics 2003;19:874–881.
39. Brooksbank C, Camon E, Harris MA, Magrane M, Martin MJ, Mulder N, O'Donovan C, Parkinson H, Tuli MA, Apweiler R, Birney E, Brazma A, Henrick K, Lopez R, Stoesser G, Stoehr P, Cameron G. The European Bioinformatics Institute's data resources. Nucleic Acids Res 2003;31:43–50.
40. McDonald IK, Thornton JM. Satisfying hydrogen bonding potential in proteins. J Mol Biol 1994;238:777–793.
41. Jones DT, Bryson K, Coleman A, McGuffin LJ, Sadowski MI, Sodhi JS, Ward JJ. Prediction of novel and analogous folds using fragment assembly and fold recognition. Proteins 2005;S7:143–151.
42. Lazaridis T, Karplus M. Effective energy function for proteins in solution. Proteins 1999;35:133–152.
43. Kuhlman B, Baker D. Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci USA 2000;97:10383–10388.
44. Laskowski RA, MacArthur MW, Moss DS, Thornton JM. PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Crystallogr 1993;26:283–291.
45. Tsai J, Bonneau R, Morozov AV, Kuhlman B, Rohl CA, Baker D. An improved protein decoy set for testing energy functions for protein structure prediction. Proteins 2003;53:76–87.
46. Canutescu AA, Shelenkov AA, Dunbrack RL. A graph theory algorithm for protein side-chain prediction. Protein Sci 2003;12:2001–2014.
47. Taylor WR. Protein structure comparison using iterated double dynamic programming. Protein Sci 1999;8:654–665.
48. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins 2004;57:702–710.
49. Contreras-Moreira B, Fitzjohn PW, Bates PA. In silico protein recombination: enhancing template and sequence alignment selection for comparative protein modelling. J Mol Biol 2003;328:593–608.
50. Fiser A. Protein structure modeling in the proteomics era. Expert Rev Proteomics 2004;1:97–110.
51. Wang GL, Dunbrack RL. Scoring profile-to-profile sequence alignments. Protein Sci 2004;13:1612–1626.
52. Edgar RC, Sjolander K. A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 2004;20:1301–1308.