
proteins

STRUCTURE • FUNCTION • BIOINFORMATICS

A protein-RNA docking benchmark (I): Nonredundant cases


Amita Barik, Nithin C., Manasa P., and Ranjit Prasad Bahadur*
Department of Biotechnology, Indian Institute of Technology, Kharagpur 721302, India

ABSTRACT

We have developed a nonredundant protein-RNA docking benchmark dataset, derived from the bound and unbound structures available in the Protein Data Bank that involve polypeptide and nucleic acid chains. It consists of nine unbound-unbound cases, where both the protein and the RNA are available in the free form, and 36 unbound-bound cases, where only the protein is available in the free form. The conformational change upon complex formation is calculated by a distance matrix alignment method and, on that basis, the complexes are classified as rigid, semi-flexible, or full flexible. In the rigid-body category, no significant conformational change accompanies complex formation, whereas the full flexible test cases show large domain movements, RNA base flips, etc. The benchmark covers four major groups of RNA, namely tRNA, ribosomal RNA, duplex RNA, and single-stranded RNA. We find that RNA is generally more flexible than the protein in the complexes, and that the interface region is as flexible as the molecule as a whole. The structural diversity of the complexes in the benchmark set should provide a common ground for the development and comparison of protein-RNA docking methods. The benchmark can be freely downloaded from the internet.
© 2012 Wiley Periodicals, Inc.

Proteins 2012; 00:000-000.

Key words: protein-RNA interaction; conformation change; protein-RNA docking.

INTRODUCTION

RNA molecules perform various functions inside the cell, including protein synthesis and targeting, many forms of RNA processing and splicing, RNA editing and modification, and chromosome end maintenance [1]. Most of these processes are governed by the specific recognition of the RNA molecules by partner proteins [2,3]. Although structural studies of protein-RNA complexes have been very active in the last decade, three-dimensional atomic structures of many such complexes are difficult to determine and remain comparatively few. The alternative strategy for determining the three-dimensional structure of a protein-RNA complex is docking. This is a purely predictive method based on the starting coordinates of the individual structures of the receptor and the ligand molecules. The success of this method depends on the inherent character of the biomolecules; the more flexible they are, the more difficult it is to predict their native conformation. The performance of docking algorithms is regularly assessed in CAPRI experiments [4], resulting in tremendous progress in their ability to predict protein-protein complexes [5]. However, testing of these methods on the prediction of protein-nucleic acid complexes, especially protein-RNA complexes, is limited owing to the lack of their three-dimensional atomic structures [6]. To facilitate the development of protein-RNA docking algorithms, we have constructed a benchmark of 45 nonredundant protein-RNA complexes in line with existing protein-protein and protein-DNA docking benchmarks [7,8]. The benchmark contains nine unbound-unbound cases, where unbound conformations are available from the Protein Data Bank (PDB) [9] for both the protein and the RNA, and 36 unbound-bound cases, where only the protein structure is available in the unbound conformation. The scarcity of unbound RNA structures in the PDB is owing to their physicochemical and geometric properties, which make them difficult to crystallize [10]. Moreover, unlike DNA, RNA rarely exists as a long straight duplex. Its tertiary structure may comprise duplexes, single-stranded regions, hairpins, internal loops, bulges, and junctions. This benchmark dataset covers all
Grant sponsors: The DBT, the ISIRD, SRIC of IIT-Kharagpur
*Correspondence to: Ranjit Prasad Bahadur, Department of Biotechnology, Indian Institute of Technology, Kharagpur 721302, India. E-mail: r.bahadur@hijli.iitkgp.ernet.in or ranjitp_bahadur@yahoo.com
Received 24 October 2011; Accepted 23 March 2012
Published online 10 April 2012 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/prot.24083


the major groups of protein-RNA complexes according to the classification of Bahadur et al. [2]. It contains a variety of challenging systems in terms of the size of the interacting surfaces as well as the degree of conformational change upon complex formation. The structural diversity of the complexes in the benchmark set will provide a common ground for comparing different protein-RNA docking algorithms, thereby benefitting the docking community.

MATERIALS AND METHODS

Dataset of the protein-RNA complexes

The PDB was scanned for all entries containing both protein and RNA chains. X-ray structures with resolution better than 3.0 Å and the average structures of NMR ensembles were retained. Entries with protein chains of <30 amino acids or RNA chains of <5 nucleotides were discarded. Redundancy was removed at the level of 35% sequence identity, keeping the structure with the better resolution. This gave us 45 entries for protein-RNA complexes (Table I). The dataset could, in principle, be extended by considering homology models as well as experimental structures. However, we chose not to do so in our benchmark, as errors introduced by the modeling step are difficult to evaluate, and the prediction of the tertiary structure of RNA is still far from reliable. The unbound structures of the protein and the RNA components of the complexes were found in the following manner. For each sequence in the bound structure, a BLAST search was carried out against the nonredundant PDB with sequence identity >90%, requiring a coverage of more than 95% and an E-value close to zero (10^-23). When multiple candidates for an unbound structure were found, the one with the highest sequence similarity and maximum alignment length with the bound structure, or with the best resolution, was kept.
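The selection rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Complex` and `Hit` records, their field names, and the tie-breaking order in `pick_unbound` (identity first, then alignment length, then resolution) are hypothetical and not taken from the authors' pipeline. The 35% redundancy filter and the E-value cutoff are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds quoted in the text.
MAX_RESOLUTION = 3.0      # Å; X-ray structures must be better (lower) than this
MIN_PROTEIN_LENGTH = 30   # amino acids
MIN_RNA_LENGTH = 5        # nucleotides
MIN_IDENTITY = 90.0       # % sequence identity required of an unbound candidate
MIN_COVERAGE = 95.0       # % coverage required of an unbound candidate


@dataclass
class Complex:
    """Hypothetical record for one protein-RNA entry (illustrative fields)."""
    pdb_id: str
    method: str                  # "X-RAY" or "NMR"
    resolution: Optional[float]  # Å; None for NMR ensembles
    protein_length: int
    rna_length: int


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit against the PDB."""
    pdb_id: str
    identity: float
    coverage: float
    alignment_length: int
    resolution: Optional[float]


def keep_complex(entry: Complex) -> bool:
    """Apply the chain-length and resolution criteria to one entry."""
    if entry.protein_length < MIN_PROTEIN_LENGTH:
        return False
    if entry.rna_length < MIN_RNA_LENGTH:
        return False
    if entry.method == "NMR":
        return True  # NMR ensembles are represented by their average structure
    return entry.resolution is not None and entry.resolution < MAX_RESOLUTION


def pick_unbound(hits: List[Hit]) -> Optional[Hit]:
    """Choose one unbound candidate; the ranking order is an assumption."""
    candidates = [h for h in hits
                  if h.identity > MIN_IDENTITY and h.coverage > MIN_COVERAGE]
    if not candidates:
        return None

    def rank(h: Hit):
        res = h.resolution if h.resolution is not None else float("inf")
        return (h.identity, h.alignment_length, -res)

    return max(candidates, key=rank)
```

With these definitions, an X-ray entry at exactly 3.0 Å is rejected, because the text requires a resolution better than 3.0 Å.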
Classification of the complexes

The structural classification of the protein-RNA complexes is taken from Bahadur et al. [2]. The conformational changes between the bound and the unbound states are quantified by the root mean squared displacement (rmsd), calculated using the distance matrix alignment method implemented in DaliLite [11]. For each polypeptide chain, we calculated the displacement of all the equivalent Cα atoms between the superposed bound and unbound structures, and called it c-rmsd. When the unbound structure of the RNA was available, a p-rmsd value was calculated in a similar way over the superposed equivalent backbone phosphorus (P) atoms. We also calculated an interface rmsd (i-rmsd) considering only the equivalent Cα atoms of the interface amino acids and the equivalent P atoms of the interface nucleotides. Depending on the value of i-rmsd, we divided the benchmark into three categories, namely, rigid body (i-rmsd < 1.5 Å), semi-flexible (1.5 Å ≤ i-rmsd < 3.0 Å), and full flexible (i-rmsd ≥ 3.0 Å). The cutoff values are those of the protein-protein docking benchmark, where they are based on the success rate of rigid-body docking [7]. Although we made no such test on the present benchmark, we expect protein-RNA docking procedures to behave similarly. The size of the protein-RNA interfaces was estimated by measuring the solvent accessible surface area (ASA) buried in the contact. We define the buried surface area (B) as the sum of the solvent ASA of the two components less that of the complex. ASA values were measured with the program NACCESS [12], which implements the Lee and Richards algorithm [13] with a probe radius of 1.4 Å and default group radii.

RESULTS AND DISCUSSION

Version 1.0 of the protein-RNA docking benchmark consists of 45 test cases (Table I): (A) 16 complexes with tRNA, (B) 3 with ribosomal proteins, (C) 10 with duplex RNA, and (D) 16 with single-stranded RNA. Among them, nine are unbound-unbound cases, where structures of both the protein and the RNA are available in the free form, and 36 are unbound-bound cases, where only the protein structure is available in the free form. In addition, based on the i-rmsd values, the benchmark set is further divided into three categories (defined in the MATERIALS AND METHODS section). Accordingly, we have 34 rigid-body complexes (R), where the conformational change upon complexation is very low, eight semi-flexible complexes (S), where the interface region undergoes significant conformational change, and three full flexible complexes (X), where the interface undergoes large conformational change. The average i-rmsd of the nine unbound-unbound complexes is 0.97 Å for the protein alone and 2.31 Å with the RNA included. The increase shows the inherent flexibility of the RNA molecules at the binding site, confirmed by the average c-rmsd (1.3 Å) and p-rmsd (4.2 Å). The largest change concerns snRNA bound to the spliceosomal protein (1e7k). Here, the protein interacts extensively with a purine-rich internal loop within the 5′ stem-loop, giving an unusual RNA fold [14] characterized by two tandem sheared G-A base pairs [Fig. 1(A)]. The large conformational change found in the RNA stem-loop is possibly owing to its involvement in a conformational switch that regulates spliceosomal function in vivo [15]. The scarcity of unbound RNA structures restricted us to quantifying the conformational changes only for the protein components in the 36 unbound-bound cases. Here, the average i-rmsd (calculated only over the interface Cα atoms) is 1.3 Å, and is very close to the av-

PROTEINS

Table I
The Benchmark Dataset of Nonredundant Protein-RNA Complexes

Column headings: Complex [PDB id (a), Protein, RNA]; Unbound (PDB id) (a) [Protein, RNA]; Interface area B (b) (Å²); RMSD (Å) [i-rmsd (c), c-rmsd (d), p-rmsd (e)]; Category (f)

1asy (A:R) tRNA (Phe) tRNA (fMet) 1fmt (A) 3cw6 (A) 2940 1.2 1eft (A) 4tna (A) 2890 0.7

Aspartyl-tRNA synthetase

tRNA (Asp)

I. Unbound-Unbound (9) (A) Complexes with tRNA (3) 1eov (A) 2tra (A) 4430 1.5

4.9 2.5 2.3

R R R

1ttt (A:D)

Elongation factor EF-TU

2fmt (A:C)

tRNA-fMet transformylase

1.3 2.3 0.7 1.3 0.9 1.7 3.0 3.7 1.0 4.0 3.0

1dfu (P:MN)

Ribosomal protein L25

5S rRNA

(B) Ribosomal proteins (1) 1b75 (A) 364d (ABC) 1690

5.3

1e7k (A:C)

Spliceosomal protein 15.5K

U4 snRNA

(C) Duplex RNA (1) 2jnb (A) 1mfj (A) 1300

2.1

6.2

1jbs (A:C) SECIS RNA Histone mRNA SRE hairpin RNA 2d3d (A) 2b7g (A) 1zbu (AD) 1ju7 (A) 1lva (A) 2rlu (A)

Sarcin-like cytotoxin restrictocin

29-mer SRD RNA analog

(D) Single-stranded RNA (4) 1aqz (A) 1q9a (A)

1310 940 1760 483

0.7 0.7 0.6 0.9

3.4 4.2 3.6 2.5

R R R R

1wsu (A:E)

Elongation factor SelB


1zbh (AD:E)

3′-Endonuclease Eri1

2b6g (A:B)

Vts1p

0.6 1.9 0.5 0.8 0.6 2.0 0.3 0.9

II. Unbound-Bound (36)
(A) Complexes with tRNA (13)

PDB id        Protein                       RNA            Unbound protein  B (Å²)  i-rmsd (Å)  c-rmsd (Å)  Category
1c0a (A:B)    Aspartyl-tRNA synthetase      tRNA (Asp)     1eqr (A)         4180    1.6         1.6         S
1f7u (A:B)    Arginyl-tRNA synthetase       tRNA (Arg)     1bs2 (A)         5140    2.2         2.0         S
1h4s (AB:T)   Prolyl-tRNA synthetase        tRNA (Pro)     1hc7 (AB)        2480    0.9         1.0         R
1j1u (AA':B)  Tyrosyl-tRNA synthetase       tRNA (Tyr)     1u7d (AB)        2240    0.8         1.3         R
1n78 (A:C)    Glutamyl-tRNA synthetase      tRNA (Glu)     1j09 (A)         4510    1.4         1.9         R
1qf6 (A:B)    Threonyl-tRNA synthetase      tRNA (Thr)     1evl (A)         4230    0.7         0.8         R
1qtq (A:B)    Glutaminyl-tRNA synthetase    tRNA (Gln)     1nyl (A)         5200    1.8         1.6         S
1ser (AB:T)   Seryl-tRNA synthetase         tRNA (Ser)     1ses (AB)        2290    2.4         1.7         S
1u0b (B:A)    Cysteinyl-tRNA synthetase     tRNA (Cys)     1li5 (B)         4560    1.0         0.7         R
2azx (AA':C)  Tryptophanyl-tRNA synthetase  tRNA (Trp)     1r6u (AB)        2130    0.6         0.8         R
2bte (A:B)    Leucyl-tRNA synthetase        tRNA (Leu)     1obc (A)         3430    1.3         1.2         R
2drb (A:B)    CCA-adding enzyme             tRNA (35-mer)  1uet (A)         3200    1.8         1.1         S
2fk6 (A:R)    RNase Z                       tRNA (Thr)     1y44 (A)         1530    0.8         0.7         R

Table I (Continued)

Column headings: Unbound (PDB id) (a); RMSD (Å) [i-rmsd (c), c-rmsd (d), p-rmsd (e)]; Category (f)
0.4 5.1 3.5 3.3 0.6 1.3 0.6 1.0 0.8 1.5 1.0 2610 1770 1290 2110 3240 1360 720 2320 2240 4040 2000 4160 1.3 0.9 0.4 0.7 0.9 0.2 1.6 0.8 0.8 0.7 0.9 0.9 3.5 2.0 0.7 1.4 0.6 1.3 1.0 1.9 1.0 1.9 1.2 0.6 0.8 0.5 0.2 1.0 1.1 0.9 1.4 1.3 1.6 0.3 6.7
R X X X R R R R R S R R R R R R R R R R R R S

PDB id box H/ACA sRNA mRNA (C) Duplex RNA (9) 1aro (P) 3500 3210 1970 5190 990 2620 3280 3080 1830 1ze1 (A) 1u09 (A) 1yvr (A) 1zbf (A) 2b9z (AB) 1jfz (AB) 1yvu (A) 1r0v (AB) (B) Ribosomal proteins (2) 1xbi (A) 1ad2 (A) 1200 2330

Protein

RNA

Protein

RNA

Interface area Bb(2)

1sds (C:FF'0 ) 2hw8 (A:B)

Ribosomal protein L7Ae Ribosomal protein L1

1msw (D:R)

RNA polymerase, phage T7

1r3e (A:CDE) 1wne (A:BC)

Pseudo-U synthetase TruB RNA polymerase, FMD virus

1yvp (B:EFH) 1zbi (A:CD) 2az0 (AB:CD) 2ez6 (AB:CD) 2f8s (A:CD) 2gjw (AB:EFH) T stem-loop RNA Hairpin ribozyme Uridine heptamer Nre1-19 RNA dsRNA Hut mRNA Cytosine-rich RNA BoxC rRNA siRNA 23S rRNA Viral genomic RNA Single-stranded RNA (D) Single-stranded RNA (12) 1r3f (A) 1oia (A) 1h64 (AM) 1m8z (A) 1muk (A) 1wpv (A) 1a8v (B) 1k0r (A) 1w9h (A) 1uwv (A) 2qvj (A) 2ix0 (A)

Ro autoantigen RNase H Flock House virus protein B2 RNase III Argonaute protein Splicing endonuclease

17-nucleotide RNA transcript tRNA fragment Template-primer RNA decanucleotide Y RNA A form RNA siRNA dsRNA siRNA bulge-helix-bulge RNA


1k8w (A:B) 1m5o (C:B) 1m8v (AM:O) 1m8w (A:C) 1n35 (A:BC) 1wpu (A:C) 2a8v (B:E) 2asb (A:B) 2bgg (A:PQ) 2bh2 (A:C) 2gic (A:R) 2ix1 (A:B)

Pseudo-U synthetase TruB U1 SnpA ribozyme Sm -protein PH domain RNA polymerase lambda3, reovirus Hutp antiterminator RHO transcription termination factor NusA antiterminator PIWI protein Methyltransferase RumA VSV nucleocapsid RNase II

a Four-letter PDB code of the protein-RNA complexes used in the data set with the chain ID(s) of the protein and the RNA molecules in parentheses. For structures solved with NMR spectroscopy (shown in bold), the closest to the average structure of the entire ensemble was considered as the reference structure for structural alignment. Symmetry-related chains are primed (e.g., A′ in 1j1u).
b Surface area buried between protein and RNA upon complexation.
RMSD values were calculated over the superposed equivalent Cα (or P) atoms between the bound and unbound structures, where:
c i-rmsd is calculated considering only the interface Cα atoms, and the values in italics include the phosphorus atoms of the interface nucleotides when the corresponding free RNA structure is available;
d c-rmsd is calculated over all the Cα atoms of a given protein chain;
e p-rmsd is calculated over all the phosphorus atoms of a given RNA chain. In the case of NMR structures, RMSD values were calculated based on the average structure of the entire ensemble.
f Different categories according to the expected difficulty for the protein-RNA docking algorithm: (R) rigid body, (S) semi-flexible, and (X) full flexible.
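The quantities behind the RMSD, B, and Category columns of Table I follow simple definitions given in MATERIALS AND METHODS, and a minimal Python sketch may help make them concrete. It assumes the equivalent atoms have already been matched and superposed (the paper uses DaliLite for that step, which is not reproduced here), and the function names are illustrative rather than the authors' own.

```python
import math

RIGID_CUTOFF = 1.5     # Å; below this, the complex is rigid body (R)
FLEXIBLE_CUTOFF = 3.0  # Å; at or above this, the complex is full flexible (X)


def rmsd(coords_a, coords_b):
    """Root mean squared displacement over equivalent, already superposed atoms.

    Each argument is a list of (x, y, z) tuples in the same atom order,
    e.g. Cα atoms for c-rmsd or phosphorus atoms for p-rmsd.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def classify(i_rmsd):
    """Map an interface rmsd (Å) to the difficulty category R, S, or X."""
    if i_rmsd < RIGID_CUTOFF:
        return "R"
    if i_rmsd < FLEXIBLE_CUTOFF:
        return "S"
    return "X"


def buried_surface_area(asa_protein, asa_rna, asa_complex):
    """B: sum of the ASA of the two free components less that of the complex."""
    return asa_protein + asa_rna - asa_complex
```

For example, `classify(1.6)` gives "S" and `classify(0.9)` gives "R", in line with the categories listed for 1c0a and 1h4s in Table I.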


erage c-rmsd of 1.4 Å, with a correlation coefficient of 0.90 between them. This implies that the RNA binding site has the same flexibility as the molecule as a whole. This is evident in the superposed bound and unbound structures of pseudouridine synthase (Fig. 1(B), PDB id:

Table II
Statistics of the Conformation Change in Different Classes

Structural class   Average B (Å²)   Average rmsd (Å)
                                    i-rmsd (a)   c-rmsd   p-rmsd
tRNA               3461             1.3          1.2      3.4
Ribosomal          1740             2.8          3.3      5.3 (b)
Duplex             2697             1.5          1.5      6.2 (b)
Single-stranded    2022             0.8          1.0      3.4

a The average in different classes is calculated only over the interface Cα atoms.
b Only one case is available in the data set.
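As a worked check of how a class average in Table II relates to the per-complex values in Table I, the short Python sketch below recomputes the average B for the tRNA class. It assumes the class average is a plain arithmetic mean over all 16 tRNA complexes, which the text does not state explicitly; the B values are copied from the tRNA sections of Table I.

```python
from statistics import mean

# Interface areas B (Å²) of the tRNA complexes listed in Table I.
TRNA_B_UNBOUND_UNBOUND = [4430, 2890, 2940]
TRNA_B_UNBOUND_BOUND = [4180, 5140, 2480, 2240, 4510, 4230, 5200,
                        2290, 4560, 2130, 3430, 3200, 1530]


def class_average(values):
    """Arithmetic mean of a per-complex quantity over one structural class."""
    return mean(values)


average_b = class_average(TRNA_B_UNBOUND_UNBOUND + TRNA_B_UNBOUND_BOUND)
print(round(average_b))  # 3461
```

The result agrees with the tRNA entry for average B in Table II.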

1r3e). The enzyme undergoes a significant conformational change (i-rmsd 3.4 Å) on binding to its substrate RNA molecule. The largest change concerns a thumb loop, which goes from disordered to ordered as it interacts with the RNA hairpin loop [16]. On the other hand, the ribosomal protein L1 undergoes large domain movements upon binding to RNA (Table I and Fig. 1(C), PDB id: 2hw8) through an induced fit mechanism [17,18], which affects the whole protein as well as the region in contact with the RNA. Among the different structural classes of protein-RNA complexes, the proteins that bind single-stranded RNA are the least flexible, and those binding ribosomal RNAs are the most flexible (Table II). The average i-rmsd and c-rmsd values are similar except in proteins binding ribosomal RNAs, where the interface is less mobile than the chain as a whole. The interface area (B) in this dataset covers a wide range, from 483 to 5200 Å² (Table I). The smallest interface occurs in the NMR structure of the Vts1p-SRE complex (483 Å²), where the Sterile Alpha Motif domain of Vts1p interacts with the 19-nucleotide SRE RNA, and the high degree of solvent exchange prevented the observation of NOEs from the guanidino and amino groups of arginine and lysine [19]. On the other hand, the largest interface is observed in glutaminyl-tRNA synthetase bound to its tRNA (5200 Å²). The correlation between the interface area and the i-rmsd (R² = 0.12) or the c-rmsd (R² = 0.05) is very poor. This suggests that flexibility in the protein-RNA recognition process is independent of the size of the interacting surfaces. The same is observed in protein-protein and protein-DNA recognition processes [20,21].

CONCLUSIONS

The protein-RNA docking benchmark version 1.0 consists of a wide variety of complexes, including examples of rigid-body association as well as flexible movements of both polypeptide and nucleic acid chains upon complexation. It should provide the docking community with a starting point to test their prediction methods for protein-RNA complexes. The major challenge in making this dataset is the scarcity of RNA structures in the free form. We believe that the present growing interest in the field

Figure 1
Test cases in the protein-RNA benchmark dataset. (A) The spliceosomal 15.5-kD protein binding a U4 snRNA fragment (PDB id: 1e7k, structural class: C, difficulty: R). (B) Pseudouridine synthase TruB binding its substrate tRNA (PDB id: 1r3e, structural class: C, difficulty: X). (C) Ribosomal protein L1-mRNA complex (PDB id: 2hw8, structural class: B, difficulty: X). In each case, the bound form of the protein is in green, the unbound form in cyan, and RNA in grey.


of RNA research will yield many new structures of protein-RNA complexes and their components, and allow us to update the benchmark in a timely manner. Version 1.0 can be downloaded from http://www.facweb.iitkgp.ernet.in/rbahadur/benchmark.html.

ACKNOWLEDGMENTS

The authors are grateful to J. Janin for critical reading of the manuscript. A.B. is thankful to IIT-Kharagpur for her research fellowship. N.C. and M.P. acknowledge the support from the DBT. R.P.B. is thankful to the ISIRD, SRIC of IIT-Kharagpur for the startup grant. The dataset is hosted on the web server of CIC at IIT-Kharagpur.

REFERENCES
1. Doudna JA. Structural genomics of RNA. Nat Struct Mol Biol 2000;7:954-956.
2. Bahadur RP, Zacharias M, Janin J. Dissecting protein-RNA recognition sites. Nucleic Acids Res 2008;36:2705-2716.
3. Gupta A, Gribskov M. The role of RNA sequence and structure in RNA-protein interactions. J Mol Biol 2011;409:574-587.
4. Janin J. The targets of CAPRI Rounds 13-19. Proteins 2010;78:3067-3072.
5. Ritchie DW. Recent progress and future directions in protein-protein docking. Curr Prot Pept Sci 2008;9:1-15.
6. Perez-Cano L, Fernandez-Recio J. Optimal Protein-RNA Area, OPRA: a propensity-based method to identify RNA-binding sites on proteins. Proteins 2010;78:25-35.
7. Hwang H, Pierce B, Mintseris J, Janin J, Weng Z. Protein-protein docking benchmark version 3.0. Proteins 2008;73:705-709.
8. Van Dijk M, Bonvin AMJJ. A protein-DNA docking benchmark. Nucleic Acids Res 2008;36:e88.
9. Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, Fagan P, Marvin J, Ravichandran V, Schneider B, Thanki N, Padilla D, Weissig H, Westbrook JD, Zardecki C. The Protein Data Bank. Acta Cryst 2002;D58:899-907.
10. Ke A, Doudna JA. Crystallization of RNA and RNA-protein complexes. Methods 2004;34:408-414.
11. Holm L, Park J. DaliLite workbench for protein structure comparison. Bioinformatics 2000;16:566-567.
12. Hubbard SJ. NACCESS: program for calculating accessibilities. Department of Biochemistry and Molecular Biology, University College of London, 1992.
13. Lee B, Richards FM. The interpretation of protein structures: estimation of static accessibility. J Mol Biol 1971;55:379-400.
14. Vidovic I, Nottrott S, Hartmuth K, Luhrmann R, Ficner R. Crystal structure of the spliceosomal 15.5kD protein bound to a U4 snRNA fragment. Mol Cell 2000;6:1331-1342.
15. Comolli LR, Ulyanov NB, Soto AM, Marky LA, James TL, Gmeiner WH. NMR structure of the 3′ stem-loop from human U4 snRNA. Nucleic Acids Res 2002;30:4371-4379.
16. Pan H, Agarwalla S, Moustakas DT, Finer-Moore J, Stroud RM. Structure of tRNA pseudouridine synthase TruB and its RNA complex: RNA recognition through a combination of rigid docking and induced fit. Proc Natl Acad Sci USA 2003;100:12648-12653.
17. Tishchenko S, Nikonova E, Nikulin A, Nevskaya N, Volchkov S, Piendl W, Garber M, Nikonov S. Structure of the ribosomal protein L1-mRNA complex at 2.1 Å resolution: common features of crystal packing of L1-RNA complexes. Acta Cryst 2006;D62:1545-1554.
18. Unge J, Al-Karadaghi S, Liljas A, Jonsson BH, Eliseikina I, Ossina N, Nevskaya N, Fomenkova N, Garber M, Nikonov S. A mutant form of the ribosomal protein L1 reveals conformational flexibility. FEBS Lett 1997;411:53-59.
19. Johnson PE, Donaldson LW. RNA recognition by the Vts1p SAM domain. Nat Struct Mol Biol 2006;13:177-178.
20. Janin J, Bahadur RP, Chakrabarti P. Protein-protein interaction and quaternary structure. Quart Rev Biophys 2008;41:133-180.
21. Janin J, Bahadur RP. Relating macromolecular function and association: the structural basis of protein-DNA and RNA recognition. Cell Mol Bioeng 2008;1:327-338.
