
http://genomebiology.com/2001/2/3/research/0007.

Research

The DNA-repair protein AlkB, EGL-9, and leprecan define new families of 2-oxoglutarate- and iron-dependent dioxygenases
L Aravind and Eugene V Koonin
Address: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA. Correspondence: L Aravind. E-mail: aravind@ncbi.nlm.nih.gov


Published: 19 January 2001 Genome Biology 2001, 2(3):research0007.1-0007.8
The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2001/2/3/research/0007
© 2001 Aravind and Koonin, licensee BioMed Central Ltd (Print ISSN 1465-6906; Online ISSN 1465-6914)

Received: 17 November 2000 Revised: 14 December 2000 Accepted: 12 January 2001


Abstract
Background: Protein fold recognition using sequence profile searches frequently allows prediction of the structure and biochemical mechanisms of proteins with an important biological function but unknown biochemical activity. Here we describe such predictions resulting from an analysis of the 2-oxoglutarate (2OG) and Fe(II)-dependent oxygenases, a class of enzymes that are widespread in eukaryotes and bacteria and catalyze a variety of reactions typically involving the oxidation of an organic substrate using a dioxygen molecule.

Results: We employ sequence profile analysis to show that the DNA repair protein AlkB, the extracellular matrix protein leprecan, the disease-resistance-related protein EGL-9 and several uncharacterized proteins define novel families of enzymes of the 2OG-Fe(II) oxygenase superfamily. The identification of AlkB as a member of the 2OG-Fe(II) oxygenase superfamily suggests that this protein catalyzes oxidative detoxification of alkylated bases. More distant homologs of AlkB were detected in eukaryotes and in plant RNA viruses, leading to the hypothesis that these proteins might be involved in RNA demethylation. The EGL-9 protein from Caenorhabditis elegans is necessary for normal muscle function and its inactivation results in resistance against paralysis induced by the Pseudomonas aeruginosa toxin. EGL-9 and leprecan are predicted to be novel protein hydroxylases that might be involved in the generation of substrates for protein glycosylation.

Conclusions: Here, using sequence profile searches, we show that several previously undetected protein families contain the 2OG-Fe(II) oxygenase fold. This allows us to predict the catalytic activity for a wide range of biologically important, but biochemically uncharacterized proteins from eukaryotes and bacteria.


Background
2-Oxoglutarate (2OG)- and Fe(II)-dependent dioxygenases are widespread in eukaryotes and bacteria and catalyze a variety of reactions typically involving the oxidation of an organic substrate using a dioxygen molecule [1,2]. One well-studied reaction catalyzed by such enzymes is the hydroxylation of proline and lysine sidechains in collagen and other animal glycoproteins [3-5]. In plants, enzymes of this family

catalyze the formation of the plant hormone ethylene by oxidative desaturation of 1-aminocyclopropane-1-carboxylate, and catalyze the hydroxylation and desaturation steps in the synthesis of other plant hormones, pigments and metabolites such as gibberellins, anthocyanidins and flavones [1,6,7]. In bacteria and fungi, several members of this family participate in the desaturative cyclization and oxidative ring expansion reactions in the biosynthesis of antibiotics such as


penicillin and cephalosporin [8-10]. The details of the catalytic mechanism of these enzymes have been revealed by determination of the crystal structures of isopenicillin N synthase (IPNS), deacetoxycephalosporin C synthase (DAOCS) and clavaminic acid synthase (CAS) [8-11]. These structures showed that the catalytic core of the proteins consists of a double-stranded β-helix (DSBH) fold containing a HX[DE] dyad (where X is any amino acid) and a conserved carboxy-terminal histidine, which together chelate a single iron atom. The substrates are bound within a spacious cavity formed by the interior of the DSBH (see the Structural Classification of Proteins (SCOP) [12]).

We use sequence profile analysis [13,14] to show that the DNA-repair protein AlkB, the extracellular matrix protein leprecan and the disease-resistance-related protein EGL-9 define new families of the 2OG-Fe(II) dioxygenase superfamily. AlkB is widely represented in bacteria and eukaryotes and has an important role in countering the toxic DNA modifications caused by alkylating agents in both Escherichia coli and Homo sapiens [15-17]. Despite considerable effort, the precise biochemical mechanisms of its action in DNA repair remain unknown. Recent studies have shown that AlkB is required for specifically processing lesions resulting from the alkylation of single-stranded (ss) DNA [18]. Our findings predict an unusual role for this enzyme in oxidative detoxification of DNA damage. The EGL-9 protein from Caenorhabditis elegans is necessary for normal muscle function, and its inactivation results in strong resistance to paralysis induced by the Pseudomonas aeruginosa toxin [19]. We predict that EGL-9 is a novel hydroxylase that could elicit its action through the modification of sidechains of intracellular proteins. Similarly, we show that the animal extracellular matrix protein leprecan [20] defines a hitherto unknown family of protein hydroxylases that might be involved in the generation of substrates for protein glycosylation.
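
As a minimal sketch of how the iron-binding signature described above (an HX[DE] dyad with a histidine further toward the carboxyl terminus) could be screened for in a sequence, the Python snippet below scans for HX[DE] tripeptides and checks for a downstream histidine. The function names are hypothetical and are not part of the analysis reported here; the example fragment is copied from the E. coli AlkB row of the alignment in Figure 1. A motif this short matches by chance in many unrelated proteins, so a hit is at most a prompt for profile-based analysis, not evidence of homology.

```python
import re
from typing import List, Tuple

# H-X-[DE] dyad. The lookahead lets overlapping candidates be reported.
DYAD = re.compile(r"(?=(H.[DE]))")


def find_dyads(seq: str) -> List[Tuple[int, str]]:
    """Return (0-based offset, tripeptide) for every HX[DE] candidate."""
    return [(m.start(), m.group(1)) for m in DYAD.finditer(seq.upper())]


def has_downstream_his(seq: str, dyad_offset: int) -> bool:
    """True if a histidine lies carboxy-terminal to the dyad at dyad_offset."""
    return "H" in seq.upper()[dyad_offset + 3:]


if __name__ == "__main__":
    # Fragment around the dyad of E. coli AlkB, taken from Figure 1.
    fragment = "PGAKLSLHQDKDE"
    for offset, tripeptide in find_dyads(fragment):
        print(offset, tripeptide, has_downstream_his(fragment, offset))
```

For the fragment shown, `find_dyads` returns `[(7, 'HQD')]`; the downstream-histidine check is false here only because the fragment ends before the conserved carboxy-terminal histidine.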

Results and discussion


The 2OG-Fe(II) dioxygenase protein superfamily: classification and functional prediction
The Non-redundant Protein Sequence Database (NCBI) [21] was searched using the PSI-BLAST program [22] run to convergence, with a profile-inclusion threshold of 0.01 and AlkB protein sequences from various organisms as queries. In addition to the AlkB orthologs, these searches retrieved from the database, with statistically significant expectation (e) values, several other more distant homologs of AlkB, including uncharacterized eukaryotic proteins and fragments of the polyproteins of plant RNA viruses from the carla-, tricho- and potexvirus families. Examples of homologs found include: Leishmania L3377.4, iteration 5, e-value = 8 × 10⁻⁷; Drosophila CG17807, iteration 3, e-value = 4 × 10⁻⁶; papaya mosaic virus, iteration 3, e-value = 2 × 10⁻⁴. Further iterations of the search using each of the detected proteins as a new query resulted in the detection of several more eukaryotic proteins, including EGL-9 and leprecan, several uncharacterized bacterial proteins and prolyl and lysyl hydroxylases. Finally, another iteration of database searches initiated with the sequences of bacterial proteins, typified by E. coli YbiX, resulted in the unification of these proteins with plant dioxygenases such as leucoanthocyanidin oxidase and gibberellin-20 oxidase. In this context, it should be noted that the DNA-repair proteins typified by E. coli AlkB are unrelated to the alkane omega-hydroxylase typified by the

Figure 1 (see pages 3 and 4) Multiple sequence alignment of the 2OG-Fe(II) dioxygenase superfamily. Individual protein families are separated by blank lines and a brief description of each family is given to the right of the alignment. The numbers at the ends of the alignment indicate the position of the first and last of the aligned residues in the respective protein sequences. The consensus secondary structure is shown above the alignment in uppercase letters. It was derived by taking those elements that are shared by the predicted structures of individual families and the experimentally determined structures; H indicates α helix and E indicates extended conformation (β strand). The lowercase letters represent extensions of the secondary structure elements that are seen in some, but not all, members of the superfamily. The conserved amino-terminal extensions that are specific only to a given family are separated from the rest of the alignment by vertical lines. The coloring of the alignment columns is according to the 85% consensus that is shown underneath the alignment and includes the following categories of amino acid residues: h, hydrophobic; l, aliphatic; a, aromatic (Y, F, W, H, L, I, V, M, A, all shaded yellow); s, small (S, A, G, T, V, P, N, H, D, shaded blue); b, big (K, R, E, Q, W, F, Y, L, M, I, shaded gray); +, positively charged (K, R, H; colored magenta). The (predicted) catalytic residues are indicated by asterisks and with reverse red shading. The proteins are designated by the protein/gene name, the species abbreviation and the gene identification (GI) number. Protein abbreviations are: CAS, clavaminic acid synthase; DAOCS, deacetoxycephalosporin C synthetase; EFE, ethylene-forming enzyme; FLAS, flavonol synthase; Ga20Ox, gibberellin 20-oxidase; IPNS, isopenicillin N synthase; LDOX, leucoanthocyanidin hydroxylase; Lep, leprecan; P4HA, prolyl-4-hydroxylase; PLO, lysyl hydroxylase; SanF and SanC, enzymes involved in nikkomycin biosynthesis.
The remaining names are the standard names of the genes that encode the respective proteins. Species abbreviations: At, Arabidopsis thaliana; Bb, Borrelia burgdorferi; Cc, Caulobacter crescentus; Ce, Caenorhabditis elegans; Ci, Ciona intestinalis; Dm, Drosophila melanogaster; Ec, Escherichia coli; Em, Emericella nidulans; Hs, Homo sapiens; Lc, Lysobacter lactamgenus; Le, Lycopersicon esculentum; Mtu, Mycobacterium tuberculosis; Nc, Neurospora crassa; Pa, Pseudomonas aeruginosa; Pet, Petunia hybrida; Rr, Rattus rattus; Sc, Saccharomyces cerevisiae; Sp, Schizosaccharomyces pombe; Sot, Solanum tuberosum; Scoe, Streptomyces coelicolor; Scan, Streptomyces ansochromogenes; Scla, Streptomyces clavuligerus; Ssp, Synechocystis; Vc, Vibrio cholerae; ASPV, apple stem pitting virus; ACLSV, apple chlorotic leaf spot virus; BSV, blueberry scorch virus; GLV, garlic latent virus; GVA, grapevine virus A; PBCV, Paramecium bursaria chlorella virus; PMV, papaya mosaic virus; SHVX, shallot virus X.
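
The 85% consensus described in the caption can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the procedure used to produce the figure: the residue sets are copied from the caption, the yellow-shaded set is assigned to the symbol `h` because the caption lists a single set for the h/l/a group, gaps count toward the column total, and the order in which classes are tried (smallest set first) is a hypothetical choice.

```python
from typing import Dict, FrozenSet, List

# Residue classes as listed in the Figure 1 caption. Mapping the
# yellow-shaded set to "h" is an assumption (see the text above).
CLASSES: Dict[str, FrozenSet[str]] = {
    "+": frozenset("KRH"),
    "s": frozenset("SAGTVPNHD"),
    "h": frozenset("YFWHLIVMA"),
    "b": frozenset("KREQWFYLMI"),
}


def column_consensus(column: str, threshold: float = 0.85) -> str:
    """Return the consensus symbol for one alignment column."""
    n = len(column)
    if n == 0:
        return "."
    # A single residue shared by at least `threshold` of the rows wins.
    for residue in set(column) - {"-"}:
        if column.count(residue) / n >= threshold:
            return residue
    # Otherwise try the classes, smallest set first (hypothetical priority).
    for symbol, members in sorted(CLASSES.items(), key=lambda kv: len(kv[1])):
        if sum(1 for r in column if r in members) / n >= threshold:
            return symbol
    return "."


def consensus(alignment: List[str], threshold: float = 0.85) -> str:
    """Consensus string for a list of equal-length aligned sequences."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return "".join(
        column_consensus("".join(col), threshold) for col in zip(*alignment)
    )


if __name__ == "__main__":
    # Toy alignment, not taken from Figure 1.
    print(consensus(["HAD", "HVD", "HLD", "HIE"]))
```

On the toy alignment the result is `Hh.`: the first column is an invariant histidine, the second is uniformly hydrophobic, and the third falls below the threshold for every residue and class.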


Secondary Structure: .hhhhhhHHHHHHHHHHHhhhhhhhhhhhh........EEEEEEEEE.................eeeeeee..

YBIX_Ec_3025022 23 PQDVARFREQLEQAEWVDGRVTTGAQGAQVKNNQQVDT|RSTLYAALQNEVLNAVNQHALFFAAA-------LPRTLSTPLFNRYQ-------------NNETYGFHVDGAV
PA4515_Pa_9950756 11 AEEVSRIRAALEQAEWADGKATAGYQSAKAKHNLQLPQ|DHPLAREIGEAMLQRLWNHPLFMSAA-------LPLKVFPPLFNCYT-------------GGGSFDFHIDNAV
Ybix_Bsep_5777381 11 PAEAAQIRARLEAADWVDGKVTAGYQSAQVKHNRQLSE|QHPLAQELGGLILQRLAANNLFMSAA-------LPRKIFPPLFNRYE-------------GGEAFGYHVDNAL
XF0598_Xf_9105464 11 RTQATSMQERLAAANWTDGRETVGPQGAQVKHNLQLPE|TSPLRQELGNEILDALARSPLYFAAT-------LPLRTLPPRFNCYQ-------------ENHQYGFHVDGAV

Lep_Rr_5805194 497 SPHTPNEKFYGVTVLKALKLGQEGK--VPLQS|AHMYYNVTEKVRRVMESYF-------------RL-DTPLYFSYSHLVCRTAIE-ESQAERKDSSHPVHVDNCI
AK025841_Hs_10438479 159 SPHTPNEKFYGVTVFKALKLGQEGK--VPLQS|AHLYYNVTEKVRRIMESYF-------------RL-DTPLYFSYSHLVCRTAIE-EVQAERKDDSHPVHVDNCI
FLJ10718_Hs_8922619 317 SPHTPNEKFEGATVLKALKSGYEGR--VPLKS|ARLFYDISEKARRIVESYF-------------ML-NSTLYFSYTHMVCRTALS-GQQDRRNDLSHPIHADNCL
Lep_Ci_9229924 216 SPHSEHELFQGMTVYLAAKLAEKGK--VPPQT|AALYYKLSEEARLQVKMYF-------------KL-TQELYFDYTHLVCRTTVK-GKPVKRTDLSHPVHSDNCL
AK025875_Hs_10438524 146 DLHSGALSVGKHFVNLYRYFGDKIQNIFSEED|FRLYREVRQKVQLTIAEAF-------------GISASSLHLTKPTFFSRINST-EARTAH-DEYWHAHVDKVT

EGL-9_Ce_5923812 418 FHIKDIRSDHIYWYDGYDGR|AKDAATVRLLISMIDSVIQHFKKRIDHDIG------GRSRAMLAIYP------------GNGTRYVKHVDNPV
AK025273_Hs_10437756 68 VSKRHLRGDQITWIGGNEEG|---CEAISFLLSLIDRLVLYCGSRLGKYYV-----KERSKAMVACYP------------GNGTGYVRHVDNPN
CG1114_Dm_7296772 97 VRGDKIRGDKIKWVGGNEPG|---CSNVWYLTNQIDSVVYRVNTMKDNGILGNYHIRERTRAMVACYP------------GSGTHYVMHVDNPQ
VCA0949_Vc_9658386 53 QRAADIRSDKIQWLDLSMGQ|PVQDYLERMEQIRCEVNRHFFLGL-----------FEYEGAHFAKYE-------------AGDFYLKHLDSFR
PA0310_Pa_9946156 85 VIREGIRGDLTQWLEPGESE|ACDEYLGVMDSLRQALNASLFLGL-----------EDFEGCHFALYP-------------PGAYYQKHVDRFR

P4HA_Hs_4505565 360 LSRATVHDPE--TGKLTTAQYRVSKSAWLSGYENP|-----VVSRINMRIQDLTGLDV-------------STAEELQVANYG-------------VGGQYEPHFDFAR
CG9728_Dm_7301967 798 MHRSTVNPLP--GGQLKKSAFRVSKNAWLAYESHP|-----TMVGMLRDLKDATGLDT-------------TFCEQLQVANYG-------------VGGHYEPHWDFFR
CG1546_Dm_10726875 351 MVRSAVAGS---GGEGTVRDLRVSQQTWLDYKS-P|-----VMNSVGRIIQFVSGFDM-------------AGAEHMQVANYG-------------VGGQYEPHPDYF
P4HA_Ce_1709529 352 LARATVHDSV--TGKLVTATYRISKSAWLKEWEGD|-----VVETVNKRIGYMTNLEM-------------ETAEELQIANYG-------------IGGHYDPHFDHAK
P4HA_Dm_7301948 141 FRRATVQNSV--TGALETANYRISKSAWLKTQEDR|-----VIETVVQRTADMTGLDM-------------DSAEELQVVNYG-------------IGGHYEPHFDFAR
CG9713_Dm_7301967 354 LKRATVYQAS--SGRNEVVKTRTSKVAWFPDGYNP|-----LTVRLNARISDMTGFNL-------------YGSEMLQLMNYG-------------LGGHYDQHYDFFN
C14E2.4_Ce_1166595 116 MNEQKVVRDD--GEIAYSTYRQANGTITPAHSHAE|-----AQSLMDTATQLLPVFDF-------------QYTEQISALSYI-------------KGGHYALHTDFLT
Y43F8B.4.b_Ce_3947627 347 FEEQLVVNDD--GNDIVSKIRRANGTQVFHEDHPA|-----ARSIWDTAKNLLPNLNF-------------KTAEDILALSYN-------------PGGHYAAHHDYLL
T20B3.7_Ce_7508029 39 LEIQKTS-DF--GTSIETTHRRANGSFIPPEDSNV|-----TVEIKMQAQKRIPGLNL-------------TVAEHFSALSYL-------------PGGHYAVHYDYLD
CG9708_Dm_10726879 1134 LERAKVFRVE--KGSDEIDPSRSADGAWLPHQNID|PDDLEVLNRIGRRIEDMTGLNT-------------RSGSKMQFLKYG-------------FGGHFVPHYDYFN
P4HA_At_6437556 70 LQRSAVADND--NGESQVSDVRTSSGTFISKGKDP|-----IVSGIEDKLSTWTFLPK-------------ENGEDLQVLRYE-------------HGQKYDAHFDYF
P4AH_PBCV_9631654 76 LIKSEVGGATENDPIKLDPKSRNSEQTWFMPGEHE|VIDKIQKKTREFLNSKKHCIDK-------------YNFEDVQVARYK-------------PGQYYYHHYDGDD
B12F1.100_Nc_9368561 314 YLPDAPIKED--GSES---SILAHNVYWIVDQTFH|DALW-ERVRPFIPTHDAGRKAR-------------GINRRFRVYRYV-------------PGAEYRCHFDGAW

PLOD_Ce_6093732 565 ERFCEELIEEMEGFGRWSDGSNNDKRLAGGYENVPTRDIHMNQVGF|ER-QWLYFMDTYVRPVQEKTFIGYYHQ-------PVESNMMFVVRYK-----------PEEQPSLRPHHDAST
PLO_Dm_7294743 556 DAFCDDLVAIMEAHNGWSDGSNNDNRLEGGYEAVPTRDIHMKQVGL|ER-LYLKFLQMFVRPLQERAFTGYFHN-------PPRALMNFMVRYR-----------PDEQPSLRPHHDSST
PLO2_Hs_6093730 573 EKACDELVEEMEHYGKWSGGKHHDSRISGGYENVPTDDIHMKQVDL|EN-VWLDFIREFIAPVTLKVFAGYYT--------KGFALLNFVVKYS-----------PERQRSLRPHHDAST
PLO1_Hs_400205 563 EVACDELVEEMEHFGQWSLGNNKDNRIQGGYENVPTIDIHMNQIGF|ER-EWHKFLLEYIAPMTEKLYPGYYT--------RAQFDLAFVVRYK-----------PDEQPSLMPHHDAST
PLO3_Hs_6093731 574 EQMCDELVAEMEHYGQWSGGRHEDSRLAGGYENVPTVDIHMKQVGY|ED-QWLQLLRTYVGPMTESLFPGYHT--------KARAVMNFVVRYR-----------PDEQPSLRPHHDSST
MRC8.21_At_9294077 177 PSFCEMMLAEIDNFERWVGETKFRIMRPN---TMNKYGAVLDDFGL|DT-MLDKLMEGFIRPISKVFFSDVGGA-------TLDSHHGFVVEYG-----------KDRDVDLGFHVDDSE
K9D7.16_At_10178205 135 PDFFQKLLVEVENMRKWLHEAKLMIRKPN---NKSKYGVVLDDFGM|DI-MLKPLVEDFIFPICKVFFPQVCGT-------MFDTQHGFVIENC-----------EDRDAELGFHVENSD
AK023553_Hs_10435524 87 APFCQALLEELEHFEQSDMPKGRPN-------TMNNYGVLLHELGL|DEPLMTPLRERFLQPLMALLYPDCGGG-------RLDSHRAFVVKYA-----------PGQDLELGCHYDNAE

SPBC6B1.08c_Sp_7491725 91 NMLKYLRQVRDA|LY----SEEFRSHVQKITGCGP-----------LSASKKDLSVNVYS-------------KGCHLMNHDDVI
Yer049wp_Sc_731462 108 SRLPNLFKLRQI|LY----SKQYRDFFGYVTKAGK-----------LSGSKTDMSINTYT-------------KGCHLLTHDDVI
KIAA1612_Hs_10047299 115 RREPHISTLRKI|LF-----EDFRSWLSDISKID------------LESTIFDMSCAKYE--------------TDALLCHDDEL
CG18761_Dm_10726758 137 MPACRLLTNFLQ|VL----RKQVRPWLEKVTNLK------------LD--YVSASCSMYT-------------CGDYLLVHDDLL
C17G10.1_Ce_861277 104 INPVENPAVFSF|RQ--FLYKEVKEWLQNVSGVE------------LTE-QVDCNGSCYA-------------RTDSLLPHNDLI
SanC_Scan_7407127 77 DQVRTLPANWQQ|LVADVVSRPYREALSALTGVD------------LERCLVEARMTRYA-------------RGCWIEPHTDRP
SanF_Scan_8389327 71 PVFPRLPGVWRD|LVEDLRGAEFTAWLEKSTGIE------------LAGLQRSIGLYTHR-------------NGDYLSVHKDKP
sll0428_Ssp_1653558 203 MFLPIFAPFSEL|II-----ERVKAILPQLISDLQ--------IQPFQIDYVEAQLTAHN-------------HGNYYKVHNDNG
sll0191_Ssp_7469866 154 LFSQKIPALSAL|IR-----ERIKQKLPELLGQLN--------FSPFEVAEIELQLTAHN-------------DGCYYRIHNDAG

CAS_Scla_322266 53 FNAEGSEDGHLLLRGL--PVEADADLPTTP-----------SSTP|----APEDRSLLTMEAMLGLVGRRLGLHTG--YRELRSGTVYHDVYP-SPGAHHL-SSETSETLLEFHTEMA
IPNS_En_124825 104 CYLNPNFTPDHPRIQAKTPTHEVNVWPDETKHPG----FQDFAEQ|YYWDVFGLSSALLKGYALALGKEENFFARH-FKPDDTLASVVLIRYPYLDPYP3KTAADGTKLSFEWHEDVS
FLAS_Pet_421946 136 VEGKKGWVDHLFHKIWPPSAVNYRYWPKNPPS------YREANEE|YGKRMREVVDRIFKSLSLGLGLEGHEMIEA-AGGDEIVYLLKINYYP--PCPR-----PDLALGVVAHTDMS
LDOX_Pet_1730108 137 ACGQLEWEDYFFHCAFPEDKRDLSIWPKNPTD------YTPATSE|YAKQIRALATKILTVLSIGLGLEEGRLEKEVGGMEDLLLQMKINYYP--KCPQ-----PELALGVEAHTDVS
Srg_At_479047 135 EDQKLDWADLFFHTVQPVELRKPHLFPKLPLP------FRDTLEM|YSSEVQSVAKILIAKMARALEIKPEELEKL-FDDVDSVQSMRMNYYP--PCPQ-----PDQVIGLTPHSDSV
EFE_Le_398992 80 EVTDLDWESTFFLRHLPTS--NISQVPDLDEE------YREVMRD|FAKRLEKLAEELLDLLCENLGLEKGYLKNAFYGSKGPNFGTKVSNYP--PCPK-----PDLIKGLRAHTDAG
Ga20Ox_Sot_10800976 156 LQGSSHMVDQYFLKTM------GEDFSH----------IGKFYQE|YCNAMSTLSSGIMELLGESLGVSKNHFKQ---FFEENESIMRLNYYP--TCQK-----PDLALGTGPHCDPT
PA0147_Pa_9945977 97 FKETFDMALHLPAEHPDVRAGKSFYGPNRHPDLP--GWEALLEGH|Y-ADMLALARTVLRALAIALGIEEDFFDR---RFEQPVSVFRLIHYP--PASA---RQSADQPGAGAHTDYG
PA4191_Pa_9950401 98 WKEGLYLGSELDAEHPEVRAGTPLHGANLFPEVP---GLRETLLE|YLDATTRVGHRLMEGIALGLGLEADYFAAR--YTGDPLILFRLFNYPSQPVPE----GLDVQWGVGEHTDYG
ISP7_Sp_729862 181 FRESFYFGNDNLSKDRLLR---PFQGPNKWPSTAGS-SFRKALVK|YHDQMLAFANHVMSLLAESLELSPDAFDE---FCSDPTTSIRLLRYP----------SSPNRLGVQEHTDAD
SPCC1494.01_Sc_7491815 81 FGEKLDMEHQSRGDLKESYDLAGFPDPKLENLCPFIAEHMDEFLQ|FQRHCYKLTLRLLDFFAIGFGIPPDFFSK---SHSSEEDVLRLLKYSI-PEGV---ERREDDEDAGAHSDYG
DAOCS_Lyl_769809 89 CTNTGDYSDYAMVYSMGIS---GNIFPTAH--------FERLWSD|YFDLYYGISRQAARAVLESMDVHLNTDID---ALIDCDPVLRYRYFPDVPEDR---CAEQQPNRMAPHYDLS

RRPO_SHVX_548840 610 KGRLAAFYSRDGQGYSYTGYSH--|KSQGWL-EGLDKLIEACGEKPT--------------TYNQCLVQKYE-------------QGSRIGFHSDEQA
POL_ASPV_487652 719 KGRGASFYSRDLKGYSYTGFSH--|VSRGWP-AFLDKFLSDNKIPLN--------------FYNQCLVQEYS-------------TGHGLSMHKDDES
POL_BSV_409711 708 KGRDAAWYSKEDREYKYNGGSH--|LCRGWP-KWLQLWMQANGVDE---------------TYDCMLAQRYG-------------AQGKIGFHADNEE
RRPO_PMV_139137 554 TGRKAWFFSKDGKPYSYTGGSH--|ASRGWP-NWLEKILAAIEIKEP------------LPEFNQCLVQQFK-------------LQAAIPFHRDDEP
POL_GLV_1154656 638 SNRDAAFYSKGSFSYNCNGGYH--|TGQEWL-GEFDSFLAINGHDLE--------------YFNCVLFQKYD-------------GGHGIGFHRDDEE
Pol_GVA_1405615 603 KGREVAFYSRHSKEYKYNGGSH--|RSLGWD-EALNELTQELGLDD---------------SYDHCLIQRYT-------------AGGSIGFHADDEP
RRPO_ACLSV_1710717 704 NRKAAYFCIDYPMVYFHDKISY--|PTFEAT-GEIKQIIMRARDKWG-------------ANFNSALIQVYN-------------DGCRLPLHSDNEE

T13L16.2_At_2708738 266 RGKGRETIQFGCCYNYAPDRAG--|NPPGILQREEVDPLPHLFKVIIRKLIK-WHVLPPTCVPDSCIVNIYD-------------EGDCIPPHIDNHD
T19K4.220_At_3036813 264 RGKGRVTIQFGCCYNYAPDKAG--|NPPGILQRGDVDPMPSIFKV----------------IIKSCIVNIYE-------------EDDCIPPHIDNHD
At2g48080_At_4249414 197 ETFVLFNKNTKGTKRELLQLGVP-|IFGNTTDEHSVEPIPTLVQSVIDHLLQWRL-IPEYKRPNGCVINFFDQ------------P-FQKPPHVD--
AK000315.1_Hs_7020317 111 APLRNKYFFGEGYTYGAQLQKRGP|GQERLYPPGDVDEIPEWVHQLVIQKLVEHRVIP--EGFVNSAVINDYQ------------PGGCIVSHVDPIH
CG17807_Dm_7291441 167 GSLKHRNVKHFGFEFLYGTNN---|VDPSKP---LEQSIPSACDILWPRLNSFASTW-DWSSPDQLTVNEYE-------------PGHGIPPHVDTHS
CG6144_Dm_7297712 26 LSHIERTPKPRWTQLLNRRLVNY-|GGVPHPNGMIAEEIPEWLQTYVDK--VNNLGVFESQNANHVLVNEYL-------------PGQGILPHTDGPL
CG4036_Dm_7297561 93 DLDDLPWDISQSGRRKQNFGPKTN|FKKRKLRLGSFAGFPRTTEYVQRRFED--VPLLRGFQTIEQCSLEYEPS-----------KGASIDPHVDDCW
FLJ2001_Hs_38923019 91 LMDRDPWKLSQSGRRKQDYGPKVN|FRKQKLKTEGFCGLPSFSREVVRRMGL--YPGLEGFRPVEQCNLDYCPE-----------RGSAIDPHLDDAW
C14B1.10_Ce_6580210 188 QSLKHRAVVHFGHVFDYSTNS---|-ASEWK---EADPIPPVINSLIDRLIS---DKYITERPDQVTANVYE-------------SGHGIPSHYDTHS
SPAP8A3.02c_Sp_7491301 62 VQRNLINNVPKELLSIYGSGKQ--|SHLYIPFPAHINCLNDYIPSDFKQR------LWKGQDAEAIIMQVYN-------------PGDGIIPHKDLEM
L3377.4_Lm_9989036 49 DASNLRKGYVDVYTRASDRIILND|GRFQLPPLPPASFMPLLERLEQDN-------VVPKSWLNNQTANLYE-------------PGDFIRAHIDNLF

MTCI237.14c_Mtu_2052134 58 RRQMYDRVVDVPRLVSFHDLTI--|EDPPHPQLARMRRRLNDIYGGELG----------EPFTTAGLCYYRD-------------GSDSVAWHGDTIG
AlkB_Cc_2055386 43 ALGSLGWTSDARGYRYVDRHPE--|TGRPWP-DMPPALLDLWTVLGD-----------PETPPDSCLVNLYA-------------TGARMGLHQDRDE
ALKB_Ec_113638 63 NCGHLGWTTHRQGYLYSPIDPQ--|TNKPWPAMPQSFHNLCQRAATAAGY--------PDFQPDACLINRYA-------------PGAKLSLHQDKDE
AlkB_Scoe_8894829 60 RQVCLGRHWYPYGYAATAVDGD--|GAPVKPFPARLDGLARRAVTDALGAEAV-----APAPYDIALINFYD-------------ADARMGMHRDADE
AlkB_At_4835778 173 LLRKLRWSTLGLQFDWSKRNYD--|--VSLPHNNIPDALCQLAKTHAAI----AMPDGEEFRPEGAIVNYFG-------------IGDTLGGHLDDME
AlkB_Sp_3080529 131 VHKKLRWVTLGEQYDWTTKEYP--|DPSKSP-GFPKDLGDFVEKVVK------ESTDFLHWKAEAAIVNFYS-------------PGDTLSAHIDESE
AlkB_Hs_2134723 89 LLEKLRWVTVGYHYNWDSKKYS--|ADHYTP--FPSDLGFLSEQVAAAC-------GFEDFRAEAGILNYYR-------------LDSTLGIHVDRSE

Consensus(85%): ........................|..........................................h..a..................h..H.D... * *


Figure 1 (continued on the next page)



Pseudomonas oleovorans protein also named AlkB. Fortuitously, these latter alkane hydroxylases are also oxygenases; however, they are not 2OG-Fe(II) dioxygenases but a distinct class of di-iron enzymes [23]. On the basis of the results of database searches with representative sequences, we delineated

several distinct families within the 2OG-Fe(II) dioxygenase fold and constructed individual alignments for each using the ClustalW program [24]. Secondary structure was predicted for each family using the PHD [25] and PSI-PRED [26] programs. Using the secondary structure elements


Secondary Structure: ...................EEEEEEE.............EEEEEEE...............EEEEE...EEEEEE...............EEEEEE........EEEEEEEEEE.

YBIX_Ec_3025022 RSHPQN--------GWMRTDLSATLFLSDPQSY------DGGELVVNDTFGQ---------HRVKLPA-GDLVLYPS-----------SSLHCVTPVT----RGVRVASFMWIQS 189  YbiX family
PA4515_Pa_9950756 RDVHGGR-------ERVRTDLSSTLFFSDPEDY------DGGELVIQDTYGL---------QQVKLPA-GDLVLYPG-----------TSLHKVNPVT----RGARYASFFWTQS 178
Ybix_Bsep_5777381 RPVPGTA-------ERVRTDLSATLFFSEPDSY------DGGELVVDDTYGP---------RTVKLPA-GHMVLYPG-----------TSLHKVTPVT----RGARISAFFWLQS 178
XF0598_Xf_9105464 MSLPIAPG---HTPASLRSDISCTLFLNDPDEY------EGGELIIADTYGE---------HEVKLPA-GDLIIYPS-----------TSLHRVAPVT----RGMRIASFFWVQS 182

Lep_Rr_5805194 LNAESLVCIK-EPPAYTFRDYSAILYLN-GD-------FDGGNFYFTELDAKTV-------TAEVQPQCGRAVGFSSG---------TENPHGVKAVT----RGQRCAIALWFTL 670  Leprecans
AK025841_Hs_10438479 LNAETLVCVK-EPLAYTFRDYSAILYLN-GD-------FDGGNFYFTELDAKTV-------TAEVQPQCGRAVGFSSG---------TENPHGVKAVT----RGQRCAIALWFTL 332
FLJ10718_Hs_8922619 LDPEANECWK-EPPAYTFRDYSALLYMN-DD-------FEGGEFIFTEMDAKTV-------TASIKPKCGRMISFSSG---------GENPHGVKAVT----KGKRCAVALWFTL 490
Lep_Ci_9229924 LK-ENGSCLK-ERPAYTWRDYSAILYLN-DE-------FEGGEFIMTDATARRV-------KVQVRPKCGRLVSFSAG---------KECLHGVKPVT----KGRRCAMALWFTM 388
AK025875_Hs_10438524 YG---------------SFDYTSLLYLS-NYLED----FGGGRFMFMEEGAN----------KTVEPRAGRVSFFTSG---------SENLHRVEKVH----WGTRYAITIAFSC 307

EGL-9_Ce_5923812 K--------------DGRCITTIYYCNENWDMA-----TDGGTLRLYPETSMT--------PMDIDPR-ADRLVFFWSD--------RRNPHEVMPVF-----RHRFAITIWYMD 566  EGL-9 family
AK025273_Hs_10437756 G--------------DGRCITCIYYLNKNWDAK-----LHGGILRIFPEGKSF--------IADVEPI-FDRLLFFWSD--------RRNPHEVQPSY-----ATRYAMTVWYFD 214
CG1114_Dm_7296772 K--------------DGRVITAIYYLNINWDAR-----ESGGILRIRPTPGTT--------VADIEPK-FDRLIFFWSD--------IRNPHEVQPAH-----RTRYAITVWYFD 248
VCA0949_Vc_9658386 G-------------NENRKLTTVFYLNENWTP------ADGGELKIYDLQDNW--------IETLAPV-AGRLVVFLS---------ERFPHEVLEAH-----ADRVSIAGWFRT 193
PA0310_Pa_9946156 D-------------DDARTVSAVLYLNDAWLP------EHGGALRLHLPQR----------QVDIQPT-GGSLVVFMS---------AGTEHEVLPAS-----RDRLSLTGWFRR 223

P4HA_Hs_4505565 KDEPDAF----RELGTGNRIATWLFYMSDVS--------AGGATVFPEVG------------ASVWPKKGTAVFWYNLFA--SGEGDYSTRHAACPVL----VGNKWVSNKWLHE 519  Prolyl hydroxylase family
CG9728_Dm_7301967 DPNHY-------PAEEGNRIATAIFYLSEVE--------QGGATAFPFLD------------IAVKPQLGNVLFWYNLHR--SLDKDYRTKHAGCPVL----KGSKWIGNVWIHE 903
CG1546_Dm_10726875 EVNLP-------KNFEGDRISTSMFYLSDVE--------QGGYTVFTKLN------------VFLPPVKGALVMWHNLHR--SLHVDARTLHAGCPVI----VGSKRIGNIWMHS 504
P4HA_Ce_1709529 KEESKSF----ESLGTGNRIATVLFYMSQPS--------HGGGTVFTEAK------------STILPTKNDALFWYNLYK--QGDGNPDTRHAACPVL----VGIKWVSNKWIHE 511
P4HA_Dm_7301948 KEEQRAF----EGLNLGNRIATVLFYMSDVE--------QGGATVFTSLH------------TALFPKKGTAAFWMNLHR--DGQGDVRTRHAACPVL----TGTKWVSNKWIHE 300
CG9713_Dm_7301967 KTNSN------MTAMSGDRIATVLFYLTDVE--------QGGATVFPNIR------------KAVFPQRGSVVMWYNLKD--NGQIDTQTLHAACPVI----VGSKWVCNKWIRE 511
C14E2.4_Ce_1166595 FANAEDSNR--HFGEMGNRLATFIMVFKKAE--------KGGGTLFPQLG------------NVFRANPGDAFLWFNCNG--NLEREAKSLHGGCPIR----AGEKIIATIWIRI 277
Y43F8B.4.b_Ce_3947627 YPSEKEWDE--WMRVNGNRFGTLIMAFGAAE--------SGGATVFPRLG------------AAVRTKPGDAFFWFNAMG--NSEQEDLSEHAGCPIY----KGQKQISTIWLRM 508
T20B3.7_Ce_7508029 YRSKQDYDW--WMNKTGNRIGTLIFVLKPAE--------KGGGTVFPSIG------------STVRANAGDAFFWFNAQA--DEEKEMLSNHGGCPIY----EGRKVIATIWIRA 199
CG9708_Dm_10726879 SKTFS-------LETVGDRIATVLFYLNNVD--------HGGATVFPKLN------------LAVPTQKGSALFWHNIDR-KSYDYDTRTFHGACPLI----SGTKLVVPQNIS 129
P4HA_At_6437556 HDKVN-------IARGGHRIATVLLYLSNVT--------KGGETVFPDAQ------------VCLKPKKGNALLFFNLQQ--DAIPDPFSLHGGCPVI----EGEKWSATKWIHV 225
P4AH_PBCV_9631654 CDDACP---------KDQRLATLMVYLKAPEEG------GGGETDFPTLK------------TKIKPKKGTSIFFWVADP-VTRKLYKETLHAGLPVK----SGEKIIANQWIRA 240
B12F1.100_Nc_9368561 PPSGIHPTDASPADKKQSSLFTFLMYLNDEF--------EGGETTFFTPSVRDGVMNAH----PVRPVMGSVAVFPHG------ENHGALLHEGTGVR----KGAKYIIRTDVEF 488

PLOD_Ce_6093732 ------------------FSIDIALNKKGRD-------YEGGGVRYIRYNC-----------TVPADE--VGYAMMF-PG------RLTHLHEGLATT----KGTRYIMVSFINP 730  Lysyl hydroxylase family
PLO_Dm_7294743 ------------------YTINIAMNRAGID-------YQGGGCRFIRYNC-----------SVTDTK--KGWMLMH-PG------RLTHYHEGLLVT----NGTRYIMISFIDP 721
PLO2_Hs_6093730 ------------------FTINIALNNVGED-------FQGGGCKFLRYNC-----------SIESPR--KGWSFMH-PG------RLTHLHEGLPVK----NGTRYIAVSFIDP 737
PLO1_Hs_400205 ------------------FTINIALNRVGVD-------YEGGGCRFLRYNC-----------SIRAPR--KGWTLMH-PG------RLTHYHEGLPTT----RGTRYIAVSFVDP 727
PLO3_Hs_6093731 ------------------FTLNVALNHKGLD-------YEGGGCRFLRYDC-----------VISSPR--KGWALLH-PG------RLTHYHEGLPTT----WGTRYIMVSFVDP 738
MRC8.21_At_9294077 ------------------VTLNVCLGNQ----------FVGGELFFRGTRCEKHVN------TATKAD--ETYDYCHIPG-QAVLHRGRHRHGARATT----CGHRVNMLLWCRS 347
K9D7.16_At_10178205 ------------------ITLNVCLSKQ----------SEGGEILFTGTRCNKHLK------AGPKPE--EIFEYCHEPG-QAILHLGCHSHGAKAAI----SCSRANMILWCIN 306
AK023553_Hs_10435524 ------------------LTLNVALGKV----------FTGGALYFGGLFQAPT--------ALTEPL--EVEHVVG-QG---VLHRGGQLHGARPLG----TGERWNLVVWLRA 249

SPBC6B1.08c_Sp_7491725 ----------------GTRCISYILYLVEPDEGWK--PEYGGALRLFPTLQPSFP--AADFCHSIPPQ-WNQLSFFRVKP-------GHSFHDVEEVYV---DKPRMAISGWFHY 230  SanF/SanC family
Yer049wp_Sc_731462 ----------------GSRRISFILYLPDPDRKWK--SHYGGGLRLFPSILPNVP--HSDPSAKLVPQ-FNQIAFFKVLP-------GFSFHDVEEVKV---DKHRLSIQGWYHI 247
KIAA1612_Hs_10047299 ----------------EGRRIAFILYLVPPWDR-----SMGGTLDLYSIDEHFQP--KQIV-KSLIPS-WNKLVFFEVS--------PVSFHQVSEVLS---EKSRLSISGWFHG 247
CG18761_Dm_10726758 ----------------KDRQVAFIYYLSPWEGAEEWTDEQGGCLEIFGSDDQCFP--QFPVQRKIAPK-DNQFAFFKVG--------SRSFHQVGEVTT---DYPRLTINGWFHG 275
C17G10.1_Ce_861277 ----------------ETRRFAFVYYITSADWDSE---VNGGDLQLFNHDKKLQP--TSVA-AQFSPL-RNSFMLFEVS--------EKSWHRVAEMLS---EEPRLSINGWFHS 240
SanC_Scan_7407127 -----------------DKAVTHLFYFND-GWDP----EWRGDLRLLRSADMAD------CAKRVAPT-TGTSVVLVRS--------DRSWHGVPPVADT--PVDRRALLVHFVR 212
SanF_Scan_8389327 -----------------TKAITVILYLNR-DWPV----EAGGQFQIFASPKEGP-------TEEISPV-GGQLLAFPPT--------DKSWHAVSKIEHP--GTERITVQIEYWL 205
sll0428_Ssp_1653558 ------------SPDSATRELTYVYYFNR-EPKA----FSGGELAIYDSKIENNFYVAAESFKTVQPV-NNSIVFFLS----------RYMHEVLPVNCP3-ADSRFTINGWVRK 349
sll0191_Ssp_7469866 ------------SEKTASRQITYVYYFYQ-EPKA----FSGGELRLYDTELKNNTITTHPKFQTITPI-NNSIIFFNS----------RCRHEVMSVVCP3-AHSRFTVNGWIRK 300

CAS_Scla_322266 -------------YHRLQPNYVMLACSRADHE------RTAATLVASVRK---70---VTEAVYLEPG-DLLIVDNF-----------RTTHARTPFSPRWDGKDRWLHRVYIRT 302  Small molecule dioxygenases
IPNS_En_124825 -------------------LITVLYQ------------SNVQNLQVETAA--------GYQDIEADDT-GYLINCGSYMAHLTNNYYKAPIHRVKWVN-----AERQSLPFFVNL 288
FLAS_Pet_421946 -------------------YITILVP------------NEVQGLQVFKDG--------HWYDVKYIPN-ALIVHIGDQVEILSNGKYKSVYHRTTVNK----DKTRMSWPVFLEP 309
LDOX_Pet_1730108 -------------------ALTFILH------------NMVPGLQLFYEG--------QWVTAKCVPN-SIIMHIGDTIEILSNGKYKSILHRGVVNK----EKVRFSWAIFCEP 311
Srg_At_479047 -------------------GLTVLMQV-----------NDVEGLQIKKDG--------KWVPVKPLPN-AFIVNIGDVLEIITNGTYRSIEHRGVVNS----EKERLSIATFHNV 309
EFE_Le_398992 -------------------GIILLFQD-----------DKVSGLQLLKDE--------QWIDVPPMRH-SIVVNLGDQLEVITNGKYKSVLHRVIAQT----DGTRMSLASFYNP 253
Ga20Ox_Sot_10800976 -------------------SLTILHQ------------DSVSGLQVFMDN--------QWRSISPNLS-AFVVNIGDTFMALSNGRYKSCLHRAVVNN----KTPRKSLAFFLCP 317
PA0147_Pa_9945977 -------------------CVTLLYQ------------DAAGGLQVQNRQG-------EWIDAPPIDG-TFVVNIGDMMARWSNDRYRSTPHRVISPR----GVHRYSMPFFAEP 274
PA4191_Pa_9950401 -------------------LLTLLHQ------------DAIGGLQVRTPQ--------GWLEAPPIPG-SFVCNLGDMLERMTGGLYRSTPHRVARNTS---GRDRLSFPLFFDP 277
ISP7_Sp_729862 -------------------ALTLMSQ------------DNVKGLEILDPVSN------CFLSVSPAPG-ALIANLGDIMAILTNNRYKSSMHRVCNNS----GSDRYTIPFFLQG 353
SPCC1494.01_Sc_7491815 -------------------SITLLFQ------------RDAAGLEIRPPNFVKDM---DWIKVNVQPD-VVLVNIADMLQFWTSGKLRSTVHRVRIDPG---VKTRQTIAYFVTP 267
DAOCS_Lyl_769809 -------------------IVSLILQTPCP--------NGFVSLQVEIDG--------RFVEVPPRPG-CVVVFCGSIAPLVSDGKIKAPQHRVVS-PGA4-GSNRTSSVLFLRP 268

RRPO_SHVX_548840 IYPKG------------NKILTVNAA-------------GSGTFGI---------------KCAKGE-TTLNLEDGD-YFQMPSGFQETHKHNVVA------VTPRLSFTFRSTV 743  RNA viral AlkB homologs
POL_ASPV_487652 IYDIN------------HQVLTVNYS-------------GDAIFCI---------------ECLGSGF-EIPLSGPQ-MLLMPFGFQKEHRHGIKSP-----SKGRISLTFRLTK 853
POL_BSV_409711 IFMRG------------APVHTVSMD-------------GNADFGT---------------ECAAGR--QYTTLRGNVQFTMPSGFQETHKHAVRNT-----TAGRVSYTFRRLA 841
RRPO_PMV_139137 CYPKG------------HQVLTINHS-------------GECLTQI---------------ACQKGKA-SITMGFGD-YYLSPVGFQESHKHAVSNT-----TGGRVSLTFRCTV 690
POL_GLV_1154656 IFEKD------------SKILTVCIQ-------------GDCEFRF---------------RCATGET-GFYMEAPK-QFMMPDGFQSNHVHAVREC-----TPGRISATFRRAK 772
Pol_GVA_1405615 CYLPG------------GSVVTVNLH-------------GDATFEVK--------------ENQSGKIEKKELHDGD-VYVMGPGMQQTHKHRVTSH-----TDGRCSITLRNKT 738
RRPO_ACLSV_1710717 CYDD-------------DEILTINVV-------------GDAKFHT---------------TC-HGE--IIDLRQGD-EILMPGGYQKMNKHAVEVA-----SEGRTSVTLRVHK 836

T13L16.2_At_2708738 FL---------------RPFCTISFL-------------SECDILFGSNLKVE------GPGDFSGSY-SIPLPVGS-VLVLNGNGADVAKHCVPAV-----PTKRISITFRKMD 420  Eukaryotic family of AlkB paralogs
T19K4.220_At_3036813 FL---------------RPFCTVSFL-------------SECNILFGSNLKVL------GPGEFSGSY-SIPLPVGS-VLVLKGNGADVAKHCVPAV-----PTKRISITFRKMD 403
At2g48080_At_4249414 -----------------QPISTLVL--------------SESTMVFGHRLGVD------NDGNFRGSL-TLPLKEGS-LLVMRGNSADMARHVMCPS-----PNKRVAITFFKLK 351
AK000315.1_Hs_7020317 IFE--------------RPIVSVSFF-------------SDSALCFGCKFQFK-------PIRVSEPVLSLPVRRGS-VTVLSGYAADEITHCIRPQDI---KERRAVIILRKTR 270
CG17807_Dm_7291441 AFL--------------DPILSLSLQ-------------SDVVMDFRRG---------------DDQV-QVRLPRRS-LLIMSGEARYDWTHGIRPKHID13RGKRTSLTFRRLR 325
CG6144_Dm_7297712 FH---------------PIISTISTG-------------AHTVLEFVKREDTTTETEAGDQTTREVLF-KLLLEPRS-LLILKDTLYTDYLHAISETSED24RSPRISLTIRNVP 213
CG4036_Dm_7297561 IWGERVVTVNC------LGDSVLTLT--PYEVQQSGKYNLDLVASYEDELLAP-LLTDDQLATFEGKVLRIPMPNLS-LIVLYGPARYQFEHSVLREDV---QERRVCVAYREFT 278
FLJ2001_Hs_38923019 LWGERLVSLNL------LSPTVLSMC-----REAPGSLLLCSAPSAAPEALVDSVIAPSRSVLCQEVEVAIPLPARS-LLVLTGAARHQWKHAIHRRHI---EARRVCVTFRELS 274
C14B1.10_Ce_6580210 AFD--------------DPIVSISLL-------------SDVVMEFKD-------------GANSARIAPVLLKARS-LCLIQGESRYRWKHGIVNRKYD10RQTRVSLTLRKIR 343
SPAP8A3.02c_Sp_7491301 FGDG-------------VAIFSFLSN-------------TTMIFTHPE--------------LKLKS--KIRLEKGS-LLLMSGTARYDWFHEIPFRAGD12RSQRLSVTMRRII 219
L3377.4_Lm_9989036 VYD--------------DIFAICSLG-------------SNCLLRFVH-------------VQNGEEL-DVMVPDRS-VYIMSGPARYVYFHMVLPV-----EAQRFSLVFRRSI 193

MTCI237.14c_Mtu_2052134 RGSTEDTM---------VAIVSLGAT-------------RVFALRP----------------RGRGPSLRLPLAHGD-LLVMGGSCQRTFEHAVPKTSAP--TGPRVSIQFRPRD 203  Classic AlkB
AlkB_Cc_2055386 ADPR-------------FPLLSISLG-------------DTAVFRIGG-------------VNRKDPTRSLRLASGD--VCRLLGPARLAFHGVDRILPG6-GGGRINLTLRRAR 190
ALKB_Ec_113638 PDLR-------------APIVSVSLG-------------LPAIFQFGG-------------LKRNDPLKRLLLEHGD--VVVWGGESRLFYHGIQPLKAG5-IDCRYNLTFRQAG 213
AlkB_Scoe_8894829 RTD--------------APVVSLSLG-------------DTCVFRFGN------------PETRTRPYTDTELRSGD--LFVFGGPSRLAYHGVPRVHPG7-LRGRLNITLRVSG 215
AlkB_At_4835778 ADWS-------------KPIVSMSLG-------------CKAIFLLGGK-------------SKDDPPHAMYLRSGD--VVLMAGEARECFHGNLLHFQL34KTSRININIRQVF 354
AlkB_Sp_3080529 EDLT-------------LPLISLSMG-------------LDCIYLIGTE------------SRSEKPS-ALRLHSGD--VVIMTGTSRKAFHGKHC-------SFKYLIYSQLIA 272
AlkB_Hs_2134723 LDHS-------------KPLLSFSFG-------------QSAIFLLGGL------------QRDEAPP-PMFMHSGD--IMIMSGFSRLLNHAVPRVLPN39KTARVNMa RQVL 272

Consensus(85%): .....................sh.h................s...h....................s.....h..................H.s...........+h.h..b... * *

Figure 1 (continued from the previous page)

from the experimentally determined structures of IPNS (PDB:1ips), DAOCS (PDB:1dcs), CAS (PDB:1drt) and the predicted secondary structures for individual families, the conserved core of these elements was delineated (Figure 1). A multiple alignment for the entire superfamily was constructed by aligning the conserved sequence features shared by all the individual families. The boundaries of the

secondary structure elements in the alignment were adjusted using the secondary structure conservation as a guide. The conserved portion of the 2OG-Fe(II) dioxygenase superfamily proteins comprises the core DSBH domain seen in the IPNS, DAOCS and CAS structures [8-10] and part of the amino-terminal α helix that showed considerable variability


dyad is located in a flexible loop that follows the first conserved strand and stacks with the sheet containing three of the core strands (Figures 2,3). The second conserved histidine is associated with the beginning of the sixth strand, whereas the conserved basic residue (R or K) is in the beginning of the seventh strand of the DSBH core (Figures 2,3). The sixth position after the conserved basic residue is invariably occupied by a bulky residue that is either arginine (in AlkB) or phenylalanine or tryptophan in all other members of this fold [8-10] (Figure 1). The roles of all these conserved residues in catalysis are apparent from the crystal structures of IPNS, clavaminic acid synthase (CAS) and deacetoxycephalosporin C synthase (DAOCS) [8-10]. The conserved HXD and the carboxy-terminal histidine coordinate Fe(II) that is directly involved in catalysis by these enzymes (Figure 2). The conserved basic residue interacts with the carboxylate group of the acidic substrate, while the bulky or aromatic residue located carboxy-terminal to the basic one forms the base of the cleft that holds this substrate molecule (Figure 2). Whereas most of these enzymes necessarily use 2-OG as the acidic substrate, IPNS does not bind 2-OG. Its place is occupied by its single substrate, L-δ-(α-aminoadipoyl)-L-cysteinyl-D-valine, whose carboxyl sidechain interacts with the conserved basic residue similarly to the OG-carboxylate in the other enzymes of this superfamily [9,10]. The conservation of all the residues implicated in catalysis in the biochemically uncharacterized proteins such as AlkB, EGL-9, leprecan and YbiX suggests they all catalyze oxidative reactions similar to those catalyzed by IPNS, DAOCS, CAS and related enzymes such as protein lysyl/prolyl hydroxylases, EFE and leucoanthocyanidin oxidases [1,2]. Using the available contextual information, we attempted to predict possible substrates of these proteins.
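As a rough illustration of the residue pattern described above, the following Python sketch screens a protein sequence for an HXD dyad, a downstream histidine, and a basic residue (R or K) followed six positions later by R, F or W. The spacer bounds and the names `build_pattern` and `find_dsbh_motif` are assumptions made purely for illustration; they are not derived from the alignment in Figure 1, and a regular expression is a far cruder screen than the profile searches used in this study.

```python
import re


def build_pattern(gap1=(20, 150), gap2=(5, 60)):
    """Compile a regex for the HXD ... H ... [RK]x5[RFW] arrangement.

    gap1 and gap2 are hypothetical spacer bounds (residues between the
    motif elements); they are illustrative guesses, not published values.
    """
    return re.compile(
        r"(?P<hxd>H.D)"                       # HXD dyad
        rf".{{{gap1[0]},{gap1[1]}}}?"         # variable spacer, shortest match first
        r"(?P<his>H)"                         # second conserved histidine
        rf".{{{gap2[0]},{gap2[1]}}}?"         # variable spacer, shortest match first
        r"(?P<basic>[RK]).{5}(?P<bulky>[RFW])"  # basic residue, bulky residue at +6
    )


def find_dsbh_motif(seq, pattern=None):
    """Return 1-based positions of the motif elements, or None if absent."""
    match = (pattern or build_pattern()).search(seq.upper())
    if match is None:
        return None
    return {name: match.start(name) + 1
            for name in ("hxd", "his", "basic", "bulky")}


if __name__ == "__main__":
    # Synthetic sequence constructed only to exercise the pattern.
    toy = "M" + "A" * 10 + "HQD" + "G" * 30 + "H" + "A" * 10 + "RKSLAFFLCP"
    print(find_dsbh_motif(toy))
```

A hit from such a screen only indicates that the residues occur in the expected order; it says nothing about whether the protein adopts the DSBH fold, which is why profile-based comparison and secondary structure prediction remain necessary.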
AlkB binds ssDNA and is required for the processing of toxic DNA modifications caused by SN2 alkylating agents such as methylmethanesulfonate (MMS) specifically on ssDNA. On the basis of the preferential specificity of AlkB-dependent repair for ssDNA and for SN2 as opposed to SN1 alkylating agents, it has been proposed that its targets include modified bases such as N1-methyladenine and N3-methylcytosine [18]. The predicted 2OG-Fe(II) dioxygenase activity of AlkB is probably involved in the detoxification of methylated bases in ssDNA, possibly through hydroxylation of the methyl groups, resulting in less toxic base derivatives. The hydroxylation might be followed by a second oxidative step that could remove the hydroxymethyl group, restoring the normal base in DNA. These reactions are consistent with the observation that, unlike AlkA, AlkB has no DNA glycosylase activity [18]. The most intriguing finding made during our analysis of the AlkB family is the existence of multiple eukaryotic AlkB homologs, other than the actual orthologs, especially in plants and their viruses. The specific action of AlkB on ssDNA suggests that ssRNA could be the substrate


Figure 2 A structural model of the DSBH core of the 2OG-Fe(II) dioxygenase superfamily. This is based on the Emericella nidulans isopenicillin N synthase structure (PDB:1ips). The side chains of the amino acid residues implicated in catalysis and in substrate binding are shown (see text) and the Fe(II) ion is indicated by a red circle.


in both length and sequence between individual families (Figures 1,2). The DSBH region includes seven conserved strands that are common to all these proteins and are arranged in two sheets in a jelly-roll topology (Figures 2,3). However, different families have specific inserts in various positions between the conserved strands; some of these inserts contain additional secondary structures and show significant sequence conservation (Figures 1,3). For example, the insert between the fifth and sixth strand in AlkB is predicted to contain an extra strand, whereas in the small-molecule dioxygenase (IPNS/ethylene-forming enzyme (EFE)) family, the same region forms (or is predicted to form) a short helix (Figures 1,3). The clavaminic acid synthase, an outlier of this latter family, has its own characteristic inserts, including a giant (approximately 70 amino acids) insert between strands 4 and 5 [8], and some members of the AlkB family have smaller inserts in the same position (Figure 1). This reflects the relative resilience of the core DSBH to insertions, and accounts for difficulties in unification of this superfamily by sequence-based methods. The multiple alignment contains at least three characteristic conserved motifs that center, respectively, at a HXD dyad near the amino terminus, a histidine towards the carboxyl terminus, and an arginine or lysine further downstream (Figure 1). The HXD


Figure 3 Topological diagrams for three members of the 2OG-Fe(II) dioxygenase superfamily. The diagrams are based on the experimentally determined structures for E. nidulans isopenicillin N synthase (PDB:1ips) and structural models of prolyl-4-hydroxylase and AlkB. The amino acid residues of the active site and the Fe(II) ion are shown as in Figure 2. [Panel titles: isopenicillin synthase; prolyl-4-hydroxylase; AlkB. Labels in the diagrams: N, variable insert, variable region.]

for other members of this family. Given the presence of RNA methylases that could modify RNA in a similar way to the alkylating agents, the function of the uncharacterized AlkB-like proteins could be to reverse such modifications. Such modifications might also be used in the host-mediated inactivation of viral RNAs, and the AlkB homologs acquired by some plant viruses could counter this host-defense mechanism. The connection between nucleic acid methylation and the AlkB homologs described here is supported by the domain architecture of a conserved pair of animal proteins (C14B1.10 from C. elegans and CG17807 from Drosophila) in which an amino-terminal AlkB-like domain is fused to a carboxy-terminal predicted methyltransferase domain. These proteins could potentially be involved in regulation of RNA stability or DNA repair via controlled methylation/demethylation. EGL-9 defines a new family within the 2OG-Fe(II) dioxygenase superfamily that is highly conserved in animals and is also represented in the pathogenic proteobacteria such as

Ps. aeruginosa and Vibrio cholerae. In C. elegans, EGL-9 is required for normal egg laying, whereas loss of its function provides resistance to hypercontractile muscular paralysis caused by Ps. aeruginosa [19]. The vertebrate homolog of EGL-9 is specifically expressed in smooth muscles and is likely to have a role in their function [27]. At its carboxyl terminus, EGL-9 contains a MYND finger domain that is probably involved in specific protein-protein interactions [28]. The closest relatives of the EGL-9 family are the proline hydroxylases, with which they share a region of specific extended conservation amino-terminal to the core DSBH domain (Figure 1). This relationship, along with the combination with the intracellular MYND domain and the lack of signal peptides, suggests that the EGL-9 family proteins are prolyl hydroxylases that modify intracellular proteins, unlike the classic prolyl hydroxylases that have been implicated primarily in the modification of collagens in the endoplasmic lumen. The interesting aspect of the EGL-9 family is its presence in V. cholerae and Ps. aeruginosa (Figure 1), which have apparently acquired these genes by horizontal transfer


from eukaryotes. The direct connection between this gene acquisition and the action of the Ps. aeruginosa toxins on animal muscles is unclear, but it seems possible that the bacterial EGL-9-like proteins modify host proteins in a manner that favors the survival and spread of the pathogen. This might be especially pertinent if the host downregulates the endogenous ortholog in response to the infection. Leprecan is a proteoglycan that is associated with the basement membrane in chordates [20]. It and related proteins contain an amino-terminal segment rich in leucine and proline [20] and a carboxy-terminal globular part that includes the 2OG-Fe(II) oxygenase domain. More distant relatives of the leprecan-like proteins include the T23K23.7 protein from Arabidopsis and a family of uncharacterized proteobacterial proteins typified by YbiX (Figure 1). These proteins are predicted to be previously unnoticed amino-acid hydroxylases that catalyze modifications of intracellular and extracellular proteins.

dioxygenase superfamily evolved from the AraC-like superfamily in bacteria through drastic sequence divergence that eliminated significant sequence similarity, followed by fixation of the modified active-site configuration. In the case of the AlkB family, a single horizontal transfer event probably resulted in their entry into the eukaryotic lineage, which was followed by adaptation to new, RNA-related roles by some paralogs. In the case of the small-molecule dioxygenase (IPNS/EFE) family, a complex web of relationships can be discerned. The EFE and the plant secondary metabolite biosynthesis enzymes flavonol synthase, leucoanthocyanidin hydroxylase and gibberellin-20 oxidase show maximum diversity in the plant lineage [1,6,7]. Their close homologs are, however, also found in Pseudomonas, but not in other well-studied proteobacterial lineages. Similarly, fungi contain several enzymes involved in secondary metabolite biosynthesis, such as IPNS and DAOCS, that are distinctly related to their counterparts in actinomycetes. This patchy phyletic distribution across kingdoms is suggestive of multiple gene transfer events that have apparently led to the wide dissemination of these proteins in bacteria and eukaryotes. The close grouping of the ethylene-forming enzyme and its Pseudomonas homologs [30] to the exclusion of other members of this family in plants indicates a possible recent acquisition of the gene in Pseudomonas from its plant hosts. In Arabidopsis there is an expansion of small-molecule dioxygenases (at least 75 members), whereas Ps. aeruginosa has at least three recently duplicated members. This proliferation of the small-molecule dioxygenases is consistent with their possible role in the synthesis of secondary metabolites in these organisms [1,6,7,30]. The amino-acid hydroxylases show the greatest diversity in eukaryotes, and are represented in a number of different forms in both animals and plants. They were probably derived from small-molecule hydroxylases early in eukaryotic evolution. In contrast, the predicted amino-acid hydroxylases of the EGL-9 family are seen sporadically, in single copies, in certain bacterial lineages such as V. cholerae and Ps. aeruginosa, suggesting a secondary horizontal transfer from the eukaryotes to bacteria.


Evolutionary implications
Sequence conservation is high within individual families of the 2OG-Fe(II) dioxygenase superfamily, with specific extensions typical of each family, but low between different families (Figure 1). This observation, together with the phyletic distribution of these proteins, provides some clues to their evolutionary history. Members of this superfamily could not be detected in the Archaea despite extensive profile searches of the archaeal proteomes as well as transitive searches seeded with many divergent sequences as starting points. This suggests that, unlike bacteria and eukaryotes, in which the superfamily is widely represented, archaea do not encode bona fide members. A corollary of this is that horizontal gene transfer between bacteria and eukaryotes might have had a significant role in the evolution of this superfamily. The DSBH fold that comprises the core domain of the 2OG-Fe(II) dioxygenases is also present in a large number of proteins typified by the arabinose-binding domain of the transcription regulator AraC, plant seed proteins such as vicilin, and oxalate oxidase ([29]; see SCOP [12]). Despite the lack of detectable sequence similarity to 2OG-Fe(II) dioxygenases, these proteins contain a conserved HXH motif and a carboxy-terminal histidine that appear to be equivalent to the metal-chelating HXD and carboxy-terminal histidine of the latter (see above). Thus, these two protein superfamilies could have evolved from a common ancestor by acquiring distinct catalytic and ligand-binding properties. The classic AraC-like DSBH proteins appear to have a more universal distribution than the 2OG-Fe(II) dioxygenases, with diverse forms represented in archaea, bacteria and eukaryotes. The 2OG-Fe(II) dioxygenase superfamily is present in diverse bacteria that had probably diverged before the diversification of the eukaryotes that contain this superfamily. This, together with the phyletic distribution of the AraC and 2OG-Fe(II) dioxygenase superfamilies, leads us to speculate that the 2OG-Fe(II)


Conclusions
Before this study, structure determination, biochemical studies and sequence comparisons of 2OG-Fe(II) dioxygenases [1,9] had elucidated their structural fold, active-site residues and reaction mechanism. Here, using sequence profile searches, we show that many other protein families contain the same constellation of active-site residues and are predicted to adopt the same fold. This allows us to predict the catalytic activity of a wide range of functionally important, but biochemically uncharacterized, proteins from eukaryotes and bacteria. In particular, we propose a specific mechanism of action in DNA repair, and possibly in RNA modification, for the AlkB protein and its homologs.


Materials and methods


The Non-redundant Protein Sequence Database [21], the Expressed Sequence Tags Database (NCBI) [31] and the individual protein sequence databases of completely and partially sequenced genomes [32] were searched using the gapped version of the BLAST programs (BLASTPGP for proteins and TBLASTNGP for translating searches of nucleotide databases) [22]. Sequence profile searches were performed using the PSI-BLAST program [22]; profiles were saved using the -C option and retrieved using the -R option. Multiple alignments of amino acid sequences were generated using a combination of PSI-BLAST, CLUSTALW [24] and secondary structure predictions that were produced using the PHD program [25] and the PSI-PRED program [26], with multiple alignments of individual protein families used as queries. The three-dimensional structure visualization, alignment and modeling were carried out using the SWISSPDB-Viewer program [33].
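To make the notion of a sequence profile concrete, the sketch below builds a toy position-specific scoring matrix from an ungapped alignment block and slides it along a query sequence. It is a deliberately simplified stand-in: it assumes a uniform background, a flat pseudocount and no sequence weighting, so it does not reproduce the PSI-BLAST procedure [22]. The function names and the example alignment block are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log2-odds scores per column of an ungapped block.

    Simplifications (assumptions of this sketch): uniform background
    frequencies and the same flat pseudocount for every residue type.
    """
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows of the block must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_window(pssm, window):
    """Sum column scores for a window; unknown characters score zero."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, window))


def best_hit(pssm, seq):
    """Return (score, 0-based start) of the best-scoring ungapped window."""
    width = len(pssm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        (score_window(pssm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


if __name__ == "__main__":
    block = ["HRV", "HRI", "HKV"]  # hypothetical three-column block
    profile = build_pssm(block)
    print(best_hit(profile, "AAHRVAA"))
```

Because every column carries its own scores, a residue conserved across the block (such as the histidine in the first column of the toy example) contributes more to a match than a variable position, which is the property that lets profile searches detect relationships missed by single-sequence comparison.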

References
1. Prescott AG: A dilemma of dioxygenases: or where molecular biology and biochemistry fail to meet. J Exp Bot 1993, 44:849-861.
2. Hegg EL, Que L Jr: The 2-His-1-carboxylate facial triad - an emerging structural motif in mononuclear non-heme iron(II) enzymes. Eur J Biochem 1997, 250:625-629.
3. Myllyharju J, Kivirikko KI: Characterization of the iron- and 2-oxoglutarate-binding sites of human prolyl 4-hydroxylase. EMBO J 1997, 16:1173-1180.
4. Pirskanen A, Kaimio AM, Myllyla R, Kivirikko KI: Site-directed mutagenesis of human lysyl hydroxylase expressed in insect cells. Identification of histidine residues and an aspartic acid residue critical for catalytic activity. J Biol Chem 1996, 271:9398-9402.
5. Passoja K, Myllyharju J, Pirskanen A, Kivirikko KI: Identification of arginine-700 as the residue that binds the C-5 carboxyl group of 2-oxoglutarate in human lysyl hydroxylase 1. FEBS Lett 1998, 434:145-148.
6. Zhang Z, Barlow JN, Baldwin JE, Schofield CJ: Metal-catalyzed oxidation and mutagenesis studies on the iron(II) binding site of 1-aminocyclopropane-1-carboxylate oxidase. Biochemistry 1997, 36:15999-16007.
7. Lukacin R, Britsch L: Identification of strictly conserved histidine and arginine residues as part of the active site in Petunia hybrida flavanone 3beta-hydroxylase. Eur J Biochem 1997, 249:748-757.
8. Zhang Z, Ren J, Stammers DK, Baldwin JE, Harlos K, Schofield CJ: Structural origins of the selectivity of the trifunctional oxygenase clavaminic acid synthase. Nat Struct Biol 2000, 7:127-133.
9. Valegard K, van Scheltinga AC, Lloyd MD, Hara T, Ramaswamy S, Perrakis A, Thompson A, Lee HJ, Baldwin JE, Schofield CJ, et al.: Structure of a cephalosporin synthase. Nature 1998, 394:805-809.
10. Roach PL, Clifton IJ, Fulop V, Harlos K, Barton GJ, Hajdu J, Andersson I, Schofield CJ, Baldwin JE: Crystal structure of isopenicillin N synthase is the first from a new structural family of enzymes. Nature 1995, 375:700-704.
11. Lange SJ, Que L Jr: Oxygen activating nonheme iron enzymes. Curr Opin Chem Biol 1998, 2:159-172.
12. Lo Conte L, Ailey B, Hubbard TJ, Brenner SE, Murzin AG, Chothia C: SCOP: a structural classification of proteins database. Nucleic Acids Res 2000, 28:257-259.
13. Altschul SF, Koonin EV: Iterated profile searches with PSI-BLAST - a tool for discovery in protein databases. Trends Biochem Sci 1998, 23:444-447.
14. Aravind L, Koonin EV: Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J Mol Biol 1999, 287:1023-1040.
15. Wei YF, Carter KC, Wang RP, Shell BK: Molecular cloning and functional analysis of a human cDNA encoding an Escherichia coli AlkB homolog, a protein involved in DNA alkylation damage repair. Nucleic Acids Res 1996, 24:931-937.
16. Chen BJ, Carroll P, Samson L: The Escherichia coli AlkB protein protects human cells against alkylation-induced toxicity. J Bacteriol 1994, 176:6255-6261.
17. Kondo H, Nakabeppu Y, Kataoka H, Kuhara S, Kawabata S, Sekiguchi M: Structure and expression of the alkB gene of Escherichia coli related to the repair of alkylated DNA. J Biol Chem 1986, 261:15772-15777.
18. Dinglay S, Trewick SC, Lindahl T, Sedgwick B: Defective processing of methylated single-stranded DNA by E. coli AlkB mutants. Genes Dev 2000, 14:2097-2105.
19. Darby C, Cosma CL, Thomas JH, Manoil C: Lethal paralysis of Caenorhabditis elegans by Pseudomonas aeruginosa. Proc Natl Acad Sci USA 1999, 96:15202-15207.
20. Wassenhove-McCarthy DJ, McCarthy KJ: Molecular characterization of a novel basement membrane-associated proteoglycan, leprecan. J Biol Chem 1999, 274:25004-25017.
21. Non-redundant Protein Sequence Database [http://www.ncbi.nlm.nih.gov/BLAST/]
22. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
23. Shanklin J, Achim C, Schmidt H, Fox BG, Munck E: Mossbauer studies of alkane omega-hydroxylase: evidence for a diiron cluster in an integral-membrane enzyme. Proc Natl Acad Sci USA 1997, 94:2981-2986.
24. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22:4673-4680.
25. Rost B, Sander C: Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 1993, 232:584-599.
26. Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292:195-202.
27. Wax SD, Rosenfield CL, Taubman MB: Identification of a novel growth factor-responsive gene in vascular smooth muscle cells. J Biol Chem 1994, 269:13041-13047.
28. Masselink H, Bernards R: The adenovirus E1A binding protein BS69 is a corepressor of transcription through recruitment of N-CoR. Oncogene 2000, 19:1538-1546.
29. Gane PJ, Dunwell JM, Warwicker J: Modeling based on the structure of vicilins predicts a histidine cluster in the active site of oxalate oxidase. J Mol Evol 1998, 46:488-493.
30. Nagahama K, Yoshino K, Matsuoka M, Sato M, Tanase S, Ogawa T, Fukuda H: Ethylene production by strains of the plant-pathogenic bacterium Pseudomonas syringae depends upon the presence of indigenous plasmids carrying homologous genes for the ethylene-forming enzyme. Microbiology 1994, 140:2309-2313.
31. Expressed Sequence Tags Database [http://www.ncbi.nlm.nih.gov/blast/blast.cgi?Jform=0]
32. Unfinished Genomes Database [http://www.ncbi.nlm.nih.gov/Microb_blast/unfinishedgenome.html]
33. Guex N, Peitsch MC: SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 1997, 18:2714-2723.