
article

C. elegans ORFeome version 1.1: experimental verification of the genome annotation and resource for proteome-scale protein expression
© 2003 Nature Publishing Group http://www.nature.com/naturegenetics

Jérôme Reboul1,11*, Philippe Vaglio1*, Jean-François Rual1,2*, Philippe Lamesch1,2*, Monica Martinez1,
Christopher M. Armstrong1, Siming Li1, Laurent Jacotot1, Nicolas Bertin1, Rekins Janky1, Troy Moore3,11, James
R. Hudson Jr.3,11, James L. Hartley4,11, Michael A. Brasch4,11, Jean Vandenhaute2, Simon Boulton1,11, Gregory A.
Endress5, Sarah Jenna6, Eric Chevet6, Vasilis Papasotiropoulos7, Peter P. Tolias7, Jason Ptacek8, Mike Snyder8,
Raymond Huang9, Mark R. Chance9, Hongmei Lee10, Lynn Doucette-Stamm10,11, David E. Hill1 & Marc Vidal1
*These authors contributed equally to this work

Published online 7 April 2003; doi:10.1038/ng1140

To verify the genome annotation and to create a resource to functionally characterize the proteome, we attempted
to Gateway-clone all predicted protein-encoding open reading frames (ORFs), or the ORFeome, of Caenorhabditis
elegans. We successfully cloned approximately 12,000 ORFs (ORFeome 1.1), of which roughly 4,000 correspond to
genes that are untouched by any cDNA or expressed-sequence tag (EST). More than 50% of predicted genes needed
corrections in their intron-exon structures. Notably, approximately 11,000 C. elegans proteins can now be expressed
under many conditions and characterized using various high-throughput strategies, including large-scale interactome
mapping. We suggest that similar ORFeome projects will be valuable for other organisms, including humans.

Introduction

The availability of complete or nearly complete genome sequences for a few model organisms and for humans1 constitutes a substantial step in the development of comprehensive2 or systems3 approaches to biology. But many challenges still lie ahead. First, all genes (both protein-coding and non-coding) will have to be identified and experimentally verified. Recent evidence suggests that this challenge might be more cumbersome than originally anticipated. For example, expression studies by tiling arrays4 illustrate how important it will be to experimentally verify the original human gene predictions5. Likewise, the precise intron-exon structure of many C. elegans predicted genes needs correction6. Finally, both comparative genomics7 and genome-wide functional analyses8 show that the Saccharomyces cerevisiae genome, despite its low content of introns, also needs annotation improvements. Another important challenge in the development of systems approaches is to define strategies to express nearly all predicted proteins of a proteome, under different conditions and in various hosts, to allow the development of diverse large-scale functional genomic and proteomic approaches2. For example, proteome chips9,10 and reverse transfection strategies11 require large numbers of protein-encoding ORFs to be cloned precisely into different expression vectors. Although well developed for unicellular organisms containing few, if any, introns12, such strategies are poorly defined for metazoans.

So far, the challenges of experimentally finding genes expressed in multicellular organisms, defining their exact intron-exon structures and expressing their encoded proteins have been addressed mainly by large-scale sequencing of random cDNAs, or ESTs, in the context of various transcriptome projects. Particularly, thousands of full-length cDNAs are now available for human, mouse and Arabidopsis thaliana transcripts, providing experimental evidence for roughly 50% of the genes predicted from the genome sequence annotation of these organisms13–15. Though extremely helpful, full-length cDNA

1Dana-Farber Cancer Institute and Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA. 2Unité de Recherche en Biologie
Moléculaire, Facultés Universitaires Notre-Dame de la Paix, Namur, 5000, Belgium. 3Research Genetics /Invitrogen, Huntsville, Alabama, USA. 4Life
Technologies /Invitrogen, Rockville, Maryland, USA. 5Protedyne Corporation, Windsor, Connecticut 06095, USA. 6Department of Surgery, McGill
University, Montreal, Canada. 7Center for Applied Genomics, Public Health Research Institute, Newark, New Jersey 07103, USA. 8Yale University, New
Haven, Connecticut 06520, USA. 9Center for Synchrotron Biosciences and Department of Physiology & Biophysics, Albert Einstein College of Medicine,
Bronx, New York 10461, USA. 10Genome Therapeutics, Waltham, Massachusetts 02453, USA. 11Present addresses: INSERM Unité 119, Institut Paoli
Calmette, 13009 Marseille, France (J.R.); Open Biosystems, Huntsville, Alabama 35806, USA (T.M.); Cityscapes, Huntsville, Alabama 35801, USA (J.R.H.);
SAIC/National Cancer Institute, Frederick, Maryland 21702, USA (J.L.H.); Atto Bioscience, Rockville, Maryland 20850, USA (M.A.B.); Cancer Research
UK, Clare Hall, Herts EN6 3LD, UK (S.B.); Agencourt Biosciences Corporation, Beverly, Massachusetts 01915, USA (L.D.). Correspondence should be
addressed to M.V. (e-mail: marc_vidal@dfci.harvard.edu).

nature genetics volume 34 may 2003 35


projects are limited for the following reasons. First, random cDNA picking limits the detection of genes expressed at relatively low levels. For example, the proportion of C. elegans predicted genes that have been experimentally verified by ESTs is about 60% (ref. 16), even though experimental evidence strongly suggests that most of these predictions correspond to genuinely expressed genes6. Second, randomly picked cDNA clones can rarely be used directly for protein expression, because their 5′ end is not cloned in the appropriate reading frame, their 3′ end is not compatible with the expression of C-terminal fusion proteins or both. Finally, most cDNAs identified in transcriptome projects are not available in vectors that allow the protein-encoding sequences to be transferred to a variety of expression vectors by automated, high-throughput methods. Whereas bioinformatic analysis of genome annotations is affected by only the first limitation, the development of proteome-wide strategies has been hampered by all three.

Results

Genome-wide ORFeome cloning

To simultaneously address these three problems, we designed an alternative strategy referred to as Gateway-based ORFeome cloning6,17–19 (Fig. 1a). Briefly, predicted protein-encoding ORFs are amplified by PCR precisely between the initiation and termination codons, using a cDNA library as template and specific primers tailed with Gateway recombinational cloning sites at the 5′ end (referred to here as Start and Stop primer pairs). The resulting PCR products are recombined unidirectionally into a Donor vector to create Entry clones. Because the Gateway reaction is reversible, Entry-cloned ORFs can subsequently be transferred efficiently by recombination into any Gateway-compatible protein-expression, or Destination, vector of interest. ORF sequence tags (OSTs) are also obtained for both the 5′ and 3′ end of the Entry clones, providing experimental evidence for the existence and intron-exon structure of the corresponding genes.

Version 1.1 of the C. elegans ORFeome

We first implemented this strategy at a genome scale to experimentally verify the genome annotation of C. elegans because of the relative simplicity of its genome (small introns and short intergenic sequences) and the high quality of its genome sequence20. The C. elegans genome sequence was originally published with a relatively low error rate of 1 nucleotide per 30 kb20 and is now the only complete sequence available for a multicellular organism16. At the start of the C. elegans ORFeome project (August 1999), the following ORF sequences were available: 839 ORFs previously submitted to GenBank by the worm scientific community (community ORFs), 1,340 ORFs defined experimentally by the transcriptome project through overlapping ESTs6 (transcriptome ORFs) and 17,298 ORFs predicted by the

[Fig. 1 panels. a, workflow: GenBank, transcriptome and Wormpep ORFs; WorfDB; 19,477 primer pairs; PCR on cDNA library; recombinational cloning into the Donor vector (ccdB); E. coli transformation; plasmid DNA miniprep (Entry clone); sequencing (OSTs); phred score (QC); alignment on genome sequence (Acembly); WorfDB; ORF identification; ORFeome 1.1 (~12,000 Entry clones). b, gel image. c, 20,800 predicted ORFs: 11,984 ORFs cloned; 12,376 ORFs touched by ESTs; 7,619 EST and OST confirmed; 4,365 OST confirmed; 4,757 EST confirmed; 4,059 unconfirmed.]

Fig. 1 Gateway cloning of the C. elegans ORFeome 1.1. a, Overall scheme for Gateway cloning. ccdB, toxic marker19; yellow, attB1 and attB2; green, attP1 and
attP2; blue, attR1 and attR2; red, attL1 and attL2. b, Electrophoretic analysis of PCR products. The sizes of products from 19,477 PCR reactions that were
attempted were analyzed on 218 ethidium bromide–stained gels6. Each gel picture is available at WorfDB23. c, Experimental verification of predicted ORFs.
WormBase release WS84 contains 20,800 predicted ORFs (of which 1,324 correspond to alternative splice forms). 12,376 have been verified by ESTs (yellow and
green); 11,984 have been verified by OSTs (red and green); 7,619 have been verified by both ESTs and OSTs (green) and 4,365 and 4,757 ORFs have been verified
exclusively by OSTs (red) or ESTs (yellow), respectively.
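
The counts quoted in the Fig. 1c legend can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are illustrative and not taken from the paper, and the numbers are those given in the legend.

```python
# Counts quoted in the Fig. 1c legend (WormBase release WS84).
predicted = 20_800   # predicted ORFs
est_only = 4_757     # verified exclusively by ESTs
ost_only = 4_365     # verified exclusively by OSTs
both = 7_619         # verified by both ESTs and OSTs

est_total = est_only + both              # ORFs touched by ESTs
ost_total = ost_only + both              # ORFs cloned and verified by OSTs
confirmed = est_only + ost_only + both   # ORFs with any experimental support
unconfirmed = predicted - confirmed      # ORFs with neither EST nor OST

print(est_total, ost_total, confirmed, unconfirmed)
```

Under these inputs the totals agree with the 12,376 EST-verified and 11,984 OST-verified ORFs stated in the legend.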

C. elegans Sequencing Consortium20 using GeneFinder (WormBase version WS9). PCR reactions were done with 19,477 pairs of Gateway-tailed ORF-specific primers using a highly representative cDNA library generated from all stages of C. elegans hermaphrodite development and from males and dauer forms17 (Fig. 1b). After cloning the resulting PCR products into the Gateway Donor vector pDONR201, we subjected pools of approximately 50–1,000 transformants for each ORF to plasmid DNA purification and OST analysis to verify both gene identity and the presence of at least one splicing event. Such an ORF pooling strategy should maintain a diversity of splice variants in each Entry clone (that is, between the Start and Stop primers). In addition, it facilitated the throughput needed to complete the project. We then re-arrayed all 11,984 ORFs that were successfully cloned and sequenced to generate version 1.1 of the C. elegans ORFeome. This version is available to the research community through MRC geneservice and Open Biosystems (see URLs).

Genome-wide verification of gene existence

The first outcome of the C. elegans ORFeome 1.1 is to provide experimental evidence for 4,365 predicted ORFs that had not yet been identified by any EST in the transcriptome project (untouched ORFs). This increases the number of identified C. elegans transcripts by 35%, raising the total number of C. elegans genes experimentally confirmed to 16,741 (4,365 untouched ORFs cloned here plus 12,376 ORFs verified by ESTs (touched ORFs), as indicated in the August 2002 release of WormBase (WS84); Fig. 1c). This finding might relate to the genome annotation of other species and the issue of number of genes in general. Indeed, unlike for Drosophila melanogaster and humans, C. elegans gene predictions did not rely primarily on the existence of ESTs or orthologies. Thus, we propose that a substantial proportion of as yet unpredicted and untouched genes could also be present in other organisms. Consistent with this idea, a recently described protein trapping method in D. melanogaster supported the notion that 44% of its genes have not yet been predicted21. In addition, although the total number of human ESTs (roughly 3 × 10⁶) is higher than that of C. elegans ESTs (roughly 3 × 10⁵) and consequently the worm EST coverage might seem low at first glance, the actual transcriptome coverage was estimated to be only modestly higher for humans (roughly 50%) than for C. elegans5. Thus, ORFeome projects might be useful to experimentally verify and clone additional ORFs in other organisms, provided that improved gene-finding algorithms are developed.

Whereas ORF cloning success is evidence for the existence of the corresponding genes, cloning failures need to be interpreted cautiously. First, the cloning strategy described above is limited by a false negative rate of approximately 7%. This is derived from the observation that 93% of all ORFs annotated as highly reliable, that is, the community and transcriptome ORFs, could be amplified and cloned using our high-throughput assay methods. The second possible explanation for cloning failures is intron-exon structure mispredictions (for example, wrongly predicted exons can lead to the design of a primer pair that does not amplify a product even though the gene is expressed). This possibility is supported by two observations: (i) 29% of ORFs already touched by ESTs, and thus previously identified experimentally, did not clone in our assay and (ii) for a large proportion of unsuccessfully cloned ORFs, pairs of internal primers could amplify a product whereas the Start and Stop primer pair could not6.

Though our assay should not be used to rule out the existence of any particular predicted gene, trends of overall ORF cloning efficiency (OCE) identified interesting features of C. elegans global chromosomal organization. First, the OCE of chromosomes V and X was lower than that of the other four chromosomes (Fig. 2a). Additionally, the OCE, though homogeneous along chromosomes I, III, IV and X, was slightly biased against one extremity of chromosome II and both extremities of chromosome V (Fig. 2b). Because these three chromosomal regions are heavily populated with ORFs predicted to encode G protein–coupled receptors (GPCRs), we compared the OCE of various functional classes (Fig. 2c). Compared to ORFs that encode potential transcription factors, phosphatases, kinases and others, ORFs predicted to encode GPCRs were indeed cloned at a substantially lower rate. Together with other observations22, this suggests that a large proportion of ORFs predicted to encode

[Fig. 2 panels. a and b, OCE per chromosome and along each chromosome: I, n = 2,884; II, n = 3,551; III, n = 2,508; IV, n = 3,012; V, n = 4,794; X, n = 2,719. c, bar chart (0% to 100%) for the categories tf, pho, kin, GPCR, hyd, oxy and lgic. Legend: touched ORFs cloned; untouched ORFs cloned; touched ORFs not cloned; untouched ORFs not cloned.]

Fig. 2 Genome-wide verification of gene existence. a, Global representation of the OCE per chromosome. Roman numerals refer to chromosome numbers and n
represents the number of predicted ORFs for each chromosome. The color code for each ORF class is as indicated for b. b, OCE distribution along the chromosomes for
each of the six chromosomes. The blue bars on the right of each chromosome represent the density of GPCRs. c, OCE across seven predicted functional categories. The
number of predicted ORFs for each Gene Ontology category is as follows: transcription factors (tf), GO:0004930, n = 457; phosphatases (pho), GO:0016302, n = 211;
kinases (kin), GO:0016301, n = 542; G protein–coupled receptors (GPCR), GO:0004930, n = 986; hydrolases (hyd), GO:0016787, n = 1,220; oxidoreductases (oxy),
GO:0016491, n = 295; extracellular ligand-gated ion channels (lgic), GO:0005230, n = 87. The color code for each ORF class is as indicated for b.
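
The ORF cloning efficiency (OCE) compared across chromosomes and functional classes is simply the fraction of attempted ORFs that were cloned within each group. The sketch below illustrates that calculation in Python; the function name `cloning_efficiency` and the toy records are hypothetical and are not data from the paper.

```python
from collections import defaultdict

def cloning_efficiency(orfs):
    """Return {group: fraction of ORFs cloned} for (group, was_cloned) records."""
    attempted = defaultdict(int)
    cloned = defaultdict(int)
    for group, was_cloned in orfs:
        attempted[group] += 1
        if was_cloned:
            cloned[group] += 1
    return {group: cloned[group] / attempted[group] for group in attempted}

# Hypothetical toy records for illustration only, NOT results from the paper.
toy_records = [
    ("kin", True), ("kin", True), ("kin", False),
    ("GPCR", True), ("GPCR", False), ("GPCR", False), ("GPCR", False),
]
oce = cloning_efficiency(toy_records)
for group, value in oce.items():
    print(f"{group}: {value:.0%}")
```

The same grouping key could be a chromosome name or a position bin along a chromosome, which is how a per-region bias such as the one described for chromosomes II and V would show up.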

GPCRs might correspond to pseudogenes. Alternatively, the GPCR mRNAs might be expressed at a level that is under the detection threshold of our assay, or their GeneFinder predictions might be particularly problematic.

Genome-wide verification of intron-exon structure

In addition to identifying genes in the genome sequence, genome annotation tools such as GeneFinder also predict intron-exon structures. We used the OST analysis of the ORFeome 1.1 to verify such predictions for C. elegans at a genome-wide scale. In total, we used 10.7 Mb of OST nucleotide sequence to compare the structure of cloned ORFs with that of predicted ORFs. Overall, 3,439 (29%) cloned ORFs had a structure that differed from that of the GeneFinder predictions (Fig. 3a,b). Notably, the predicted intron-exon structure of more than 1,500 untouched ORFs could be corrected. Exons were removed or added for 608 or 479 ORFs, respectively, or were extended or shortened for 1,008 or 1,046 ORFs, respectively. Introns were removed or added for 684 or 505 ORFs, respectively. Although such modifications did not change the global level of orthologies between the C. elegans predicted proteome and that of other organisms, we expect that they will be useful for numerous proteomic approaches, such as protein identification using peptide-fingerprinting mass-spectrometry techniques.

When OSTs and GeneFinder ORF predictions are divergent, the OSTs might correspond to one particular splice variant whereas the predicted ORF might represent an alternative one expressed at a level under the detection threshold of our PCR reactions. Thus, our OST analysis does not necessarily disprove any particular predicted ORF splice variant. The analysis also identified reading-frame differences between predicted and observed ORFs for 1,361 of the 3,439 corrected ORFs. In such cases, although GeneFinder had mispredicted either the ATG or the stop codon, we were still able to amplify and clone part of the corresponding ORFs. To disseminate the ORF corrections, OSTs and their interpretation have been made available on WormBase16 and WorfDB23 (Fig. 3b).

The global OCE described above, together with the ORF structure corrections, allowed us to estimate the overall quality of gene predictions in C. elegans. Of 19,477 predicted genes, we successfully cloned and sequenced 11,984 (61.5%), among which 8,545 (43%) had an intron-exon structure matching the predictions. If we assume, as stated above, that most cloning failures occurred because of intron-exon structure mispredictions, more than 50% of predicted C. elegans genes would need corrections (Fig. 3c). The completeness and quality of the C. elegans genome sequence are considered relatively high compared with that of other organisms1. Hence, our work strongly suggests that similar genome-wide verification projects are urgently needed for these organisms.
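
The estimate that more than 50% of predicted genes would need corrections follows from the counts given in the text. A minimal Python sketch of that arithmetic is shown here; it assumes, as the text does, that every cloning failure reflects a structure misprediction, and the variable names are illustrative.

```python
# Counts taken from the text.
attempted = 19_477       # predicted genes for which cloning was attempted
cloned = 11_984          # successfully cloned and sequenced
perfect_match = 8_545    # cloned ORFs whose structure matched the prediction

corrected = cloned - perfect_match   # cloned ORFs whose structure differed
not_cloned = attempted - cloned      # cloning failures

def pct(n, total=attempted):
    """Percentage of n relative to all attempted ORFs."""
    return 100 * n / total

# Assumption from the text: cloning failures are counted as mispredictions.
needing_correction = pct(corrected + not_cloned)

print(f"cloned: {pct(cloned):.1f}%")
print(f"corrected: {corrected}, not cloned: {not_cloned}")
print(f"needing correction under this assumption: {needing_correction:.1f}%")
```

Because the 7% false negative rate mentioned earlier is ignored here, the resulting figure is best read as an upper-bound style estimate under the stated assumption.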

[Fig. 3 panels. a, comparison of predicted structure and OST-derived structure:

Event                Number of events   Percentage of events   Number of ORFs   Percentage of ORFs
exon unaltered       29,049             83.9%                  8,545            71.3%
exon extended        1,138              3.3%                   1,008            8.4%
exon shortened       1,223              3.5%                   1,046            8.7%
additional intron    608                1.8%                   505              4.2%
intron not found     884                2.5%                   684              5.7%
additional exon      568                1.6%                   479              4%
exon not found       1,167              3.4%                   608              5.1%

b, WorfDB graphical display. c, OSTs perfect match, 44% (8,545); no OSTs, 38% (7,493); OSTs corrected, 18% (3,439). d, alignment of sequencing reads against the genome; vertical scale from 0 to 3,500 nucleotides, predicted exons numbered 1 to 9.]

Fig. 3 Genome-wide verification of intron-exon structure. a, Differences in the structure of cloned ORFs versus that predicted in WormBase release WS9. The
exons observed by OST can be identical to the predicted exons (exon unaltered) or of different length (exon extended or exon shortened). There can also be
additional introns inserted into predicted exons (additional intron) or missing introns merging two predicted exons into one (intron not found). Finally, OSTs can
identify exons that were not predicted (additional exon) or suggest that predicted exons do not exist (exon not found). The number and percentage of events as
well as the number and percentage of ORFs affected by each event is shown. b, Example of the graphical display available in WorfDB showing the structure of an
ORF (C10E2.5) derived from OSTs compared with the current prediction (WS84). c, Summary of gene prediction quality in C. elegans. d, OST analysis of isolated
Entry clones showing splicing variants. The panel displays 36 sequencing reads corresponding to 11 singly isolated Entry clones (black arrows on the right)
aligned against the genome sequence using Acembly and compared with three predicted splice variants of a single gene (W07B3.2). The blue boxes correspond
to GeneFinder exon predictions, numbered 1–9, whereas the connecting blue lines indicate predicted introns (confirmed by ESTs when highlighted in green). The
yellow bar and the vertical scale (left) represent the C. elegans genome with nucleotide positions starting at 0 on the putative ATG codon. The reads are
combined into observed exons (pink boxes) and introns (connecting pink lines), and the deduced structures of each alternative splice form are represented in green.
The blue lines connecting the green and pink boxes represent introns that do not satisfy the gt–ag or gc–ag rule.
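
The splice-site rule mentioned at the end of this legend (introns beginning with gt or gc and ending with ag) can be expressed as a short check. This is a sketch only: the function name `obeys_splice_rule` and the example sequences are hypothetical, and it is not the procedure used by Acembly or by the authors.

```python
def obeys_splice_rule(intron: str) -> bool:
    """True if the intron starts with gt or gc and ends with ag (case-insensitive)."""
    seq = intron.lower()
    return len(seq) >= 4 and seq[:2] in ("gt", "gc") and seq[-2:] == "ag"

# Hypothetical intron sequences for illustration only.
examples = ["gtaagtttcag", "gcaagtttcag", "ataagtttcac"]
flags = [obeys_splice_rule(seq) for seq in examples]
print(flags)
```

An intron inferred from read alignments that fails this check would be the kind drawn with a blue connecting line in panel d.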

ORFeome 2.1 and alternative splicing

ORFeome cloning projects can also be applied to identifying internal alternative splicing events at a genome-wide scale. To assess the relative proportion of splice variants in the C. elegans ORFeome 1.1 ORF pools, we selected a set of 208 ORFs with sizes ranging from 300 bp to 4,250 bp, and for each pool, we isolated and sequenced up to 12 clones. This analysis showed that overall at least 10% of C. elegans genes are expressed in one or more internal alternative splice forms. We experimentally verified alternative exons predicted by GeneFinder and also identified new, previously unpredicted variants (Fig. 3d). An additional outcome of this part of the ORFeome project is the identification of wild-type singly isolated clones. A future version of the worm ORFeome (ORFeome 2.1) generated by isolating colonies from the ORFeome 1.1 Entry clones should help identify a wild-type isolate for all internal ORF splice-variants expressed in C. elegans.
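
To give a feel for why the 10% figure is described as a lower bound, one can ask how likely a splice variant is to be seen at all when only a handful of clones are sequenced from a pool. The sketch below is an illustration under assumptions that are not in the paper: clones are treated as independent draws, and the variant fraction in the pool is a hypothetical input.

```python
def detection_probability(variant_fraction: float, clones: int) -> float:
    """Chance of seeing a variant at least once among independently picked clones."""
    return 1 - (1 - variant_fraction) ** clones

# Hypothetical variant fraction of 10% in a pool, with 12 sequenced clones
# (12 is the maximum number of clones per pool mentioned in the text).
p = detection_probability(0.10, 12)
print(f"{p:.3f}")
```

Under this simple model, rarer variants are missed more often, which is consistent with the authors' statement that deeper isolation of colonies (ORFeome 2.1) should recover more splice forms.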

[Fig. 4 panels. a, two-hybrid network diagram with ORF names as node labels; legend: untouched ORF, touched ORF, cDNA library interaction, AD-ORFeome library interaction, cDNA and AD-ORFeome library interaction. b, bar chart of the percentage of total interactions found for touched and untouched ORFs with the cDNA library and the AD-ORFeome library. c, protein expression from Entry clones with His6, MBP, GST and GST–His6 tags (hosts: E. coli and S. cerevisiae; read-outs: MALDI, SDS–PAGE, SDS–PAGE/MS/MS, western blot and protein chip).]

Fig. 4 Functional analysis and expression of the C. elegans ORFeome 1.1. a, Interactome mapping. The panel shows two superimposed two-hybrid maps obtained from screening 116 baits against our worm AD-cDNA library18,25,26 (two-hybrid connections are represented by black lines) and the AD-ORFeome library (two-hybrid connections are represented by red lines). The complete list of interactions is available in Supplementary Table 1 online. Overlapping connections are shown in green. Proteins encoded by touched and untouched ORFs are represented by black and yellow closed circles, respectively. Many of the novel interactions detected in the AD-ORFeome screens identified potential links that were not identified by the AD-cDNA library screens. For example, three interactors (C53A5.6, Y71F9AL.10 and T24D1.3) found with UBC-13 (Y54G2A.31) are predicted to contain a RING finger domain, a structure frequently found in proteins involved in ubiquitination. These interactors may encode proteins that function as degradation-target specificity factors (similar to F-box proteins) specifically involved in the DNA-damage response. Another interesting example is the interaction of RAD-23 (ZK20.3) with ZK1240.2. RAD-23 is required for nucleotide excision repair and has a ubiquitin-associated domain. Again, ZK1240.2 is predicted to contain a RING-finger domain, which suggests that this gene product functions as an E3 ligase with RAD-23. b, Distribution of the touched (black) and untouched (yellow) ORFs identified as interactors using either the AD-cDNA or the AD-ORFeome library. c, Protein expression of the C. elegans ORFeome 1.1. Using a 96-well plate setting, ORFeome 1.1 ORFs were transferred from their Entry clone to different Destination vectors9,28, thereby creating fusions with one or two different tag-encoding sequences (His6, MBP or GST–His6). The resulting proteins were expressed using the species indicated as a host (the complete list of expressed proteins is available in Supplementary Table 2 online). For each tag, the identification techniques used were as follows. Matrix-assisted laser desorption ionization (MALDI) mass spectrometry was used for His6-fused proteins. SDS–PAGE was used for MBP-fused proteins. A sample of three proteins is shown. For each protein, lane 1 corresponds to uninduced conditions (pre-IPTG), lane 2 corresponds to induced conditions (post-IPTG), lane 3 corresponds to the insoluble fraction and lane 4 corresponds to the soluble fraction. Western-blot and protein-chip analysis using an antibody against GST were used for GST–His6. Thirteen proteins are shown (1, C04F12.3; 2, K12H4.1; 3, Y57G11C.9; 4, T05C12.7; 5, C14C11.6; 6, F11E6.1; 7, C06G3.6; 8, C07A12.4; 9, C14B1.1; 10, C05D11.11; 11, C27D9.1; 12, C05C8.7; 13, H14N18.1) and are indicated by an asterisk when expressed at the expected size. The protein-chip image shows an area of the chip with 41 spotted proteins (each spotted twice in triplicate). The yellow rectangle shows a positive control with GST alone at a concentration of 500 ng µl⁻¹. The green rectangle shows a negative control containing buffer only. In a low-throughput setting, six GST-fused GTPases were expressed and GST affinity-purified. SDS–PAGE detected all six GTPases at the appropriate molecular weight (shown by an asterisk) and their identity was verified by liquid chromatography coupled with quadrupole time-of-flight mass spectrometry (MS/MS).

ORFeome 1.1 and interactome mapping

In addition to providing large-scale experimental evidence for genome annotations, ORFeome projects are valuable for large-scale functional genomic approaches. For example, two-hybrid-based proteome-wide protein–protein interaction (interactome) mapping projects, though advanced in unicellular organisms24, could benefit from cloned ORFeome resources in metazoans. We tested this concept in the context of our C. elegans interactome project18,25,26. Using the Gateway technology, we transferred all cloned ORFs into a two-hybrid Destination vector downstream of the sequence encoding the activation domain (AD) and pooled all resulting clones. We then used this AD-ORFeome library to carry out two-hybrid screens against 116 bait proteins previously used with an AD-cDNA library18,25,26 (Fig. 4a). When compared with AD-cDNA screens, AD-ORFeome screens reached saturation, as defined by the percentage of interactors identified more than once in a given screen, with considerably fewer yeast transformants (2 × 10⁵ transformants for AD-ORFeome versus 2 × 10⁶ for AD-cDNA). Hence, the throughput of the worm interactome map can now be increased at least ten-fold. In addition, the AD-ORFeome screens detected relatively more untouched ORFs than the AD-cDNA screens (26% versus 7%; Fig. 4b). Finally, although the average number of potential interactors per bait is lower for AD-ORFeome screens (0.9) than for AD-cDNA screens (3.5; this can be explained by the fact that many two-hybrid interactors can only be detected in the context of partial domains), it is substantially higher than for previously described yeast AD-ORFeome screens (0.13–0.26; ref. 24). Thus, in addition to allowing a higher throughput, worm AD-ORFeome screens should lead to a reasonably comprehensive interactome map for C. elegans. Many of the novel interactions detected in the AD-ORFeome screens identified potential links that were not identified by the AD-cDNA library screens (Fig. 4a). Similar ORFeome libraries can also now be generated using additional Destination vectors to do other proteome-wide genetic or biochemical assays27.

ORFeome 1.1 and proteomics

ORFeome projects are also valuable for large-scale proteomic

From this analysis, we estimate that it is now possible to produce roughly 8,800 proteins (83% of 10,623 in-frame cloned ORFs) from the ORFeome 1.1 for biochemical genomic approaches such as proteome chips. Indeed, 77% of the GST–His6 fusion proteins spotted on a chip (as described in ref. 9) were detected using an antibody against GST. Our overall success is similar to that described previously for high-throughput protein production with or without Gateway vectors28,29, suggesting that the C. elegans ORFeome 1.1 is suitable for large-scale protein expression. In addition to their use in large-scale expression systems, we have distributed Entry-cloned ORFs to dozens of laboratories already for use in smaller scale one protein at a time approaches.

Discussion

In summary, the C. elegans ORFeome 1.1 suggests that genome annotation tools, such as GeneFinder, can be accurate in identifying potential genes. Indeed, most C. elegans genes originally predicted without evidence from cDNA or orthology are expressed and spliced. Nevertheless, a third of the predicted ORFs could not be cloned in our assay owing to GeneFinder mispredictions at the boundaries of the ORFs. Future versions of the C. elegans ORFeome could be generated using improved gene models by comparative analysis with other genome sequences (for example, between the C. elegans (WormBase) and Caenorhabditis briggsae genome sequences) or by experimental corrections of ORF extremities (for example, by using splice-leader sequences as anchors for 5′ end PCR reactions). Altogether, this information should allow the design of new Start and Stop primer pairs for future versions of the C. elegans ORFeome. In addition, approximately one third of GeneFinder-predicted ORFs that were successfully cloned here needed correction of intron-exon structure. We propose that many proteomic applications will benefit from these corrections, particularly in organisms for which the current genome sequence is of poorer quality. Similar ORFeome projects in those organisms should be complementary to current transcriptome projects to verify the existence of all genes, identify their intron-exon structure and splicing variants and point to features of chromosomal organiza-
approaches. For example, comprehensive use of the protein chip tion, assuming that improved genome annotations become avail-
technology has been limited so far to the yeast proteome9, mostly able. Finally, in contrast to current transcriptome projects, the
because of a limited availability of similar ORFeome resources for resulting Entry clone resources should be immediately useful for
metazoan organisms. Hence, a crucial aspect of the C. elegans high-throughput expression and functional characterization of
ORFeome 1.1 is the versatility with which ORFs can now be trans- the proteome of many organisms in many different settings.
ferred to many different Destination vectors to be expressed for
various functional, biochemical and structural genomic analy- Methods
ses18,19. To investigate to what extent the C. elegans ORFeome 1.1 Gateway cloning of the C. elegans ORFeome 1.1. We retrieved known or
can be used for high-throughput protein expression in different predicted ORF sequences from GenBank, the Transcriptome project or
formats, we used subsets of ORFs found as potential interactors of Wormpep (WormBase version WS9), identified overlaps and introduced
baits involved in the DNA-damage response26. Using an auto- the resulting 19,477 distinct ORFs into WorfDB23. We designed 19,477
primer pairs using OSP30 and used them to amplify by PCR each ORF
mated 96-well plate setting, we transferred these ORFs from the individually from a cDNA library17 as described6,18. We designed Start
Entry vector to various Destination vectors and expressed them in primers without the ATG codon (that is, the first 5 specific nucleotide used
Escherichia coli and in S. cerevisiae as fusion proteins (Fig. 4c). We in the primers is the T of ATG) to preclude internal translation initiation
first expressed 68 proteins in E. coli as N-terminal fusions to the events in the production of N-terminal fusion proteins. We designed Stop
maltose binding protein (MBP) and examined protein expression primers without the last two bases of the termination codon so that C-ter-
by SDSPAGE. In 41% of the cases, we observed a band of the minal fusion proteins could be produced from the ORFeome resource. To
appropriate size. We next expressed the same proteins in E. coli as facilitate the PCR product size analysis and to more conveniently adjust the
N-terminal fusions to a hexa-histidine tag (His6) and tested them PCR reaction elongation times, we organized samples in order of increas-
by mass spectrometry. In 50% of the cases, we detected proteins of ing size of the predicted ORFs6. We cloned the resulting PCR products into
the Entry vector pDONR201 by Gateway recombinational cloning tech-
the appropriate size. Finally, we expressed 79 proteins in yeast cells
nology18,19 and archived them as both bacterial glycerol stocks and plasmid
as N-terminal fusions to tandem glutathione S-transferase (GST) DNA mini-preps. We used 96-well plates and liquid handling systems
protein and His6 and tested them by western-blot analysis using (robotics methods and protocols will be described elsewhere).
an antibody against GST. In 61% of the cases, we observed a band During the first phase of the project (including approximately 7,000
of the appropriate size. Overall, 83% of 58 ORFs that had been ORFs), we carried out the antibiotic selection step by spotting transformants
transferred in these three Destination vectors were expressed in at on solid media 96 ORFs at a time. This facilitated visual estimation of cloning
least one setting. efficiency. In general, cloning efficiency was inversely proportional to ORF

size. Using high-efficiency chemically competent cells (>10^8 colonies per µg DNA in a 96-well format), the range of transformant numbers varied between 1,000 colonies per ORF for small ORFs (up to 500 bp) and 50 colonies per ORF for larger ORFs (>3 kb). To increase our throughput during the second phase of the project, we carried out the antibiotic selection using liquid selections in a 96-well format. After allowing overnight growth in selective medium, we did PCR on the pools of transformants to verify insert sizes. The resulting PCR product was purified and used for sequencing.

Sequencing and bioinformatic analysis. We sequenced the ORFs at both their 5′ and 3′ ends, leading to two OSTs6. The OSTs were quality-controlled with a phred score of at least 20 over a minimum of 200 bases. The ORFs for which one or both OSTs did not meet this criterion and for which PCR products were detectable were resequenced once. A small number of ORFs that had not been successfully sequenced after these two cycles were not included in the ORFeome analysis. We aligned OSTs on the C. elegans genome sequence using Acembly. The combined sequence information from 5′ and 3′ OSTs covered the ORF sequence fully or partially for 6,335 or 5,649 genes, respectively (73% of all worm predicted ORFs are less than 1.5 kb in length). The OST analysis identified no splicing event for 683 ORFs. For 385 of these, GeneFinder had not predicted any splicing. For 298 of them, GeneFinder had predicted one or more splicing events; however, either the OSTs did not extend all the way to the splicing site (135 ORFs) or the splicing event had been wrongly predicted (163 ORFs).

Of 11,984 cloned ORFs, 3,439 (29%) showed structure differences compared to the GeneFinder prediction. We expect that more differences will be found as internal OSTs from those ORFs longer than 1,200 bp become available. As expected, the rate of ORF sequence correction was lower for touched ORFs (25%) than for untouched ORFs (35%). This difference reflects the fact that ESTs were used in consecutive versions of WormBase to refine initial GeneFinder intron-exon structure predictions.

The OST analysis also identified reading-frame differences between predicted and observed ORFs for 1,361 of the 3,439 corrected ORFs. Such mispredictions are due to cases in which wrongly predicted ATG or stop codons are actually located in either the 5′ or the 3′ UTR or in the coding sequence but in a frame different from that predicted originally. The corresponding Entry clones, referred to as "out-of-frame" in WorfDB, cannot be used for protein expression but are useful for transfer to RNAi vectors or microarray analyses. As a result, the total number of in-frame cloned ORFs in ORFeome 1.1 that are useful for protein-based studies is 10,623 (11,984 minus 1,361). The complete list of successfully cloned and sequenced ORFs is available on WorfDB23.

When analyzing OSTs of isolated clones, we observed several clones that had one intron sequence that did not satisfy the gt-ag or gc-ag rule. With no evidence that these splicing events are naturally occurring, we did not count them as alternative splice forms.

Identification of wild-type singly isolated clones. We analyzed 1.6 Mb of sequence from isolated Entry clones and observed a misincorporation rate of 1 in 1,232 bp. Based on this, we estimate that 3-15 isolates will be needed for ORFs between 500 bp and 2 kb (covering 85% of all predicted ORFs in C. elegans) to uncover at least one wild-type clone with 95% confidence. Practically, in a sample of 457 pools of Entry clones (ORF sizes ranging from 100 to 800 bp), we isolated and sequenced two colonies for each ORF. We found at least one wild-type clone among the two Entry clones for 401 ORFs (87%). We expect that ORFs longer than 2 kb will require the development of alternative strategies. For example, the use of isolated full-length cDNAs as template or improved PCR polymerases should be considered.

Yeast two-hybrid screens. We carried out all yeast two-hybrid screens essentially as described18.

URLs. WormBase, http://www.wormbase.org; MRC geneservice, http://www.hgmp.mrc.ac.uk/geneservice/; Open Biosystems, http://www.openbiosystems.com/; WorfDB, http://worfdb.dfci.harvard.edu; C. briggsae genome sequences, http://www.sanger.ac.uk/Projects/C_briggsae/; Acembly, ftp://ftp.ncbi.nlm.nih.gov/repository/acedb/ACEMBLY/index.html.

Note: Supplementary information is available on the Nature Genetics website.

Acknowledgments
We thank the C. elegans Sequencing Consortium for the genome sequence; the participants of the annual ORFeome meeting for their input and numerous suggestions; the members of M.V.'s laboratory for their input and help; C. McCowan for administrative assistance; B. Sobhian, A.-S. Nicot, N. Tzellas and the GenomeVision Service sequencing staff at Genome Therapeutics for technical assistance; and P. Braun for the protein expression plasmids. This work was supported by grants from the National Cancer Institute, the National Human Genome Research Institute, the National Institute of General Medical Sciences and the Merck Genome Research Institute awarded to M.V.

Competing interests statement
The authors declare competing financial interests. Details accompany the paper on the Nature Genetics website (http://www.nature.com/naturegenetics).

Received 3 January; accepted 14 March 2003.

1. Mardis, E., McPherson, J., Martienssen, R., Wilson, R.K. & McCombie, W.R. What is finished, and why does it matter. Genome Res. 12, 669-671 (2002).
2. Vidal, M. A biological atlas of functional maps. Cell 104, 333-339 (2001).
3. Ideker, T., Galitski, T. & Hood, L. A new approach to decoding life: systems biology. Annu. Rev. Genomics Hum. Genet. 2, 343-372 (2001).
4. Kapranov, P. et al. Large-scale transcriptional activity in human chromosomes 21 and 22. Science 296, 916-919 (2002).
5. The International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).
6. Reboul, J. et al. Open-reading-frame sequence tags (OSTs) support the existence of at least 17,300 genes in C. elegans. Nat. Genet. 27, 332-336 (2001).
7. Blandin, G. et al. Genomic exploration of the hemiascomycetous yeasts: 4. The genome of Saccharomyces cerevisiae revisited. FEBS Lett. 487, 31-36 (2000).
8. Oshiro, G. et al. Parallel identification of new genes in Saccharomyces cerevisiae. Genome Res. 12, 1210-1220 (2002).
9. Zhu, H. et al. Global analysis of protein activities using proteome chips. Science 293, 2101-2105 (2001).
10. MacBeath, G. & Schreiber, S.L. Printing proteins as microarrays for high-throughput function determination. Science 289, 1760-1763 (2000).
11. Ziauddin, J. & Sabatini, D.M. Microarrays of cells expressing defined cDNAs. Nature 411, 107-110 (2001).
12. Gera, J.F., Hazbun, T.R. & Fields, S. Array-based methods for identifying protein-protein and protein-nucleic acid interactions. Methods Enzymol. 350, 499-512 (2002).
13. Mammalian Gene Collection (MGC) Program Team. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc. Natl. Acad. Sci. USA 99, 16899-16903 (2002).
14. The FANTOM Consortium and the RIKEN Genome Exploration Research Group Phase I & II Team. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563-573 (2002).
15. Seki, M. et al. Functional annotation of a full-length Arabidopsis cDNA collection. Science 296, 141-145 (2002).
16. Stein, L., Sternberg, P., Durbin, R., Thierry-Mieg, J. & Spieth, J. WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 29, 82-86 (2001).
17. Walhout, A.J. et al. Gateway recombinational cloning: application to the cloning of large numbers of open reading frames or ORFeomes. Methods Enzymol. 328, 575-592 (2000).
18. Walhout, A.J. et al. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 287, 116-122 (2000).
19. Hartley, J.L., Temple, G.F. & Brasch, M.A. DNA cloning using in vitro site-specific recombination. Genome Res. 10, 1788-1795 (2000).
20. The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012-2018 (1998).
21. Morin, X., Daneman, R., Zavortink, M. & Chia, W. A protein trap strategy to detect GFP-tagged proteins expressed from their endogenous loci in Drosophila. Proc. Natl. Acad. Sci. USA 98, 15050-15055 (2001).
22. Harrison, P.M., Echols, N. & Gerstein, M.B. Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome. Nucleic Acids Res. 29, 818-830 (2001).
23. Vaglio, P. et al. WorfDB: the C. elegans ORFeome Database. Nucleic Acids Res. 31, 237-240 (2003).
24. Hazbun, T.R. & Fields, S. Networking proteins in yeast. Proc. Natl. Acad. Sci. USA 98, 4277-4278 (2001).
25. Davy, A. et al. A protein-protein interaction map of the Caenorhabditis elegans 26S proteasome. EMBO Rep. 2, 821-828 (2001).
26. Boulton, S.J. et al. Combined functional genomic maps of the C. elegans DNA damage response. Science 295, 127-131 (2002).
27. Kinoshita, N., Minshull, J. & Kirschner, M.W. The identification of two novel ligands of the FGF receptor by a yeast screening method and their activity in Xenopus development. Cell 83, 621-630 (1995).
28. Braun, P. et al. Proteome-scale purification of human proteins from bacteria. Proc. Natl. Acad. Sci. USA 99, 2654-2659 (2002).
29. Hammarstrom, M., Hellgren, N., van Den Berg, S., Berglund, H. & Hard, T. Rapid screening for improved solubility of small human proteins produced as fusion proteins in Escherichia coli. Protein Sci. 11, 313-321 (2002).
30. Hillier, L. & Green, P. OSP: a computer program for choosing PCR and DNA sequencing primers. PCR Methods Appl. 1, 124-128 (1991).
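
The isolate estimate given in the Methods under "Identification of wild-type singly isolated clones" can be checked with a short calculation. The sketch below is illustrative only and is not the authors' procedure: it assumes that misincorporations occur independently at each base at the reported rate of 1 in 1,232 bp, and that isolates are independent. The function names `p_wild_type` and `isolates_needed` are hypothetical.

```python
import math

# Misincorporation rate reported in the Methods (1 error per 1,232 bp).
ERROR_RATE_PER_BP = 1 / 1232


def p_wild_type(orf_length_bp: int, error_rate: float = ERROR_RATE_PER_BP) -> float:
    """Probability that one clone of this length carries no misincorporation.

    Assumption (not stated in the paper): each base is hit independently.
    """
    if orf_length_bp <= 0:
        raise ValueError("ORF length must be a positive number of base pairs")
    return (1.0 - error_rate) ** orf_length_bp


def isolates_needed(orf_length_bp: int, confidence: float = 0.95,
                    error_rate: float = ERROR_RATE_PER_BP) -> int:
    """Smallest number of independent isolates such that at least one is
    wild type with probability >= confidence."""
    p_mutant = 1.0 - p_wild_type(orf_length_bp, error_rate)
    # Solve p_mutant ** n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(p_mutant))


if __name__ == "__main__":
    for length in (500, 1000, 2000):
        print(length, round(p_wild_type(length), 3), isolates_needed(length))
```

Under these assumptions the model gives 3 isolates for a 500-bp ORF and 14 for a 2-kb ORF, which is in line with the range quoted in the Methods; the small difference at the upper end presumably reflects details of the authors' estimate that are not described in the text.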

nature genetics volume 34 may 2003 41
