
PERSPECTIVES

GENOMICS

Defining Genes in the Genomics Era

Michael Snyder and Mark Gerstein

M. Snyder is in the Department of Molecular, Cellular, and Developmental Biology, and both authors are in the Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.

A genome is defined as the entire collection of genes encoded by a particular organism. But what is a gene? Historically, the term gene, attributed to Johansson, first appeared in the early 1900s as an abstract concept to explain the hereditary basis of traits (1, 2). Phenotypic traits were ascribed to hereditary factors even though the physical basis of those factors was not known. Subsequently, early genetic studies by Morgan and others associated heritable traits with specific chromosomal regions. In the 1930s, Beadle introduced the concept of “one gene, one enzyme,” which later became “one gene, one polypeptide.”

With the advent of recombinant DNA and gene cloning, it became possible to combine the assignment of a gene to a specific segment of DNA and the production of a gene product. Although it was originally presumed that the final product was a protein, the discovery that RNA has structural, catalytic, and even regulatory properties made it evident that the end product could be a nucleic acid (3). Thus, we now define a gene in molecular terms as “a complete chromosomal segment responsible for making a functional product.” This definition has several logical components: the expression of a gene product, the requirement that it be functional, and the inclusion of both coding and regulatory regions. According to this definition, it should be possible to use straightforward criteria to identify genes in the DNA sequence of a genome. Five such criteria are in common use, but their application is not straightforward.

Open reading frames (ORFs). An ORF is a string of codons bounded by start and stop signals, where codons are nucleotide triplets encoding amino acids. An obvious way to find protein-coding genes is through identifying large ORFs in the genome. This is particularly applicable to prokaryotes and other organisms with few introns (the regions spliced out of RNA) in their genes. Even so, many genes are short and difficult to identify in this way. Moreover, organisms with genes that undergo an appreciable amount of RNA splicing often have small exons sandwiched between large introns, making ORFs especially difficult to find.

Sequence features. Once an ORF is identified, codon bias often is used to determine whether the ORF is a gene (4). The value of this measure stems from the fact that genes, particularly highly expressed genes, exhibit biased nonrandom use of codons. However, for many genes, the bias is weak, and small ORFs (or exons) contain too few codons to exhibit statistically significant bias. Beyond overall bias, one can also look for specific patterns in the DNA sequence such as splice sites to help locate genes (5). Computer programs that use DNA sequence features alone predict fewer than 50% of exons and 20% of complete genes (5). Moreover, while both the existence of an ORF and favorable sequence features may imply the presence of a gene product, they say nothing about that product’s function.

Sequence conservation. In contrast to focusing on an individual DNA sequence, genes can be identified by comparing multiple sequences among organisms (4, 5). DNA sequence conservation among species is an excellent method to gauge the importance of the gene product. However, conserved sequences could be nontranscribed regulatory elements. Another problem with using conservation to find genes is that it requires sequences of related organisms that are separated by appropriate evolutionary distances. A current estimate of the number of genes in an organism can never be an absolute, unchanging number, because it is contingent on the specific related organisms used for comparison.

Evidence for transcription. A non–sequence-based approach for identifying genes is to search for RNA or protein expression, the hallmark of a gene product. This is commonly accomplished using microarray hybridization, serial analysis of gene expression (SAGE), cDNA mapping, or sequencing of expressed sequence tags (6–8). Large-scale tagging of genes with transposons reveals many new regions in the yeast genome that are capable of producing proteins (9) (see the figure). Likewise for humans, hybridization of labeled cDNAs to microarrays containing sequences of entire chromosomes shows that sizable fractions of the chromosomes are stably expressed (10, 11). However, the function, if any, of many of these transcribed regions is not known. Conversely, there appear to be conserved ORFs that are not transcribed and whose RNA or protein products have not yet been identified (see the figure).

Gene inactivation. One method for ascertaining a gene’s function is to mutate or inactivate its product (12). This can be accomplished by direct gene disruption or RNA interference. However, many coding sequences make products whose inactivation does not result in an obvious phenotype. For instance, only one-sixth of yeast genes are essential, and mutations in the remainder usually do not cause an obvious phenotype as long as the yeast are grown in rich medium (13) (see the figure). Presumably, this reflects functional redundancy among gene products, assay sensitivity, or the failure to find conditions under which the product is useful. Thus, many, if not most, genes are difficult to identify solely by inactivation.

Beyond these five criteria, there are additional issues in gene identification: overlap, alternative splicing, and pseudogenes. There are now examples of overlapping reading frames of protein-coding genes, overlapping transcriptional units (for example, where the exon of one gene is encoded within the intron of another), and even overlapping protein-coding and RNA-coding genes (14, 15). In all cases of gene overlap, each gene has a unique functional sequence and thus is distinct.

What about products from alternatively spliced genes? In the human genome, more than half the genes have spliced isoforms, and this is likely to be an underestimate because not all variants have been identified (16, 17). Gene products from alternatively spliced messenger RNAs (mRNAs) have functionally unique and distinct sequences. A comprehensive system for describing such variants is lacking. Ultimately, it may be better to define a molecular parts list based on functional protein domains (the protein “domainome”) rather than whole genes.

The definition of a gene is also inextricably linked with the definition of a pseudogene (or dead gene) (18). Pseudogenes are similar in sequence to normal genes, but they usually contain obvious disablements such as frameshifts or stop codons in the middle of coding domains. This prevents them from producing a functional product or having a detectable effect on the organism’s phenotype. Pseudogenes occur in a wide variety of animals, fungi,

258 11 APRIL 2003 VOL 300 SCIENCE www.sciencemag.org


plants, and bacteria. They can be quite prevalent; for example, there are 80 ribosomal protein genes in the human genome, versus >2000 associated pseudogenes (19).

The boundary between living and dead genes is often not sharp. A pseudogene in one individual can be functional in a different isolate of the same species. For example, FLO8 is active in one strain of yeast but inactive in another (20), and so technically is a gene only in one strain. Moreover, pseudogenes can be transcribed (21). Conversely, there are other pseudogenes that have entire coding regions without obvious disablements but do not appear to be expressed, such as human ribosomal pseudogenes (19); presumably, they lack the regulatory elements required for transcription.

As a practical example of the current state of defining genes, consider the genome of the budding yeast Saccharomyces cerevisiae. This genome was one of the first to be sequenced, and it remains the best characterized in terms of functional genomics (which defines the functions of each gene product). Furthermore, its genes undergo only a small amount of splicing. Consequently, it is the organism for which we have the clearest grasp of which DNA sequences are genes. When the yeast genome was first sequenced, all ORFs longer than 100 codons were named, resulting in 6274 possible genes (22). This number has been considerably revised since then (see the figure). More small genes have been identified (9) either through new homologies found in databases or through evidence of transcription. In addition, 283 genes have been moved into the realm of “questionable ORFs” because they lack any evidence of transcription, function, or sequence conservation (23). Finally, a small number of pseudogenes have been found in the laboratory strain of S. cerevisiae, some of which may be functional in other yeast strains (22).

For yeast, the assignment of short ORFs has been particularly difficult. From the raw genome sequence, one can systematically define the universe of all possible (potentially overlapping) ORFs, what we call the “ORFome,” and then examine the evidence that each encodes a protein (see the figure). Overall, there are >100,000 possible ORFs that are longer than 15 codons. This number is constrained only slightly by codon bias, but it drops dramatically when evidence of transcription is included. However, each transcription experiment does not provide information about every possible gene in a genome. Thus, one obtains the strongest signal when one combines multiple different sources of information. That is, the likelihood that a gene encodes a functional product is best weighed using multiple criteria.

[Figure. (Top) Time series of yeast gene-number estimates, with labels “6274 initial ’97” and “6128 current best” and a legend for Total MIPS ORF (’97–’03), Range of SGD (’97–’03), and Other; other plotted values include 6317, 5645, 5350, and 4800. The central column lists ORF classes: 1106 eORF; 2289 kORF but not eORF; 2060 hORF; 536 other ORF; 137 new shORF = 101 tORF + 36 hORF; 283 qORF; 221 dORF; further labels read “No evidence of transcription” and “190 of hORF”. (Bottom left) Bar chart of potential ORFs (axis 0 to 140,000) under progressive selection: ORFome, + <100aa, + CAI, + Transposon, + SAGE. (Bottom right) Overlaps among ORFs detected by Microarray, Transposon, and SAGE, with region counts 1212, 711, 2033, 1529, 188, 163, and 106.]

Genes, ORFs, and ‘omes. (Top) The initial published yeast genome claimed 6274 genes (22), but this has been revised many times since then. The time series data on numbers of genes are based on the SGD and MIPS databases: http://genome-www.stanford.edu/Saccharomyces and http://mips.gsf.de/proj/yeast/CYGD/db. These databases use different criteria for ORF inclusion: MIPS adds all candidate ORFs whereas SGD limits inclusion. Also shown are other estimates for the number of genes in the yeast genome (26–29). The central column shows the types of ORFs in the current yeast annotation. These include eORF (essential ORF) (13), kORF (known ORF with a well-characterized function), hORF (ORF validated by homology only), shORF (short ORF), tORF (transposon identified ORF), qORF (questionable ORF), and dORF (disabled ORF or pseudogene) (21). (The numbers are based on the ORF classes defined in the MIPS database.) Compared with the initial annotation, the current estimate of 6128 ORFs reflects two opposing trends: (i) the addition of new shORFs (9) found either through transcription experiments (tORFs) or from sequence comparisons with proteins newly deposited in the databases (hORFs); (ii) the removal of qORFs with no evidence of being transcribed (that is, lacking SAGE or transposon tags, and not expressed on microarrays) and with no sequence similarity to any other protein. (For simplicity, we include in the qORFs 10 ORFs associated with Ty elements in the original annotation. Further information is at http://bioinfo.mbb.yale.edu/genome/yeast/orfome.) (Bottom left) The explosion in defining shORFs. The first bar depicts the potential ORFs in the raw DNA sequence of the yeast genome that are >15 codons. The second bar shows the large number that are also <100 codons in length. The third bar demonstrates that the number of ORFs is not reduced by requiring a high codon adaptation index (CAI > 0.11). The remaining bars illustrate how the number of potential ORFs is radically reduced by selecting only those shORFs that show evidence of transcription (transposons and SAGE). (Bottom right) Functional genomics information is best used in a combined fashion. Illustrated is the number of ORFs in the yeast genome that are transcribed according to data from microarray hybridization, SAGE, and transposon tagging.

The yeast genome is, of course, far simpler than the human genome, and we expect many of the problems evident in yeast to be greatly magnified in human. First, we expect the human genome to contain a vast number of potential ORFs given the small size of exons (average size ~140 base pairs) and the complexity of mRNA splicing (16, 19). It is doubtful that we will be able to find true genes among these ORFs solely by analyzing their raw nucleotide sequences. In fact, initial estimates of the number of genes in the human genome ranged from 20,000 to >100,000 (17, 23–25).

One solution for annotating genes in sequenced genomes may be to return to the original definition of a gene, a sequence encoding a functional product, and use functional genomics to identify them. Moreover, if we add conservation information obtained from cross-genome comparisons, we can streamline the process. Ultimately, we believe that identification of genes based solely on the human genome sequence, while possible in principle, will not be practical in the foreseeable future. Only through large-scale systematic functional genomics experiments and through careful sequence comparisons against related organisms will we be able to convincingly arrive at a definitive annotation of the human genome.

References and Notes
1. M. Morange, The Misunderstood Gene (Harvard Univ. Press, Cambridge, MA, 2001).
2. R. Falk, Stud. Hist. Philos. Sci. 17, 133 (1986).
3. S. Eddy, Cell 109, 137 (2002).
4. C. Burge, S. Karlin, Curr. Opin. Struct. Biol. 8, 346 (1998).
5. M. Zhang, Nature Rev. Genet. 3, 698 (2002).
6. C. Horak, M. Snyder, Funct. Integ. Genomics 2, 171 (2002).
7. P. Brown, D. Botstein, Nature Genet. 21, 33 (1999).
8. V. Velculescu et al., Cell 88, 243 (1997).
9. A. Kumar et al., Nature Biotechnol. 20, 58 (2002).
10. P. Kapranov et al., Science 296, 916 (2002).
11. J. Rinn et al., Genes Dev. 17, 529 (2003).
12. P. Coelho et al., Curr. Opin. Microbiol. 3, 309 (2000).
13. G. Giaever et al., Nature 418, 387 (2002).
14. P. Coelho et al., Genes Dev. 16, 2755 (2002).
15. K. T. Tycowski et al., Nature 379, 464 (1996).
16. B. Modrek, C. Lee, Nature Genet. 30, 13 (2002).
17. E. Lander et al., Nature 409, 860 (2001).
18. P. Harrison, M. Gerstein, J. Mol. Biol. 318, 1155 (2002).
19. Z. Zhang et al., Genome Res. 12, 1466 (2002).
20. H. Liu et al., Genetics 144, 967 (1996).
21. P. Harrison et al., J. Mol. Biol. 316, 409 (2002).
22. H. Mewes et al., Nature 387 (suppl.), 7 (1997).
23. P. Harrison et al., Nucleic Acids Res. 30, 1083 (2002).
24. J. Venter et al., Science 291, 1304 (2001).
25. M. Das et al., Genomics 77, 71 (2001).
26. M. Kowalczuk et al., Yeast 15, 1031 (1999).
27. P. Mackiewicz et al., Yeast 19, 619 (2002).
28. C. Zhang, J. Wang, Nucleic Acids Res. 28, 2804 (2000).
29. G. Blandin et al., FEBS Lett. 487, 31 (2000).
30. We thank A. Kumar, M. Vidal, S. Karlin, C. Burge, P. Harrison, Z. Zhang, M. Zhang, W. Summers, M. Cherry, R. Lifton, M. Muensterkoetter, M. Seringhaus, and A. Sali for helpful comments.

PLANETARY SCIENCE

A Liquid Core for Mars?

Veronique Dehant

The author is at the Observatoire Royal de Belgique, Bruxelles, B-1180 Belgium. E-mail: veronique.dehant@oma.be

Enhanced online at www.sciencemag.org/cgi/content/full/300/5617/260

Mars is a planet very similar to Earth. Early in their evolution, both planets must have been sufficiently hot to be molten. Earth still has a liquid core, but the smaller size of Mars would favor faster cooling. Extrapolation from Earth suggests that Mars today should therefore not have a liquid core. However, small differences in elemental composition between the two planets prevent our simply extrapolating from knowledge of Earth’s properties (1). On page 299 of this issue, Yoder et al. (2) present evidence that the iron core of Mars is liquid, with important implications for martian geology.

There are a few constraints on Mars’ deep interior based on analysis of martian meteorites (3, 4), observation of the absence of a global magnetic field (5), and knowledge of the planet’s mass and moments of inertia (6). Moments of inertia quantify the global mass repartition within Mars. They provide evidence for the existence of a denser martian core and can be used to constrain the core dimension (7). However, the uncertainty of the core’s density and dimension remains large because they depend on the temperature profile and light element abundance, and these properties are still unknown.

Scientists interested in modeling the martian interior are therefore looking for other kinds of complementary data. As for Earth, the Sun’s gravitational attraction induces global phenomena on Mars, namely, tides and precession-nutation (the motion of the rotation axis in space). Tides are deformations induced by the gravitational pull of the Sun. They are related to surface displacements, surface gravity changes (such as those that would be measured by a gravimeter on the martian surface), and mass repartitioning inside the planet. These changes are periodic, with periods related to Mars’s orbit around the Sun (and, to a minor extent, to the orbits of the two martian moons, Phobos and Deimos, around Mars).

To study these phenomena, long-term observations (for example, of the annual or semiannual periods) are needed. Surface gravity data, surface displacements, and nutations cannot yet be observed because their measurement requires a network of geophysical stations on the martian surface (8). But some information can be obtained from a Mars orbiter such as Mars Global Surveyor (MGS), which is (in addition to the classical steady-state self-gravity of the planet) subject to gravitational forces resulting from the mass redistributions induced by the tides. Hence, information on the planet’s response to the tidal force may be deduced from the precise reconstruction of the MGS orbit. Because this response depends on the internal structure of Mars, it is possible to infer properties of the core.

[Figure: The physical state of the martian core observed by MGS orbit tracking.]

The mass repartitioning induced by the tides is usually described by a set of dimensionless numbers called “Love numbers,” which express the nonrigidity of the planet. The value of the k-Love number (the Love number relevant for the perturbation of the orbit) will be much larger if the core is liquid than if it is solid (liquid versus solid core values change by ~50%) (9). Observational constraints on this k-Love number would allow the physical state of the core to be determined.

The long time series of Mars Global Surveyor DSN (Deep Space Network) tracking data provides such constraints. Smith et al. (10) have used these data to deduce the k-Love number directly from the position of the spacecraft orbiting Mars. However, the main term of the gravitational potential was unfortunately not very accurate.

Yoder et al. now use another indirect observation of the gravitational effect in-



Defining Genes in the Genomics Era
Michael Snyder and Mark Gerstein
Science 300, 258 (2003);
DOI: 10.1126/science.1084354


Downloaded from www.sciencemag.org on January 14, 2015



The following resources related to this article are available online at


www.sciencemag.org (this information is current as of January 14, 2015 ):

Updated information and services, including high-resolution figures, can be found in the online
version of this article at:
http://www.sciencemag.org/content/300/5617/258.full.html
A list of selected additional articles on the Science Web sites related to this article can be
found at:
http://www.sciencemag.org/content/300/5617/258.full.html#related
This article cites 28 articles, 7 of which can be accessed free:
http://www.sciencemag.org/content/300/5617/258.full.html#ref-list-1
This article has been cited by 56 article(s) on the ISI Web of Science
This article has been cited by 16 articles hosted by HighWire Press; see:
http://www.sciencemag.org/content/300/5617/258.full.html#related-urls
This article appears in the following subject collections:
Genetics
http://www.sciencemag.org/cgi/collection/genetics

Science (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the
American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Copyright
2003 by the American Association for the Advancement of Science; all rights reserved. The title Science is a
registered trademark of AAAS.
