Complementary DNA Sequencing: Expressed Sequence Tags and Human
Genome Project
Mark D, Adams; Jenny M. Kelley; Jeannine D. Gocayne; Mark Dubnick; Mihael H.
Polymeropoulos; Hong Xiao; Carl R. Merril; Andrew Wu; Bjorn Olde; Ruben F.
Moreno; Anthony R. Kerlavage; W. Richard MeCombie; J. Craig Venter
Science, New Series, Vol. 252, No, 5013 (Jun. 21, 1991), 1651-1656.
Stable URL:
hhup/finks jstor.orgh
'sici=0036-8075%2819910621%293%3A252%3A5013403C 165 1%3ACDSEST%3E2.0,CO%3B2-6
Science is currently published by American Association for the Advancement of Science,
‘Your use of the ISTOR archive indicates your acceptance of JSTOR’s Terms and Conditions of Use, available at
hhup:/www.jstororg/about/terms.hml. JSTOR’s Terms and Conditions of Use provides, in part, that unless you
have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and
you may use content in the JSTOR archive only for your personal, non-commercial use.
Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at
hup:/www jstor-org/journals/aaas.html,
Each copy of any
printed page of su
wt of a JSTOR transmission must contain the same copyright notice that appears on the sereen or
transmission,
JSTOR is an independent not-for-profit organization dedicated to ercating and preserving a digital archive of
scholarly journals. For more information regarding JSTOR, please contact support @ jstor.org.
hupulwww jstor.org/
Tue Nov 8 15:10:30 200887, ER, Fewon oa, Sco 247, 49 (1990),
AR. KW. Rin al shi 281, 1866 (1991),
89, §Srnavn, 2 Zo K.P, W. Bate, E. H. Chang, Nate 348,747
(0990)
90, 8. Kwok an J.J. Sina, PR Tel: Pres and Appts fr DNA
Anplftaion Ti Bech, 8. (Sttom, New York, 1989), pp 255-244,
91. Tf White, Mad). H.Pering, Ci: Chon rea, D. Wiliams
tnd 8. Kwok, in Lavatory Digi of Vin! Ife, He Lemme, El
(Debher, New Yk, 191) p. 187-173
PS. Wik, D- A Masger Be Higck, Bote 10, 56 (191),
Mike, J Mibalovet Rc iguch .S Wabi, Enh unpulced data
J Bllera J sp. he.168, 115 (1989); B. Boole el AJ. Ham. ee
48, 157 99),
PRS
98. RC Alln,G. Graves, B. Budo, Bivens 7,736 (1989); K. Kase,
J. Foes 8 35, 196 (1950).
96.8. Scart unpubl da
97. R elo ea Am J Ham. Govt 47,518 (199).
98. M, Stocking, D Hedgecock, R. Higych Vigan, HA. ech ii. 4370
(1)
99, We thnk oe collegues who conte the development and apn of
PGR. The space conuoin of thie rose and the many pubes on PCR
Prevents comprehensive rey of ances a pple we pg any
{farses wae fn x br ge, Weare pa
IRSukt 8 Schr. Higch, and Abramon for allowing te tr
Spubloied work Rexel review; and K- Leena pepsin of
oem
Complementary
DNA Sequencing: Expressed
Sequence Tags and Human Genome Project
Mark D. ApaMs, JENNY M. KELLEY, JEANNINE D. GocaYNE, Mark ‘DUBNICK,
Mirvac H. PoryMERoPouLos, HONG XIAO, CARL R. MERRIL, ANDREW WU,
ByoRN OLDE, RUBEN F. MorENO, ANTHONY R. KERLAVAGE,
W. RicHarp McComptg, J. CRAIG VENTER*
Automated partial DNA sequencing was conducted on
more than 600 randomly selected human brain comple-
mentary DNA (cDNA) clones to generate expressed se-
quence tags (ESTs). ESTs have applications in the discov-
«ery of new human genes, mapping of the human genome,
and “Hentfeation of coding regions in_ genomic se
ces. OF the sequences generated, 337 new
a a at eae cee
sm other organisms, such as a yeast RNA polymerase II
subunit; Drosophila kinesin, Notch, and Enhancer of spit;
and a murine tyrosine kinase . Forty-six ESTS
ee ea
polymerase chain reaction. This fast approach to cDNA
characterization will facilitate the tagging of most human
genes in a few years at a fraction of the cost of complete
genomic sequencing, provide new genetic markers, and
serve as a resource in diverse biological research fields.
t0 100,000 genes, up to 30,000 of which may be expressed,
in the brain (1). However, GenBank lists the sequence of
only a few thousand human genes and <200 human brain messen-
ger RNAs (mRNAs) (2). Once dedicated human chromosome
T HE HUMAN GENOME IS ESTIMATED TO CONSIST OF 50,000
MD. duns, JM. Kelley, J.D. Gray, M, Dab, A.W, B, Oe,
Moreno, AR: Kesnage W. Ie MeComb, and JC. Vener re the Sco
‘Resp’ Binchemy and Molccuar log, Nason Inne of Newologa
Daordes and Sue, Nason Lotus of Heh, Bethesda, MD 20892, NH
ogmergn HX aad CR Mera in he Ito Bashi
Gener Nol Tne of Mena Heath, Neurscenee Cero Se Elzabes
Hosp, Washington DO 50032
“Fo whom corerpondenc shoud be addres
a
2170NE 1991
sequencing begins in § yeats, itis expected that 12 to 15 years will
bbe required to complete the sequence of the genome (3). It is
therefore likely that the majority of human genes will remain
tunknown for at least the next decade. The merits of sequencing
DNA, reverse transcribed from mRNA, as a part of the human,
{genome project have been vigorously debated since the idea of
determining the complete nucleotide sequence of humans first
surfaced. Proponents of cDNA sequencing have argued that because
the coding sequences of genes represent the vast majority of the
information content of the genome, but only 3% of the DNA,
‘eDNA sequencing should take precedence over genomic sequencing
(4). Proponents of genomic sequencing have argued the difficulty of
finding every mRNA expressed in al issues, cll types, and devel-
‘opmental stages and have pointed out that much valuable informa-
tion from intronic and intergenic regions, including control and
regulatory sequences, will be missed by cDNA sequencing (3).
However, many genome enthusiasts have incorrectly stated that
‘gene coding regions, and therefore mRNA sequences, ae readily
predictable from genomic sequences and have concluded that there
is no need for large-scale CDNA sequencing. In fact, prediction of
‘transcribed regions of human genomic sequence is currently feasible
‘only for relatively large exons (6).
‘On the basis of our high output with automated DNA sequence
analysis of 96 templates per day and consideration of the above
issues, we initiated a pilot project to test the use of partial CDNA.
sequences (ESTS) in a comprehensive survey of expressed genes.
Sequence-tagged sites (STSs) are becoming standard markers for
‘the physical mapping of the human genome (7). These short
sequences from physically mapped clones represent uniquely iden
tified map positions, ESTs can serve the same purpose a the random
‘genomic DNA STS¢ and provide the additional feature of pointing
directly to an expressed gene. An EST is simply a segment of a
sequence from a cDNA clone that corresponds to an mRNA. ESTs
longer than 150 bp were found to be the most useful for similarity
searches and mapping.
ARTICLES 1661Libraries of Complementary DNAs
(Of the estimated 30,000 genes expressed in the human brain, as
‘many a5 20,000 may encode low-abundance, brain-pecific tran-
scripts (1), The fat that up to one-fourth of al genetic diseases affect
neurological functions is an indication of the diversity and impor-
tance of genes expressed in the brain (8)
‘An assumption in our choice of cDNA libraries was that random
primed and partial cDNA clones would be more informative in
‘identifying genes and constructing a useful EST database than
sequencing from the ends of full-length cDNAs (which contain 5°
and 3° untranslated sequences) would be. By obtaining coding
Sequences, we hoped t0 take advantage of more sensitive peptide
comparisons, in addition to nucleotide sequence comparisons. To
discover the inherent limitations to be overcome in a large-sale
cDNA sequencing project, we wanted to examine the diversity of
representative cDNA libraries, identify desirable and undesirable
characteristics of the libraries, and determine the information
content and accuracy of single-run sequencing from both coding,
and flanking regions. Single-run sequencing involves performing a
single sequence reaction, rather than relying on multiple, red:
dant reactions from each strand. We chose three commercial
human brain cDNA libraries made from mRNA isolated from the
hippocampus and temporal cortex ofa 2-year-old female and from
a feral brain (9).
‘Single-ran DNA sequence data were obtained from 609 random-
ly chosen cDNA clones (Table 1). Double-stranded cDNA clones in
the pBluescript vector (Stratagene) were sequenced by a cycle
sequencing protocol (10) with dye-labeled primers and 373A DNA
Sequencers (Applied Biosystems). The average length of usable
sequence was 397 bases with a standard deviation of 99 bases.
‘Subtracive hybridization has been used by researchers to reduce
the population of highly represented sequences in a ¢DNA library
(11, 12) by selectively removing sequences shared by another library
We tested. subtractive hybridization as a way of enhancing the
‘number of brain-specific clones in the hippocampus library by
hybridizing the hippocampus library with a WI38 human hung,
fibroblast cell line cDNA library and removing the common se
quences (Table 1) (12, 13).
Table 1. Competition of feo) das) a)
pa ma Ser 2 ito) 7 {eo $ fisy Bo)
No dunn mach 1 cas) 4 9) 20 (29) 8 03)
Bijele ee ss Gan) 2 (07 0 “Gy GH
Nome 1 a3} 3 83 ao % (3
vst scuevee, vou 282