
ARTICLE

The Complete Genome Sequence of Escherichia coli K-12
Frederick R. Blattner,* Guy Plunkett III,* Craig A. Bloch, Nicole T. Perna, Valerie Burland,
Monica Riley, Julio Collado-Vides, Jeremy D. Glasner, Christopher K. Rode, George F. Mayhew,
Jason Gregor, Nelson Wayne Davis, Heather A. Kirkpatrick, Michael A. Goeden, Debra J. Rose,
Bob Mau, Ying Shao

The 4,639,221-base pair sequence of Escherichia coli K-12 is presented. Of 4288 protein-coding genes annotated, 38 percent have no attributed function. Comparison with five other sequenced microbes reveals ubiquitous as well as narrowly distributed gene families; many families of similar genes within E. coli are also evident. The largest family of paralogous proteins contains 80 ABC transporters. The genome as a whole is strikingly organized with respect to the local direction of replication; guanines, oligonucleotides possibly related to replication and recombination, and most genes are so oriented. The genome also contains insertion sequence (IS) elements, phage remnants, and many other patches of unusual composition indicating genome plasticity through horizontal transfer.

F. R. Blattner, G. Plunkett III, N. T. Perna, J. D. Glasner, G. F. Mayhew, J. Gregor, N. W. Davis, H. A. Kirkpatrick, M. A. Goeden, D. J. Rose, B. Mau, and Y. Shao are at the Laboratory of Genetics, University of Wisconsin–Madison, 445 Henry Mall, Madison, WI 53706, USA. C. A. Bloch and C. K. Rode are in the Department of Pediatrics, University of Michigan School of Medicine, 1150 West Medical Center Drive, Ann Arbor, MI 48105, USA. V. Burland is at FMC Bioproducts, 191 Thomaston Street, Rockland, ME 04841, USA. M. Riley is at Marine Biological Laboratories, Woods Hole, MA 02543, USA. J. Collado-Vides is at the Centro de Investigación sobre Fijación de Nitrógeno, Universidad Nacional Autónoma de México, Cuernavaca A.P. 565-A, Morelos 62100, México.

* To whom correspondence should be addressed. E-mail: ecoli@genetics.wisc.edu

Downloaded from http://science.sciencemag.org/ on February 9, 2020

Because of its extraordinary position as a preferred model in biochemical genetics, molecular biology, and biotechnology, E. coli K-12 was the earliest organism to be suggested as a candidate for whole genome sequencing (1, 2). The availability of the complete sequence of E. coli should stimulate further research toward a more complete understanding of this important experimental, medical, and industrial organism. Since the inception of the E. coli project, six other complete genomes have become publicly available (3). Genome sequences, especially those of well-studied experimental organisms, help to integrate a vast resource of biological knowledge and serve as a guide for further experimentation. Availability of the complete set of genes also enables global approaches to biological function in living cells (4) and has led to new ways of looking at the evolutionary history of bacteria (5).

Escherichia coli is an important component of the biosphere. It colonizes the lower gut of animals, and, as a facultative anaerobe, survives when released to the natural environment, allowing widespread dissemination to new hosts (6). Pathogenic E. coli strains are responsible for infections of the enteric, urinary, pulmonary, and nervous systems. We chose strain MG1655 as the representative to sequence because it has been maintained as a laboratory strain with minimal genetic manipulation, having only been cured of the temperate bacteriophage lambda and F plasmid by ultraviolet light and acridine orange, respectively (7). We now know that these treatments resulted in a frameshift mutation at the end of rph, causing low expression of the downstream gene pyrE and, in turn, a pyrimidine starvation phenotype (8). In addition, a mutation in ilvG disrupts one of the isoleucine-valine biosynthesis pathways in all K-12 isolates (9). Finally, almost all K-12 derivatives, including MG1655, carry the rfb-50 mutation, where an IS5 insertion results in the absence of O-antigen synthesis in the lipopolysaccharide (10). It will be interesting to compare strain MG1655 with the K-12 strain W3110, which has been carried through more experimental treatments and is being sequenced in Japan (11).

Sequencing Strategy

Sequencing was carried out in sections, with steadily improving technical approaches. The M13 Janus shotgun strategy proved to be the most efficient strategy for data collection and closure. It involved initial random sequencing at a four- to fivefold redundancy in the Janus vector (12), followed by computerized selection of templates to be resequenced from the opposite end, followed by limited primer walking.

The first 1.92 Mb (13, 14), positions 2,686,777 to 4,639,221 [in base pairs (bp)], was sequenced from our overlapping set of 15- to 20-kb MG1655 lambda clones (15) by means of radioactive chemistry and was deposited in GenBank between 1992 and 1995. Subsequently, we switched to dye-terminator fluorescence sequencing (Applied Biosystems). In addition to greater speed and lower cost, this new technology avoided electrophoretic compression artifacts, which, owing to its 50.8% G+C content, occur in practically every gene of E. coli. For the next segment (positions 2,475,719 to 2,690,160), we obtained DNA for sequencing by the popout plasmid approach (16), in which nonoverlapping segments were excised directly from the chromosome in circular form, gel-purified, and shotgunned for sequencing. The largest portion of the genome (positions 22,551 to 2,497,976) was sequenced from M13 Janus shotguns prepared from 11 I-Sce I fragments of ~250 kb (17). Among the many advantages of the I-Sce I method are the ability to select the size of fragment to be shotgunned, elimination of redundant sequencing at the borders between segments, and the reliability inherent in sequencing DNA without intermediate cloning steps. Because the DNA is never amplified, genes that might be deleterious when present in multicopy form are not subject to rearrangements or deletions. Each I-Sce I fragment shotgun contained 15 to 30% random clones from elsewhere in the genome, which apparently arose from randomly sheared genomic fragments comigrating in the pulsed-field gel.

The final stages entailed special attention to problem areas. The region between positions 0 and 22,551 did not yield a suitable I-Sce I fragment, so three lambda clones were selected to finally complete the genome. One of them was found to contain a deletion and had to be finished by shotgun sequencing of a long-range polymerase chain reaction (PCR) fragment (18). Other areas of the genome were also resequenced in this way. In total, long-range PCR (18) was used to close 36.9 kb of gaps, with amplimers used directly as sequencing templates or as source material for shotguns.
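The coordinate ranges quoted in this section can be checked for consistency with a few lines of code. The sketch below is only an illustration: the segment labels, the `SEGMENTS` list, and the helper `check_tiling` are hypothetical names introduced here, and only the coordinates themselves come from the text. It tests whether the four sequenced segments span the whole genome and reports how many base pairs each pair of neighbouring segments shares.

```python
# Coordinates are taken from the Sequencing Strategy text; the labels and
# helper name are illustrative assumptions, not the authors' terminology.
GENOME_LENGTH = 4_639_221

SEGMENTS = [
    ("lambda clones, final closure", 0, 22_551),
    ("I-Sce I fragments, M13 Janus shotguns", 22_551, 2_497_976),
    ("popout plasmids", 2_475_719, 2_690_160),
    ("lambda clones, radioactive chemistry", 2_686_777, 4_639_221),
]


def check_tiling(segments, genome_length):
    """Return (covered, overlaps).

    covered  -- True if the segments, sorted by start, span 0..genome_length
                with no gap between neighbours.
    overlaps -- bp shared between each segment and the region already covered.
    """
    ordered = sorted(segments, key=lambda seg: seg[1])
    covered = ordered[0][1] == 0
    reach = ordered[0][2]          # furthest coordinate covered so far
    overlaps = []
    for _, start, end in ordered[1:]:
        if start > reach:          # a gap between consecutive segments
            covered = False
        overlaps.append(max(0, reach - start))
        reach = max(reach, end)
    return covered and reach == genome_length, overlaps


if __name__ == "__main__":
    covered, overlaps = check_tiling(SEGMENTS, GENOME_LENGTH)
    print("covers genome:", covered)
    print("overlaps (bp):", overlaps)
```

Under these assumptions the segments tile the genome, and the two nonzero overlaps correspond to the junctions between the I-Sce I region and the popout plasmid region, and between the popout plasmid region and the lambda clone region.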

www.sciencemag.org • SCIENCE • VOL. 277 • 5 SEPTEMBER 1997 1453


The completed sequence was deposited in GenBank on 16 January 1997; in that sequence 168 ambiguity codes reflected uncertainties in the original determination. While this manuscript was in review, additional PCR sequencing was undertaken to resolve all of these ambiguous residues, and the affected annotations were updated accordingly.

Annotation

Annotation is an ongoing task whose goal is to make the genome sequence more useful by correlating it with other knowledge. Specifically, we attempted to (i) identify genes, operons, regulatory sites, mobile genetic elements, and repetitive sequences in the genome; (ii) assign or suggest functions where possible; and (iii) relate the E. coli sequence to other organisms, especially those for which complete genome sequences are available. Currently, the annotation includes 4288 actual and proposed protein-coding genes, and one-third of these genes are well characterized. Postulation of genes in uncharacterized base sequences was surprisingly difficult. They were selected from among the numerous available open reading frames (ORFs) on the basis of codon usage statistics, sequence searches versus SWISS-PROT release 34, Link's database of NH2-terminal peptide sequences from E. coli, computer prediction of signal peptides, upstream matches to the Shine-Dalgarno ribosome binding site, and other information including personal communications from colleagues (19). Assignment of NH2-termini posed special problems because most ORFs contain multiple in-frame start codons. In the absence of other information, we generally selected the ORF with the longest possible NH2-terminus. This method preserves the most coding information for analysis, but it may not reflect the situation in vivo.

Functions of previously known E. coli proteins were collected from the GenProtEC (20) and EcoCyc (21) databases. The function of new translated sequences was imputed from sequence similarity (22). Each gene (including stable RNA genes) in the sequence was assigned a unique numeric identifier beginning with a lowercase "b"; when no name has been assigned to a given gene, it is referred to by this number. A specific physiological role was assigned if most of the hits were for a specific function such as alcohol dehydrogenase, but if the substrates varied among the hits, the common denominator (for example, permease or kinase) was assigned to the ORF, substrate specificity unknown. If less specificity was found among the hits, a general function was assigned to an ORF when a majority of the hits were for one type of function, such as a permease or a class of enzymes. When the functions of the hit sequences were varied and there was no solid agreement even for type of function, or when only one sequence was hit, no function was assigned to the query ORF and it was counted among the unknowns.

The average distance between E. coli genes is 118 bp. The 70 intergenic regions larger than 600 bp were reevaluated for the presence of ORFs (Geneplot, DNASTAR Inc.) and searched against the entire GenBank database for DNA sequence (BLASTN) and protein coding (BLASTX) features (23). Closer inspection revealed that 15 of these regions contain previously unannotated ORFs, which in most cases were overlooked because of their small size. An additional 11 intergenic regions contain sequence features such as long untranslated leader sequences [for example, oppA messenger RNA (mRNA) extends ~500 bp upstream of the start codon (24)] or well-characterized control regions [for example, the araFGH operon control region (25)]. The remaining 44 large intergenic regions fall into three general classes: putative gene regulatory regions, large repetitive sequences, and unknowns.

Genes separated by more than 600 bp are likely to contain independent regulatory sequences. Twenty-nine large intergenic regions contain sequences suggestive of regulatory functions, including 21 with predicted regulatory protein binding sites. There are 13 regions between divergently transcribed ORFs, and 11 of these have at least one predicted promoter for each ORF (2 have only one predicted promoter). The 16 regions between ORFs transcribed in the same direction contain at least one predicted promoter for the downstream ORF, and several contain a terminator for the upstream ORF. Seven of the large intergenic regions, including the largest region overall (1730 bp), consist of repeated sequences such as REP or LDR, as described below. Seven intergenic regions larger than 600 bp have no predicted regulatory or coding functions. Five of these regions contain sequences that could encode proteins of at least 50 amino acids, although codon usage patterns for these ORFs suggest that they are not expressed. It is likely that these regions contain additional, as yet undiscovered, functions such as binding sites for additional regulatory proteins.

We searched for promoter and protein binding site sequences upstream of 2436 genes. This includes all genes except those that are less than 70 bp from the 3′ end of an adjacent gene transcribed in the same direction. We limited our search to the 250 bp upstream of the predicted translational start sites because E. coli promoters are typically found within this region. We also searched for other potential regulatory sites within the 400-bp segments upstream of genes. This search was based on an exhaustive collection of known functional sites for 56 transcriptional regulatory proteins. More detail about these methods is available elsewhere (26).

The codon adaptation index (CAI) was calculated for each ORF according to the method of Sharp and Li (27). The CAI measures the extent to which codon usage agrees with an E. coli reference set from highly expressed genes. CAI is a predictor of the extent of gene expression. This is attributed to correspondence with iso-accepting tRNA abundance of E. coli and optimal (intermediate) codon-anticodon interaction energy (28). Genes with exceptionally low CAI values may be recent horizontal transfers that still reflect the optimal codon usage or mutational spectrum of their previous host (29). We identified clusters of four or more adjacent genes with low CAI values (< 0.25) and also identified all genes in the lower 10th percentile of CAI observed in this genome.

The annotated sequence (accession number U00096) is available at the National Center for Biotechnology Information (NCBI) through the Entrez Genomes division, GenBank, and the BLAST databases. Our FTP site (ftp.genetics.wisc.edu) will maintain an updated version of the sequence as additional annotations or corrections are made; the version discussed here is M49.

Overview of the Sequence

The genome of E. coli, diagrammed in Fig. 1, consists of 4,639,221 bp of circular duplex DNA (30). Both base pair and minute scales are shown; base pair 1 was assigned in an apparently featureless region between genes lasT and thrL. Protein-coding genes account for 87.8% of the genome, 0.8% encodes stable RNAs, and 0.7% consists of noncoding repeats, leaving ~11% for regulatory and other functions. A radial plot shows E. coli's local similarity to sequenced bacteriophage genes. The polar coordinate plot of CAI is designed to highlight regions of the genome with unusual codon usage, which may signify recent immigration by horizontal transfer. Some gene clusters with low CAI values correspond to known cryptic prophages, and others point to possible locations of additional horizontally acquired elements.

The origin and terminus of replication divide the genome into oppositely replicated halves, which we term replichores. Replichore 1, which is replicated clockwise, has the presented strand of E. coli as its leading strand; in replichore 2 the complementary

strand is the leading one. Many features of E. coli are oriented with respect to replication. All seven ribosomal RNA (rRNA) operons, and 53 of 86 tRNA genes, are expressed in the direction of replication (Fig. 1). Approximately 55% of protein-coding genes are also aligned with the direction of replication, confirming an early observation of Brewer (31).

Compositional organization of the genome. Several authors (32, 33), in analyzing a variety of systems, have commented on base compositional asymmetries correlated with the direction of replication. For E. coli, the leading strands of both replichores have significantly (P < 0.001) greater abundance of G (26.22%) than its complementary partner C (24.58%) or the alternative pair A (24.52%) or T (24.69%). Lobry (33) plotted G-C skew for a 1.6-Mb section of E. coli surrounding the origin and summarized the data by codon position and gene direction. We extended this G-C skew analysis to the entire E. coli genome (Fig. 2), observing the same sharp transition at the terminus that he reported at the origin. These clear trends in base compositional skew apply to genes in both orientations, to intergenic regions, and to all codon positions (Table 1), supporting the idea advanced by Lobry, Perna, and Wu (32, 33) that leading and lagging strands are subject to differential mutation as the result of

[Figure 1. Circular genome map labeled "E. coli K-12 MG1655, 4,639,221 bp", with the Origin, Terminus, Replichore 1, and Replichore 2 marked. Outer scales: minutes (0 to 90 in steps of 10) and base pairs (1,000,000 to 4,000,000).]

Fig. 1. The overall structure of the E. coli genome. The origin and terminus of replication are shown as green lines, with blue arrows indicating replichores 1 and 2. A scale indicates the coordinates both in base pairs and in minutes (actually centisomes, or 100 equal intervals of the DNA). The distribution of genes is depicted on two outer rings: The orange boxes are genes located on the presented strand, and the yellow boxes are genes on the opposite strand. Red arrows show the location and direction of transcription of rRNA genes, and tRNA genes are shown as green arrows. The next circle illustrates the positions of REP sequences around the genome as radial tick marks. The central orange sunburst is a histogram of inverse CAI (1 - CAI), in which long yellow rays represent clusters of low (< 0.25) CAI. The CAI plot is enclosed by a ring indicating similarities between previously described bacteriophage proteins and the proteins encoded by the complete E. coli genome; the similarity is plotted as described in Fig. 3 for the complete genome comparisons.
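The two scales described in the caption are related by a simple proportion, since a centisome is one hundredth of the genome length. The following is a minimal sketch of that conversion and of the inverse CAI value plotted in the central histogram; the function names are hypothetical, and the 0.25 threshold is the low-CAI cutoff quoted in the text.

```python
GENOME_LENGTH = 4_639_221   # bp, as stated in the text
LOW_CAI_THRESHOLD = 0.25    # cutoff quoted for "low" CAI


def bp_to_centisome(position_bp, genome_length=GENOME_LENGTH):
    """Map a base-pair coordinate onto the 0-100 centisome scale."""
    if not 0 <= position_bp <= genome_length:
        raise ValueError("position lies outside the genome")
    return 100.0 * position_bp / genome_length


def centisome_to_bp(centisome, genome_length=GENOME_LENGTH):
    """Map a centisome value back to the nearest base-pair coordinate."""
    if not 0 <= centisome <= 100:
        raise ValueError("centisome must be between 0 and 100")
    return round(centisome / 100.0 * genome_length)


def inverse_cai(cai):
    """Return (1 - CAI, is_low): the plotted value and a low-CAI flag."""
    if not 0 <= cai <= 1:
        raise ValueError("CAI must be between 0 and 1")
    return 1.0 - cai, cai < LOW_CAI_THRESHOLD


if __name__ == "__main__":
    # Start coordinate of the flagellar gene cluster quoted in the article.
    print(bp_to_centisome(1_128_637))
    print(inverse_cai(0.2))
```

On this scale one centisome corresponds to about 46.4 kb, so the "minutes" ring is a coarse index rather than a measure of position at base-pair resolution.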



asymmetry inherent to the DNA replication mechanism. This, combined with natural selection, leads to an observed base distribution that depends in part on the mutational pattern and in part on selection. Hence, intergenic regions and third positions in E. coli are more skewed than first and second positions, and the net G-rich tendency of the leading strand relative to the lagging one is seen in both first and second codon positions, despite strong and sometimes opposite codon usage preferences at those positions.

[Figure 2. Linear plots along the genome. Tracks, top to bottom: G-C skew at 1st position, 2nd position, 3rd position, and all positions; leftward genes; intergenic; rightward genes; Chi; 8-mer; Rhs; REP; IRU; Box C; RSA; Ter; LDR; iap; IS1; IS2; IS3; IS4; IS5; IS150; IS186; IS30; IS600; IS911; Phage; EcoK. Horizontal axis: position (base pairs), 0 to 4,000,000, with the terminus and origin marked.]

Fig. 2. Base composition is not randomly distributed in the genome. G-C skew [(G - C)/(G + C)] is plotted as a 10-kb window average for one strand of the entire E. coli genome. Skew plots for the three codon positions are presented separately; leftward genes, rightward genes, and non–protein-coding regions are shown in lines 5, 6, and 7. The two horizontal lines below the skew plots show the distribution of two highly skewed octamer sequences, GCTGGTGG (Chi) and GCAGGGCG (8-mer). Tick marks indicate the position of each copy of a sequence in the complete genome and are vertically offset to indicate the strand containing the sequence. The next 18 horizontal lines correspond to distinct classes of repetitive elements. The penultimate line contains a histogram showing the similarity (the product of the percent of each protein in the pairwise alignment and the percent amino acid identity across the aligned region) of known phage proteins to the proteins encoded by the complete E. coli genome. The last line indicates the position and orientation of the EcoK restriction-modification site AACNNNNNNGTGC (N, any nucleotide). Two vertical lines through the plots show the location of the origin and terminus of replication.

Table 1. G-C skew for each of the three codon positions, calculated separately for the coding strand of 2357 forward genes (whose coding strand is the leading strand) and 1929 backward genes (whose coding strand is the lagging strand). The net skew attributable to replication direction is the difference between the values for the forward and the backward genes divided by 2.

Position   Forward genes   Backward genes   Average G-C skew   Net G-C skew attributable to replication direction
1              19.41            16.08             17.74              1.66
2              -9.34           -11.79            -10.57              1.22
3               7.99            -0.48              3.75              4.23
Average         6.02             1.27              3.64              2.37

Replication, recombination, and skew. We carried out further analyses of E. coli by constructing a reference sequence composed of the leading strands of each replichore concatenated at a novel joint, and we examined this sequence for oligonucleotide distribution. The most frequent oligomers in this leading strand (for example, octamers; Table 2) form a family containing the trimer CTG, often within the pentamer GCTGG, as also noticed by Karlin and co-workers (34). We note that the DnaG primase-binding site includes (or is) the sequence CTG, with T being the template for the first base of the RNA primer of Okazaki fragments (35). Although there is no direct proof implicating these sequences in discontinuous replication, their spacing is consistent with Okazaki fragment sizes and their distribution is skewed toward the leading strand, as expected. Although the skews are significant, the most frequent octamers on the leading strand are overrepresented on the lagging strand as well. Although leading strand replication is highly asymmetric in vitro, both leading and lagging strands are reported to replicate discontinuously in vivo (36). The high abundance of these proposed DnaG primase-binding sites on both strands supports a model in which both strands are replicated discontinuously. The associated skews imply that the leading strand has fewer sites for Okazaki initiation.

The recombinational hotspot Chi (37, 38), the third most abundant octamer of the leading strand, also contains the proposed DnaG primase-binding site. In fact, none of the frequent octamers differs from Chi by changes known to inactivate the recombinational activity of Chi (37, 38). Hence, it is possible that other members of the family may display Chi activity. As noted earlier (14), the Chi site is markedly skewed toward the leading strand. One must skip to the 251st most frequent octamer, GCAGGGCG, to locate a higher skew (57%) than that of the Chi site (50%) (Fig. 2, lines 8 and 9). Chi sites are implicated in RecBCD-mediated recombination (37), and as part of this process it is supposed that single-stranded DNA intermediates having Chi at the 3′ end are formed, which then invade the recipient chromosome to form a "D loop." This implies the existence of a Chi site on the displaced strand. If the CTG of Chi is also a primase binding site, Okazaki initiation at Chi could facilitate strand assimilation by branch migration. Kuzminov has recently proposed a role for Chi sites in the recombinational repair of collapsed replication forks, which may explain their extreme skew (38), but a secondary role as a primase-binding site may be sufficient to explain this bias.

Table 2. Frequent octamers and their skew. The 24 most frequent octamers are ranked by frequency of occurrence on the leading strand (octamers with the same frequency of occurrence are ordered alphabetically). Frequent octamers that are reverse complements of frequent octamers are identified by their rank (in parentheses) beside that of their complement. All primary sequences are aligned by the CTG trimer. The average spacing for nonoverlapping sequences from this list is on the order of 1.3 kb. The percent skew is 100 × (f - f′)/(f + f′), where f is the frequency of an octamer and f′ that of its reverse complement.

Rank         Octamer         Skew (%)   Count
1 (9)          cgCTGgcg        15.6      867
2 (16)       ggcgCTGg          19.6      826
3 (= Chi)       gCTGgtgg       50.8      761
4 (17)          gCTGgcgg       13.1      719
5 (11)         tgCTGgcg         9.4      719
6             gcgCTGgc         17.2      691
7           tggcgCTG           15.4      677
8 (24)          gCTGgcgc       12.6      659
10             cgCTGgtg        27.0      617
12               CTGgcggc      16.2      589
13               CTGgcgca      13.2      575
14              gCTGgcga        9.4      570
15                TGgcggcg     19.3      561
18             aaCTGgcg        12.5      543
19              gCTGgaag       11.0      538
20               CTGgcgcg      15.4      524
21            gcgCTGga         16.4      519
22               CTGgcgaa      14.9      515
23             tgCTGgtg        29.1      515

Rare tetramer CTAG. It is well known that the palindromic tetramer CTAG is extremely rare in E. coli, with an abundance 5% of that predicted from the base composition. Various explanations have been offered (39, 40). In Table 3 we have analyzed its distribution in various subsets of the genome. Clearly, the rarity of CTAG is most pronounced in protein-coding regions. Its occurrence is considerably higher in intergenic DNA, but it is surprisingly abundant in genes coding for structural RNAs, especially in that minuscule portion of the genome that codes for tRNAs. Danchin and co-workers (40) have hypothesized that CTAG may "kink" DNA and thereby interfere with function. It is also possible that some peculiar folding behavior of CUAG in RNA might interfere with mRNA function while having no negative effect on stable RNA species.

Table 3. Distribution of CTAG sequences.

Category of DNA                      CTAG count   Average spacing
All E. coli                               886          7161
Protein-coding sequence                   569          7159
TAG terminators                            67
REP sequences                               4          6144
All non–protein-coding sequences          317          1782
Regulatory regions                        251          1999
rRNA genes                                 46           697
tRNA genes                                 13           514
10Sa RNA (ssrA)                             2           233
RNase P M1 RNA (rnpB)                       1           377
Expected from base composition         18,101           256

Newly Proposed Genes and Previously Mapped Genes

Six new tRNA genes. In this study we discovered six new tRNA genes. Four of the genes, valZ, lysY, lysZ, and lysQ (positions 780,291 to 780,875), are part of the lysT operon and consist of a duplicate of valT and three duplicates of lysW. The other two genes form single-gene transcriptional units: asnW (positions 2,056,049 to 2,056,124) is a duplicate copy of asnT, and ileY (positions 2,783,782 to 2,783,857) is a near copy of ileX, differing in a single compensating base pair change in the aminoacyl stem of the tRNA (C6·G67 in ileX, A6·T67 in ileY).

An operon for degradation of aromatic compounds. Six E. coli enzymes are known to constitute a pathway for the degradation of aromatic compounds such as phenylpropionate, but only two of the genes have been previously identified, mhpB and mhpE (41). On the basis of similarity searches, we are confident that there is an operon that starts with the monooxygenase gene mhpA, followed by the known dioxygenase gene mhpB, the hydrolase gene mhpC, the hydratase gene mhpD, the dehydrogenase gene mhpF, and the known gene mhpE coding for 4-hydroxy-2-oxovalerate aldolase. All the genes (positions 367,835 to 373,095; b0347 to b0352) are in the same order as the enzymes of the pathway. We propose that the next gene upstream (positions 366,811 to 367,758;



b0346) may be the regulator for this path- at a slow rate (41). Further research will be tively, as well as near-equal length) to mviM
way, because this sequence is similar to a needed to determine whether this is a phys- and mviN, two Salmonella virulence factors
number of transcriptional regulators. iologically significant pathway, and if so, (43). Homologs of both mviM and mviN also
A second operon for degradation of aromat- under what conditions. have been identified in Haemophilus (3).
ic compounds. We have found a previously Flagellar operons nearly identical to those of Open reading frames and gene function
unrecognized set of E. coli genes (positions Salmonella. Escherichia coli has an array of 14 class assignments. Figure 3 is a detailed
2,667,052 to 2,671,269) that resemble flagellar synthesis genes (b1070 to b1083), graphical presentation of the genome show-
Pseudomonas genes for the degradation of only two of which have been previously ing the arrangement of putative and known
the aromatic compounds toluene, benzene, reported: f lgM and f lgL. One additional gene genes, operons, promoters, and protein
and biphenyl (42). The first three genes is involved with initiation of filament assem- binding sites. Of the 4288 ORFs annotated
(b2538 to b2540) encode the a and b sub- bly: f lgN, which precedes f lgM, a negative in the sequence, 1853 are previously de-
units and the ferredoxin component of the regulator of flagellin synthesis. In the region scribed genes. (A complete listing of E. coli
1,2-dioxygenase that opens the rings and between f lgM and f lgL, we identified ho- ORFs is available at www.genetics.wisc.edu/
oxidizes carbons 1 and 2. The gene encod- mologs of the Salmonella typhimurium f lgA and is likely to change as functional data
ing the last component of the dioxygenase, (basal-body P-ring formation), f lgB (putative accumulate.) The distribution of start
the ferredoxin reductase (b2542), is sepa- flagellar basal-body formation protein), f lgC codons is as follows: ATG, 3542; GTG,
rated from the first three genes by another (putative flagellar basal-body formation pro- 612; and T TG, 130. There is also one AT T
ORF (b2541). The product of this ORF tein), f lgD (basal-body rod modification pro- and possibly a CTG (44). The distribution
resembles the enzyme dihydro-1,2-diol de- tein), f lgE (f lagellar hook protein), f lgF of translation termination codons is as fol-
hydrogenase, which acts on the product of (putative f lagellar basal-body formation pro- lows: TAA, 2705; TGA, 1257; and TAG,
the dioxygenase to generate catechol. This tein), f lgG (f lagellar basal-body formation 326. We assigned 405 genes with the start

Downloaded from http://science.sciencemag.org/ on February 9, 2020


proposed operon is preceded by a divergent- protein), f lgH (f lagellar L-ring protein pre- codon overlapping the preceding stop, dis-
ly transcribed ORF (b2537) resembling a cursor), f lgI (f lagellar P-ring protein precur- tributed as follows: ATGA, 224; TAATG,
number of transcriptional regulators, which sor), f lg J (f lagellar protein), and f lgK (f lagel- 98; TGATG, 48; GTGA, 28; TAGTG, 4;
may be involved in the regulation of the lar hook-associated protein 1) genes. The and TTGA, 3. The most common overlap
genes. We do not know the substrate for gene arrangement of this cluster (positions in phage lambda is also ATGA (45).
this operon, or whether it has enzymes with 1,128,637 to 1,140,209) is identical to that The 4288 ORFs were searched for
sufficiently broad specificity to use several of the cluster at 26.5 centisomes on the matches to the Link database of peptides
related substrates. It is also not clear how Salmonella chromosome. In fact, the entire excised from two-dimensional gels (19).
catechol might be further metabolized. In f lagellar systems of E. coli and S. typhimurium These searches confirmed the expression of
Pseudomonas catechol is normally metabo- are essentially identical in most respects, 30 hypothetical ORFs. In addition to the
lized by either an ortho or meta pathway, with the current organization of genes pre- 194 Link sequences annotated in SWISS-
and E. coli has some very distant sequence dating the divergence of these two species PROT release 34, our searches identified
similarities to some of the meta pathway (43). Two additional genes (b1068 and nine NH2-terminal sequences correspond-
enzymes, especially to the penultimate step. b1069), preceding the f lg genes, show strong ing to dsbA, b2548, gcvT, glpQ, trpB, ydfG,
In addition, MhpB can metabolize catechol similarity (81% and 94% identity, respec- ygaG, ygiN, and yif E.
The longest ORF encodes a 2383-amino acid protein of unknown function, resembling several bacterial attaching and effacing proteins and invasins, virulence factors in pathogenic strains of E. coli and other enteric bacteria (46). The average ORF size is 317 amino acids; there are four ORFs in the range 1500 to 1700 amino acids, 51 in the range 1000 to 1500 amino acids, and 381 that are smaller than 100 amino acids. In general, it was difficult to assign small ORFs unless they exhibited typical E. coli codon usage or had been characterized biochemically (for example, leader peptides).

Two complementary catalogs were devised originally to classify functions of E. coli gene products, one for broad functions of the gene product (for example, enzyme, regulator, or transport protein) and another for specific physiological roles in the cell (47). A simplified composite system was devised to represent E. coli gene products ranging from precisely known to loosely attributed functions in Fig. 3. Table 4 summarizes the functional class assignments used to classify each ORF. Pending the location of the coding sequences for 383 known E. coli proteins that are not yet associated with ORFs, nearly 40% of the

Table 4. Distribution of E. coli proteins among 22 functional groups (simplified schema).

Functional class                                              Number  Percent of total
Regulatory function                                               45              1.05
Putative regulatory proteins                                     133              3.10
Cell structure                                                   182              4.24
Putative membrane proteins                                        13              0.30
Putative structural proteins                                      42              0.98
Phage, transposons, plasmids                                      87              2.03
Transport and binding proteins                                   281              6.55
Putative transport proteins                                      146              3.40
Energy metabolism                                                243              5.67
DNA replication, recombination, modification, and repair         115              2.68
Transcription, RNA synthesis, metabolism, and modification        55              1.28
Translation, posttranslational protein modification              182              4.24
Cell processes (including adaptation, protection)                188              4.38
Biosynthesis of cofactors, prosthetic groups, and carriers       103              2.40
Putative chaperones                                                9              0.21
Nucleotide biosynthesis and metabolism                            58              1.35
Amino acid biosynthesis and metabolism                           131              3.06
Fatty acid and phospholipid metabolism                            48              1.12
Carbon compound catabolism                                       130              3.03
Central intermediary metabolism                                  188              4.38
Putative enzymes                                                 251              5.85
Other known genes (gene product or phenotype known)               26              0.61
Hypothetical, unclassified, unknown                             1632             38.06
Total                                                           4288            100.00*
* Total of these rounded values is 99.97%.
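
The percentages in Table 4, and the footnote about their rounded total, follow directly from the counts. The sketch below is an illustrative check only: the dictionary name `TABLE4_COUNTS` and the helper `percent_hundredths` are assumptions introduced here. Working in integer hundredths of a percent avoids floating-point noise when the rounded values are summed.

```python
# Counts per functional class, transcribed from Table 4.
TABLE4_COUNTS = {
    "Regulatory function": 45,
    "Putative regulatory proteins": 133,
    "Cell structure": 182,
    "Putative membrane proteins": 13,
    "Putative structural proteins": 42,
    "Phage, transposons, plasmids": 87,
    "Transport and binding proteins": 281,
    "Putative transport proteins": 146,
    "Energy metabolism": 243,
    "DNA replication, recombination, modification, and repair": 115,
    "Transcription, RNA synthesis, metabolism, and modification": 55,
    "Translation, posttranslational protein modification": 182,
    "Cell processes (including adaptation, protection)": 188,
    "Biosynthesis of cofactors, prosthetic groups, and carriers": 103,
    "Putative chaperones": 9,
    "Nucleotide biosynthesis and metabolism": 58,
    "Amino acid biosynthesis and metabolism": 131,
    "Fatty acid and phospholipid metabolism": 48,
    "Carbon compound catabolism": 130,
    "Central intermediary metabolism": 188,
    "Putative enzymes": 251,
    "Other known genes (gene product or phenotype known)": 26,
    "Hypothetical, unclassified, unknown": 1632,
}


def percent_hundredths(count, total):
    """Share of total as an integer number of hundredths of a percent."""
    return round(count * 10000 / total)


total = sum(TABLE4_COUNTS.values())
hundredths = {name: percent_hundredths(n, total)
              for name, n in TABLE4_COUNTS.items()}
rounded_sum = sum(hundredths.values())

print(total)
print(hundredths["Hypothetical, unclassified, unknown"] / 100)
print(rounded_sum / 100)
```

The counts add up to the 4288 annotated ORFs, and the individually rounded percentages add up to slightly less than 100, which is what the table footnote records.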

1458 SCIENCE • VOL. 277 • 5 SEPTEMBER 1997 • www.sciencemag.org


ORFs are completely uncharacterized. This is similar to the proportion of unassigned ORFs in other recently sequenced bacterial genomes: Haemophilus influenzae (43%), Synechocystis sp. (45%), and Mycoplasma genitalium (32%) (3).

The largest well-defined functional group consists of 281 transport and binding proteins, and there are an additional 146 putative transport and binding proteins. In contrast, 123 transport proteins have been identified in Haemophilus and 34 in Mycoplasma (3). Whether this difference reflects a larger number of substrates to transport, greater specificity of particular transporters, or greater redundancy in E. coli is not yet clear. In sharp contrast, the number of proteins involved in translation is similar for E. coli (182), Haemophilus (141), and Mycoplasma (101).

On the basis of 1827 characterized E. coli proteins, Riley and Labedan (48) described 75 pairs of isozymes, or multiple enzymes with identical or nearly identical function. An additional 11 groups of potentially redundant enzymes have been identified among the newly sequenced ORFs. Although sequence similarity and functional overlap are not synonymous, these highly conserved proteins [point accepted mutations per 100 residues (PAM) < 110] are likely to carry out the same physiological function.

We have not yet attempted to represent proteins with multiple roles that depend on physiological circumstances. On the basis of our present knowledge, one-fourth of the cell’s resources are devoted to small-molecule metabolism and about one-eighth to large-molecule metabolism, and at least one-fifth of the cell’s resources are associated with cell structure and processes. Of course, this distribution may be altered when the specific functions of the remaining 40% of the gene products become known.

Homology between E. coli proteins and the other sequenced genomes. Figure 3 also presents comparisons of the 4288 E. coli proteins with data from five other complete genomes (3), representing the three major kingdoms. There are two components to the significance of each database hit: the degree of similarity between the aligned proteins, and the amounts of the two proteins that are alignable. In Fig. 3, we have plotted a simple index that takes both components into account.

To provide a preliminary estimate of the number of orthologous sequences shared by E. coli and each of these other complete genomes, we counted only matches including at least 60% of both proteins in an alignment with at least 30% identity. Each protein from another species was permitted only one match to an E. coli protein. The largest number of matches to E. coli is found in the Haemophilus influenzae genome (1.83 Mb encoding 1703 proteins with 1130 hits to E. coli proteins). Haemophilus, like E. coli, is a member of the gamma subdivision proteobacteria, making it the most closely related complete genome available for consideration (49). We also compared two additional eubacterial genomes: Synechocystis sp. (3.6 Mb, 3168 proteins, 675 hits) and Mycoplasma genitalium (0.58 Mb, 468 proteins, 158 hits). All four eubacteria have 111 proteins in common.

The numbers of matches across kingdoms in the archeon Methanococcus jannaschii (1.6 Mb, 1738 proteins, 231 hits) and the eukaryote Saccharomyces cerevisiae (12.1 Mb, 5885 proteins, 254 hits) are remarkably similar to each other. However, according to our significance criteria, only 16 proteins are conserved among all six taxa; they are largely translation proteins, including seven ribosomal proteins and two aminoacyl synthetases. One is classified as a hypothetical ORF in E. coli, Saccharomyces, and Methanococcus, but is described as a putative O-sialoglycoprotein endopeptidase in both Haemophilus and Mycoplasma on the basis of similarity to a Pasteurella haemolytica protein (50).

Nearly 60% of E. coli proteins have no match in any other complete genome considered. These may represent the subset of proteins specific to enterobacterial or E. coli processes as well as insertion elements and phage with restricted host range. The 629 proteins shared exclusively by Haemophilus and E. coli include new genes acquired in this lineage. The 292 proteins common to E. coli and just one of the other four species are indicative of numerous gene losses over the course of genome evolution. This preliminary analysis of similarity among sequences of complete genomes provides many avenues for further study.

Similarity among E. coli proteins. Also presented in Fig. 3 is a comparison of all the proteins of E. coli with each other. These can be divided into families defined by sequence relatedness (5). A paralogous family is generally composed of proteins within a single species with similar, though not necessarily identical, functions. We define putative paralogs as ORFs that share at least 30% sequence identity over more than 60% of their lengths. The similarity index for the best putative paralog of each gene is plotted in Fig. 3. Many E. coli proteins (1345) have at least one paralogous sequence in the genome. The relative size of a gene family for each protein is also shown in Fig. 3. The largest number of significant hits to a single protein (b1917) was 37. This protein is a member of the largest family of paralogous proteins in E. coli, the ABC transporter proteins. Riley and Labedan (5) compiled a list of 54 ABC transporters among E. coli proteins, and analysis of the proteins from the complete genome reveals an additional 26 members of this family. Determination of the number of independent paralogous groups requires a careful examination of all the matches to a particular protein, followed by inspection of all hits to proteins contained within the initial list of matches (5), and will require further analysis.

Many proteins are members of paralogous gene families and have significant matches in other species. It will be difficult, if not impossible, to unambiguously determine the relation between similar genes in different species when the level of divergence between orthologous genes approaches the level of divergence among paralogs within a species. The genes in all genomes are derived from a set of unique ancestral genes present in a progenitor of all extant organisms. Upon duplication of an ancestral gene, copies of the gene may be subsequently lost through natural selection or simply by a neutral stochastic process. Alternately, the copies may be retained as redundant systems for executing the original biological function, or they may diverge, with one or both copies giving rise to a novel function. This process of duplication and divergence, along with the occasional transfer of genes between strains and species, gives rise to the present contents of a genome (51). Characterization of all E. coli paralogous groups and comparison with groups from other species will allow examination of the evolutionary events surrounding protein diversification.

Operons, promoters, and protein binding sites. Operons, promoters, and regulatory protein binding sites are shown in Fig. 3. A total of 2584 predicted and known operons are represented. Of 2192 predicted operons, a surprisingly high 73% have only one gene, 16.6% have two genes, 4.6% have three genes, and 6% have four or more genes. All of them have at least one promoter, either known or predicted. Of 2405 operon regions with predicted promoters, 68% contain one promoter, 20% contain two promoters, and 12% contain three or more promoters. Regulatory sites are described in 603 regions corresponding to 16% of operon regions and 10% of interoperonic regions. We estimate that our search included representatives of 15 to 25% of the total number of different regulatory binding proteins in E. coli, including sites that are recognized by global regulators of transcription (for example, sites bound by the cyclic AMP receptor protein, CRP). Within the regions with predicted sites, 89.2% are regulated by one protein, 8.4% by two proteins, and 2.4% by three or more proteins. In 81.2% of these regions



only one site was found, 12.2% have two sites, and 6.6% have three or more sites. These numbers are more or less consistent with the distribution of regulatory sites among a set of promoters where transcriptional regulation has been well studied. In this collection of 132 promoters, 73% are regulated by one protein, and 43% contain only one site for the binding of a regulator (52). A number of E. coli genes are part of known operons (Fig. 3, red arrows).

Repeated sequences. A number of repeated sequences have been characterized in the E. coli genome (53). The number and distribution of these sequences in the whole genome are summarized in Fig. 2. The largest repeated sequences in E. coli K-12 are the five Rhs elements (all previously described), which are 5.7 to 9.6 kb in length and together comprise 0.8% of the genome. They have no known function, although strain comparisons suggest they may be mobile elements. The ~40-bp palindromic sequences variously referred to as REP, BIME, or PU constitute the largest class of repeats. They are often found as tandem copies, alternating in orientation, in complexes called REP elements. We have located 581 such sequences, in 314 REP elements containing from 1 to 12 tandem copies (see also Fig. 1). These elements account for 0.54% of the genome and are of unknown origin and function. These can be subdivided into distinct classes, as described by Bachellier et al. (53). Of the other known small dispersed repeats, we find four new IRU (or ERIC) elements, for a total of 19; four new copies of Box C, for a total of 33; and only the previously described six copies of RSA. The distribution of some of these repeated sequences may not be totally random; for example, Box C is absent over a 1-Mbp span in replichore 2.

Another repeated sequence found in the E. coli genome is the Ter sequence, which acts as a one-way gate or valve to block the progression of the DNA replication fork such that replication starting from the origin is prevented from progressing beyond the terminus marked by the dif site (54). François et al. (55) identified 10 different chromosomal fragments with homology to an oligomeric TerA probe, but only seven Ter sequences (TerA through TerG) have been identified to date. We found two new copies of the 11-bp Ter core sequence TGTTGTAACTA, both of which are located and oriented as expected relative to dif.

The sequence named LDR (11) occurs as three tandem copies at positions 1,268,308 to 1,269,848; a lone fourth copy, shorter and diverged from the consensus of the other copies, is located at positions 3,697,525 to 3,697,888. In the region between positions 2,875,665 and 2,902,430, a 29-bp sequence called the iap repeat is found in three clusters of 14, 2, and 7 copies, for a total of 23 copies (53, 56). No additional copies of either of these sequences are found in the rest of the genome.

Insertion sequences. The chromosome of E. coli K-12 contains a number of autonomously transposable elements that are implicated in the generation of many spontaneous mutations, not only by insertional inactivation, but also by deletions, duplications, and inversions. Estimates have been made as to the IS element set present in E. coli K-12 when originally isolated (57). The IS elements’ map positions are shown in Fig. 2. There are two multicomponent clusters. At positions 269,430 to 271,751, there is an IS911-related sequence (65% match), which we term IS911A, interrupted by a copy of IS30. At positions 4,504,683 to 4,507,369, there is a more faithful copy of IS911 (designated IS911B), which is also interrupted by a copy of IS30 as well as by a piece of IS600. This is the only IS600-related sequence in the genome. We did not find the copy of IS629 that had been suspected from hybridization studies (58).

Cryptic prophage and phage remnants. As originally isolated, E. coli K-12 carried bacteriophage lambda plus the defective lambdoid prophages DLP12, Rac, and Qin, the element e14, and the recently described CP4-57 (59). Defective, or cryptic, prophages have lost some functions essential for lytic growth and the production of infectious particles, but still retain other functional phage genes. They can rescue mutations in related infecting bacteriophages by recombining with them to generate viable hybrids. Figure 2 shows a histogram plot presenting all sequence matches to the phage proteins in SWISS-PROT. In addition to clarifying the structure of the known prophages, we identified three new cryptic prophages. Moreover, we found numerous instances of isolated genes that are similar to bacteriophage genes. We call these single genes “phage remnants” to distinguish them from the larger cryptic prophages. Although this implies a phage origin (the last vestiges of a cryptic prophage ravaged by deletions), these genes may actually be homologs encoded by both a bacteriophage and its host, with no ready indication as to which genome was the original carrier.

We determined the precise endpoints of e14 in MG1655 (positions 1,195,432 and 1,210,646), including terminal 11-bp direct repeats, from the published excised element and e14-free chromosome sequences (GenBank accession numbers M19693 and M19683). The 1829-bp Pin invertable P-region of e14 is in the (–) orientation in this sequence. The precise boundaries of the other lambdoid prophage remain to be annotated. The “cryptic P4” phage CP4-57 (59) is located at 57 minutes, where it is inserted into the stable RNA gene ssrA. The junction sequences (59) allowed us to identify the extended attL and attR sequences and to define the endpoints of the prophage (positions 2,753,956 and 2,776,007); our earlier report (GenBank accession number U36840) that attR was deleted in MG1655 was a misinterpretation.

We have discovered two new cryptic prophages, seemingly related to CP4-57, which we name CP4-6 and CP4-44 after their minute positions. The three CP4 prophage are organized similarly and encode several similar proteins, although they do not share the same attachment sites. We infer that CP4-6 is integrated into tRNA gene thrW (60) because the 3′ end of thrW is duplicated 34,242 bp downstream adjacent to b0281, a homolog of several integrases. This prophage (positions 262,122 to 296,489) includes argF, a known “duplicate” gene in the arginine biosynthesis pathway that has been suggested to have been acquired through a transposition event (61). It also includes the IS911A complex, a partial IS30 copy, two copies of IS1, and one copy of IS5. CP4-44 is less well defined (approximate endpoints at positions 2,064,181 and 2,077,053) and we suspect that insertion of the IS5 at its left end may have been accompanied by a deletion of part of the prophage; although it shares other ORFs with CP4-6 and CP4-57, it has no candidate integrase or associated direct repeats that might be att sites.

A third new cryptic prophage is located in the eut operon. Its presumptive integrase (b2442) resembles that of phiR-73, Sf6, and the CP4 family, but no other ORFs suggest its inclusion in the CP4 group. The endpoints of the element (positions 2,556,711 and 2,563,508) were defined by comparison with the sequence of Salmonella typhimurium, from which the element is missing (62). The 8-bp direct repeat TCAGGAAG at the ends is present as a single copy in Salmonella. The W3110 sequence from the Japanese group (http://mol.genes.nig.ac.jp/ecoli/) is missing this element, which, in light of the K-12 pedigree, suggests that this element is able to excise.

Conclusion

Although the determination of the complete E. coli sequence has required almost 6 years, this represents only the beginning of our understanding. Further research will be required to determine the precise functions for all of the genes by global tran-



scriptional analysis, phenotypic analysis of mutants, and analysis of biochemical and catalytic properties of the expressed proteins. Another fruitful avenue for exploration will lie in whole genome comparisons, both with related pathogens to identify those genes that confer unique detrimental or beneficial properties, and with other microbial genomes to ascertain evolutionary relations.

REFERENCES AND NOTES
___________________________
1. F. R. Blattner, Science 222, 719 (1983). Escherichia coli has been the subject of extensive monographs, the most recent of which is (2).
2. Escherichia coli and Salmonella Cellular and Molecular Biology, F. C. Neidhardt et al., Eds. (ASM Press, Washington, DC, 1996).
3. The publicly available complete genome sequences are those of Haemophilus influenzae Rd [R. D. Fleischmann et al., Science 269, 496 (1995)], Mycoplasma genitalium [C. M. Fraser et al., ibid. 270, 397 (1995)], Methanococcus jannaschii [C. J. Bult et al., ibid. 273, 1058 (1996)], Mycoplasma pneumoniae [R. Himmelreich et al., Nucleic Acids Res. 24, 4420 (1996)], Synechocystis sp. strain PCC6803 [T. Kaneko et al., DNA Res. 3, 109 (1996)], and Saccharomyces cerevisiae [A. Goffeau et al., Science 274, 546 (1996)].
4. S.-E. Chuang, D. L. Daniels, F. R. Blattner, J. Bacteriol. 175, 2026 (1993); D. J. Lockart et al., Nature Biotechnol. 14, 1675 (1996).
5. M. Riley and B. Labedan, J. Mol. Biol. 269, 1 (1997).
6. F. C. Neidhardt, in (2), vol. 2, pp. 1–3.
7. B. Bachmann, in (2), vol. 2, pp. 2460–2488.
8. K. F. Jensen, J. Bacteriol. 175, 3401 (1993).
9. R. P. Lawther et al., ibid. 149, 294 (1982).
10. D. Liu and P. R. Reeves, Microbiology 140, 49 (1994).
11. T. Yura et al., Nucleic Acids Res. 20, 3305 (1992); N. Fujita, H. Mori, T. Yura, A. Ishihama, ibid. 22, 1637 (1994); T. Oshima et al., DNA Res. 3, 137 (1996); H. Aiba et al., ibid., p. 363; T. Itoh et al., ibid., p. 379.
12. V. Burland, D. L. Daniels, G. Plunkett III, F. R. Blattner, Nucleic Acids Res. 21, 3385 (1993).
13. Six segments of the genome were sequenced using radioactive chemistry (14) [D. L. Daniels, G. Plunkett III, V. Burland, F. R. Blattner, Science 257, 771 (1992); G. Plunkett III, V. Burland, D. L. Daniels, F. R. Blattner, Nucleic Acids Res. 21, 3391 (1993); F. R. Blattner, V. Burland, G. Plunkett III, H. J. Sofia, D. L. Daniels, ibid., p. 5408; H. J. Sofia, V. Burland, D. L. Daniels, G. Plunkett III, F. R. Blattner, ibid. 22, 2576 (1994); V. Burland, G. Plunkett III, H. J. Sofia, D. L. Daniels, F. R. Blattner, ibid. 23, 2105 (1995)]. We determined experimentally that deoxyinosine triphosphate (dITP) is the most effective analog for resolving G-C compressions, although it also causes premature termination. With radioactive sequencing, a dITP sequence lane must be run in addition to, rather than in place of, a deoxyguanosine triphosphate (dGTP) run. For efficiency in the areas of E. coli we sequenced radioactively, tiling software was used to select a minimal set of M13 clones for resequencing with dITP after the bulk of the assembly had been completed with dGTP. On the other hand, because prematurely terminated chains are not labeled by the fluorophore with dye-terminator fluorescent sequencing, dITP can substitute totally for dGTP and can be used for all routine data collection.
14. V. Burland, G. Plunkett III, D. L. Daniels, F. R. Blattner, Genomics 16, 551 (1993).
15. D. L. Daniels, in The Bacterial Chromosome, K. Drlica and M. Riley, Eds. (American Society for Microbiology, Washington, DC, 1990), pp. 43–51. It was often necessary to resequence overlapping regions between adjacent clones, and screening to remove lambda vector sequences before sequencing was costly. Occasionally we found deleted, mismapped, or chimeric lambda clones that created unexpected gaps in genome coverage.
16. Although the 1-µg yield of popout plasmid [G. Pósfai et al., Nucleic Acids Res. 22, 2392 (1994)] was low for early shotgun protocols, the assemblies were successful when supplemented with lambda clone and long-range PCR data. The main problem with extending this approach was the need to specifically engineer each popout plasmid by insertional recombination into the host.
17. I-SceI is a site-specific intron-encoded homing endonuclease from yeast [A. Perrin, M. Buckle, B. Dujon, EMBO J. 12, 2939 (1993)], whose 18-bp nonpalindromic recognition site is absent from E. coli (C. A. Bloch and C. K. Rode, unpublished data). Single I-SceI sites were introduced into MG1655 on a transposable element to produce a mapped collection of strains, each with a unique I-SceI site [C. K. Rode, V. H. Obreque, C. A. Bloch, Gene 166, 1 (1995); C. A. Bloch, C. K. Rode, V. H. Obreque, J. Mahillon, Biochem. Biophys. Res. Commun. 223, 104 (1996)]. P1 transduction was used to combine sites in pairs, permitting isolation of I-SceI fragments as single bands by pulsed-field gel electrophoresis. Sequencing confirmed the expected nine-base overlap between adjacent fragments. Although the background contamination for entire I-SceI fragment shotguns ranged from 15 to 30%, we occasionally observed individual preparative gels that seemed to have <5% background, as assessed from gel images. We therefore suspect that improvements in gel handling and electrophoretic conditions could improve the overall quality of the fragment preparations.
18. V. Burland, F. P. Curtis, N. Kusukawa, Biotechniques 21, 142 (1996).
19. Codon usage statistics [M. Borodovsky and J. McIninch, Comput. Chem. 17, 123 (1993); M. Gribskov, J. Devereux, R. R. Burgess, Nucleic Acids Res. 12, 539 (1984)] were graphically displayed by means of the program Geneplot (DNASTAR). Protein searches were to SWISS-PROT release 34 [A. Bairoch and R. Apweiler, ibid. 24, 21 (1996)]. The Link database is described in A. J. Link, thesis, Harvard University (1994). Signal peptide searches used an unpublished BASIC program written by F.R.B. Predictions for ribosomal binding sites were provided by W. S. Hayes and M. Borodovsky (personal communication).
20. M. Riley, Nucleic Acids Res. 25, 51 (1997).
21. P. Karp, M. Riley, S. M. Paley, A. Pellegrini-Toole, M. Krummenacker, ibid., p. 43.
22. Similarity searches were conducted using both the DeCypher II hardware-software system (Time Logic Inc., Incline Village, NV) and the PepPepSearch program of the Darwin suite at Zurich, http://cbrg.inf.ethz.ch/ [G. H. Gonnet, M. A. Cohen, S. A. Benner, Science 256, 1443 (1992)]. PepPepSearch returns up to 30 hit sequences per query, and returns each pairwise alignment and the corresponding PAM scores. For most of the cases, only matches with PAM < 200 were used. See B. Labedan and M. Riley, Mol. Biol. Evol. 12, 980 (1995).
23. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, J. Mol. Biol. 215, 403 (1990).
24. K. Kashiwagi, Y. Yamaguchi, Y. Sakai, H. Kobayashi, K. Igarashi, J. Biol. Chem. 265, 8387 (1990).
25. Y. Lu, C. Flaherty, W. Hendrickson, ibid. 267, 24848 (1992).
26. Using the database of 392 known operons that we have localized in the genome sequence, we first predicted operons on the basis of the functional class conservation within genes of an operon. This gives a better prediction (68% positive prediction) than the method of predicting operons on the basis of the distance of genes inside operons versus the distance between operons (59% positive prediction). We predicted 2281 operons by functional class conservation and predicted the remainder with unclassified genes, using 50 bp as the distance criterion. The strategy found to give the highest number of positive promoter predictions (~40% when tested with an independent set of known promoters) involves an initial search with a pair of weight matrices, one for the –10 region and one for the –35 region. Candidate promoters using a low threshold of matches and 15 to 21 bp between –10 and –35 are saved. A subset of best candidates are selected on the basis of a context measure that compares alternative candidates within a given region of 200 bp upstream of each ORF. This includes a weight preference for candidates located closer to the beginning of the gene. The method can find zero, one, or several promoters in a single region. Inside operons, we only saved promoters where regulatory sites were also found. Regulatory sites were searched with a combined weight matrix (when at least three sequences are known) and a string search that allows a fixed number of mismatches for each regulatory site. To avoid overrepresentation of particular sites, we adjusted the number of allowed mismatches such that the number of predicted sites did not exceed 10 times the number of known sites for a given regulatory protein [D. A. Rosenblueth, D. Thieffry, A. M. Huerta, H. Salgado, J. Collado-Vides, Comput. Appl. Biosci. 12, 415 (1997)].
27. P. M. Sharp and W. H. Li, Nucleic Acids Res. 15, 1281 (1987).
28. H. Grosjean and W. Fiers, Gene 18, 199 (1982); T. Ikemura, Mol. Biol. Evol. 2, 13 (1985).
29. C. Médigue, T. Rouxel, P. Vigier, A. Henaut, A. Danchin, J. Mol. Biol. 222, 851 (1991).
30. The zero reference (0/100, formerly 0/60) of the map was originally defined as the position of the first marker (thr) transferred by E. coli Hfr H, which was used in genetic mapping by interrupted mating, and a convention has arisen of using the first residue of the thrA gene as residue 1. However, this results in placing the regulatory region of the thr operon at the opposite end of the 4.6-Mb sequence from the operon itself. We therefore defined nucleotide 1 as the A residue 189 nucleotides upstream of the initiation codon for thrL, the first gene on the genetic map. We did not detect any feature spanning this point.
31. B. J. Brewer, in The Bacterial Chromosome, K. Drlica and M. Riley, Eds. (American Society for Microbiology, Washington, DC, 1990), pp. 61–83.
32. C.-I. Wu and N. Maeda, Nature 327, 169 (1987); N. T. Perna and T. D. Kocher, J. Mol. Evol. 41, 353 (1995).
33. J. R. Lobry, Mol. Biol. Evol. 13, 660 (1996); Science 272, 745 (1996).
34. L. R. Cardon, C. Burge, G. A. Schachtel, B. E. Blaisdell, S. Karlin, Nucleic Acids Res. 21, 3875 (1993); B. E. Blaisdell, K. E. Rudd, A. Matin, S. Karlin, J. Mol. Biol. 229, 833 (1993).
35. K. Yoda, H. Yasuda, X. W. Xiang, T. Okazaki, Nucleic Acids Res. 16, 6531 (1988); H. Hiasa et al., Gene 84, 9 (1989); K. Yoda and T. Okazaki, Mol. Gen. Genet. 227, 1 (1991); J. R. Swart and M. A. Griep, J. Biol. Chem. 268, 12970 (1993).
36. T.-C. V. Wang and S.-H. Chen, Biochem. Biophys. Res. Commun. 184, 1496 (1992); ibid. 198, 844 (1994).
37. The major recombination pathway in E. coli is the RecBCD pathway, so called because of the central involvement of the enzyme encoded by the recBCD genes. For a review of RecBCD-mediated recombination, see F. Stahl and R. Myers, J. Hered. 86, 327 (1995); see also (38). For a review of recombination-deficient variants of Chi, see D. W. Schultz, J. Swindle, G. R. Smith, J. Mol. Biol. 146, 275 (1981).
38. A. Kuzminov, Mol. Microbiol. 16, 373 (1995).
39. C. Burge, A. M. Campbell, S. Karlin, Proc. Natl. Acad. Sci. U.S.A. 89, 1358 (1992); M. McClelland and A. S. Bhagwat, Nature 355, 595 (1992); A. S. Bhagwat and M. McClelland, Nucleic Acids Res. 20, 1663 (1992); R. Merkl, M. Kroger, P. Rice, H. J. Fritz, ibid., p. 1657; S. Karlin and L. R. Cardon, Annu. Rev. Microbiol. 48, 619 (1994).
40. C. Médigue, A. Viari, A. Hénaut, A. Danchin, Mol. Microbiol. 5, 2629 (1991).
41. R. P. Burlingame, L. Wyman, P. J. Chapman, J. Bacteriol. 168, 55 (1986); T. D. H. Bugg, Biochim. Biophys. Acta 1202, 258 (1993); E. Spence, M. Kawamukai, J. Sanvoisin, H. Braven, T. Bugg, J. Bacteriol. 178, 5249 (1996).
42. H. M. Tan, H. Y. Tang, C. L. Joannou, N. H. Abdel-Wahab, J. R. Manson, Gene 130, 33 (1993).
43. R. M. Macnab, in (2), vol. 2, pp. 123–145; M.



Homma, D. J. DeRosier, R. M. Macnab, J. Mol. Biol. 213, 819 (1990); K. Ohnishi, Y. Ohto, S. Aizawa, R. M. Macnab, T. Iino, J. Bacteriol. 176, 2272 (1994); for a discussion of mviM and mviN, see K. Kutsukake, T. Okada, T. Yokoseki, T. Iino, Gene 143, 49 (1994).
44. For a discussion of ATT start in infC, see C. Sacerdot et al., EMBO J. 1, 311 (1982); for a discussion of CTG start in htgA, see D. Missiakas, C. Georgopoulos, S. Raina, J. Bacteriol. 175, 2613 (1993).
45. D. L. Daniels, F. Sanger, A. R. Coulson, Cold Spring Harbor Symp. Quant. Biol. 47, 1009 (1983); F. Sanger, A. R. Coulson, G. F. Hong, D. F. Hill, G. B. Petersen, J. Mol. Biol. 162, 729 (1982).
46. A number of bacterial proteins have been implicated in mediating the invasion of host cells by pathogens. Attaching and effacing proteins are involved in eliciting an extensive rearrangement of host cell actin by enteropathogenic E. coli strains, whereas invasins are bacterial surface proteins that provoke the endocytic uptake of Yersinia and Salmonella spp. by host cells. For an overview of bacterial pathogenesis, including virulence factors, see A. A. Salyers and D. D. Whitt, Bacterial Pathogenesis: A Molecular Approach (ASM Press, Washington, DC, 1994).

Tatusov et al., Curr. Biol. 6, 279 (1996)] is based on less restrictive criteria and includes sequences with as little as 18% identity.
50. K. M. Abdullah, R. Y. Lo, A. Mellors, J. Bacteriol. 173, 5597 (1991).
51. S. Ohno, Evolution by Gene Duplication (Springer-Verlag, Berlin, 1970).
52. J. D. Gralla and J. Collado-Vides, in (2), vol. 1, pp. 1232–1244.
53. S. Bachellier, E. Gilson, M. Hofnung, C. W. Hill, in (2), vol. 2, pp. 2012–2040.
54. T. M. Hill, in (2), vol. 2, pp. 1602–1612.
55. V. François, J. Louarn, J.-M. Louarn, Mol. Microbiol. 3, 995 (1989).
56. A. M. Nakata, M. Amemura, K. Makino, J. Bacteriol. 171, 3553 (1989).
57. R. C. Deonier, in (2), vol. 2, pp. 2000–2011.
58. S. Matsutani and E. Ohtsubo, Gene 127, 111 (1993).
59. For a review of K-12 prophage, see A. M. Campbell, in (2), vol. 2, pp. 2041–2046. CP4-57 is described in D. M. Retallack, L. L. Johnson, D. I. Friedman, J. Bacteriol. 176, 2082 (1994); J. E. Kirby, J. E. Trempy, S. Gottesman, ibid., p. 2068.
60. P22 [D. F. Lindsey, C. Martinez, J. R. Walker, J. Bacteriol. 174, 3834 (1992)] and a phage from a

ly D. L. Daniels and N. Peterson, who were present at the creation. We also thank R. Straussburg and M. Guyer, our program administrators; R. R. Burgess and M. Sussman for critical reading of the manuscript; M. Borodovsky and W. S. Hayes for application of a new version of the GeneMark program to the analysis of the sequence; K. Rudd for his Ecoseq7 melds of GenBank data; J. Mahillon for providing I-SceI strains; J. Roth and E. Kofoid for unpublished Salmonella data; the Japanese group under H. Mori and T. Horiuchi for cooperative competition; G. Pósfai and W. Szybalski for the popout strains; S. Baldwin, C. Allex, N. Manola, G. Bouriakov, and J. Schroeder of DNASTAR for extraordinary software; A. Huerta, H. Salgado, and D. Thieffry for help with promoter, operon, and regulatory site identification; T. Thiesen for Postscript illustrations; H. Kijenski, G. Peyrot, P. Soni, G. Diarra, E. Grotbeck, T. Forsythe, M. Maguire, M. Federle, S. Subramanian, and K. Kadner for excellent technical work; and 169 University of Wisconsin undergraduates who participated over the last decade. Supported by NIH grants P01 HG01428 (from the Human Genome Project) and S10 RR10379 (for ABI machines from the National Cen-
47.
48. iiii
M. Riley, Microbiol. Rev. 57, 862 (1993).
and B. Labedan, in (2), vol. 2, pp. 2118 –
clinical isolate [D. Lim, Mol. Microbiol. 6, 3531
(1992)] also integrate into thrW.
ter for Research Resources–Biomedical Research
Support Shared Instrumentation Grant). We thank

Downloaded from http://science.sciencemag.org/ on February 9, 2020


2202. 61. F. Van Vliet, A. Boyen, N. Glansdorff, Ann. Inst. Pas- IBM for the gift of workstations, the State of Wis-
49. Relations among these eubacteria are estimated by teur Microbiol. 139, 493 (1988). consin for remodeling support, and especially
a rRNA phylogeny [G. J. Olsen, C. R. Woese, R. 62. E. Kofoid and J. Roth, personal communication. SmithKline Beecham Pharmaceuticals and Ge-
Overbeek, J. Bacteriol. 176, 1 (1994)]. A previous 63. This is Laboratory of Genetics paper 3487. We nome Therapeutics Corp. for financial support of
estimate of 1128 Haemophilus influenzae orthologs thank the entire E. coli community for their support, the annotation of this sequence. N.P. is an NSF
among 75% of the complete E. coli genome [ R. L. encouragement, and sharing of data, and especial- fellow in molecular evolution.

Fig. 3 (foldout). Map of the complete E. coli sequence, its features and similarities to proteins from five other complete genome sequences, proceeding from left to right in 42 tiers. The top line shows each gene or hypothetical gene, color-coded to represent its known or predicted function as assigned on the basis of biochemical and genetic data. Genes are vertically offset to indicate their direction of transcription. Space permitting, names of previously described E. coli genes are indicated above the line. The second line contains arrows indicating documented (red) and predicted (black) operons. Documented operons encoding stable RNAs are blue. Line 3, below the operons, contains tick marks showing the position of documented (red), predicted (black), and stable RNA (blue) promoter sequences. Line 4 consists of tick marks showing the position of documented (red) and predicted (black) protein binding sites. Lines 5 to 9 are histograms showing the results of alignments between E. coli proteins and the products encoded by five other complete genomes. The height of each bar is a simple index of similarity: the product of the percent of each protein in the pairwise alignment and the percent amino acid identity across the aligned region. Line 10 indicates similarity among proteins in E. coli in the same fashion. Line 11 histograms show the logarithm of the number of proteins in the E. coli genome that match a particular protein. Line 12 in each tier is a histogram that indicates the CAI of each ORF. Genes with intermediate CAI values are shown in orange, genes with high CAI values (>90th percentile) are a darker shade of orange, genes with low CAI values (<10th percentile) are light brown, and clusters of four or more genes with low CAI values (<0.25) are yellow. The final line in each tier is a scale showing position (in base pairs).
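The two numeric rules in the caption (the similarity index of lines 5 to 10 and the CAI color classes of line 12) can be illustrated with a minimal sketch. The function names, the 0 to 100 scaling of the index, the nearest-rank percentile cutoffs, and the reading of "clusters" as runs of consecutive genes are all assumptions made for illustration; the caption does not specify them, and this is not the authors' code.

```python
def similarity_index(aligned_len, protein_len, pct_identity):
    """Bar height sketch: (percent of the protein inside the alignment)
    times (percent identity over the aligned region).
    Assumption: the product is rescaled so the result lies in 0..100."""
    if protein_len <= 0:
        raise ValueError("protein_len must be positive")
    pct_aligned = 100.0 * aligned_len / protein_len
    return pct_aligned * pct_identity / 100.0


def cai_colors(cai, low_abs=0.25, min_run=4):
    """Assign a caption color class to each gene, given CAI values in map order.
    Assumptions: percentile cutoffs use a simple nearest-rank rule, and a
    'cluster' is a run of at least min_run consecutive genes with CAI < low_abs."""
    n = len(cai)
    if n == 0:
        return []
    ranked = sorted(cai)
    p10 = ranked[int(0.10 * (n - 1))]  # approximate 10th percentile
    p90 = ranked[int(0.90 * (n - 1))]  # approximate 90th percentile
    colors = []
    for value in cai:
        if value > p90:
            colors.append("dark orange")
        elif value < p10:
            colors.append("light brown")
        else:
            colors.append("orange")
    # Recolor runs of consecutive low-CAI genes as yellow.
    i = 0
    while i < n:
        j = i
        while j < n and cai[j] < low_abs:
            j += 1
        if j - i >= min_run:
            for k in range(i, j):
                colors[k] = "yellow"
        i = max(j, i + 1)
    return colors
```

For example, a protein of 300 residues with 150 residues aligned at 40% identity would score 20 on this assumed scale; the published figure may use a different normalization.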



[Fig. 3 foldout map: six panels that sequence horizontally. Gene names are printed along each tier, and the position scale is labeled in base pairs from 0 to 4,600,000. Tracks in each tier, top to bottom: genes (ORFs and RNAs); operons; promoters; protein binding (PB) sites; Haemophilus influenzae; Synechocystis sp.; Mycoplasma genitalium; Methanococcus jannaschii; Saccharomyces cerevisiae; best match in E. coli; log(number of E. coli matches); Codon Adaptation Index (CAI).

Gene function color-coding key: Regulatory function; Putative regulatory proteins; Cell structure; Putative membrane proteins; Putative structural proteins; Phage, transposons, plasmids; Transport and binding proteins; Putative transport proteins; Energy metabolism; Putative chaperones; Putative enzymes; Other known genes; DNA replication, recombination, modification, and repair; Transcription, RNA synthesis, metabolism, and modification; Translation and posttranslational protein modification; Cell processes (including adaptation and protection); Biosynthesis of cofactors, prosthetic groups, and carriers; Nucleotide biosynthesis and metabolism; Amino acid biosynthesis and metabolism; Fatty acid and phospholipid metabolism; Central intermediary metabolism; Carbon compound catabolism; Hypothetical, unclassified, unknown; tRNAs, rRNAs, and misc. RNAs.]


The Complete Genome Sequence of Escherichia coli K-12
Frederick R. Blattner, Guy Plunkett III, Craig A. Bloch, Nicole T. Perna, Valerie Burland, Monica Riley, Julio Collado-Vides, Jeremy
D. Glasner, Christopher K. Rode, George F. Mayhew, Jason Gregor, Nelson Wayne Davis, Heather A. Kirkpatrick, Michael A.
Goeden, Debra J. Rose, Bob Mau and Ying Shao

Science 277 (5331), 1453-1462.


DOI: 10.1126/science.277.5331.1453



ARTICLE TOOLS http://science.sciencemag.org/content/277/5331/1453

REFERENCES This article cites 76 articles, 21 of which you can access for free
http://science.sciencemag.org/content/277/5331/1453#BIBL

PERMISSIONS http://www.sciencemag.org/help/reprints-and-permissions

Use of this article is subject to the Terms of Service

Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of
Science, 1200 New York Avenue NW, Washington, DC 20005. The title Science is a registered trademark of AAAS.
Copyright © 1997 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science.
No claim to original U.S. Government Works.
