Genome E Coli k12 PDF
The 4,639,221-base pair sequence of Escherichia coli K-12 is presented. Of 4288 protein-coding genes annotated, 38 percent have no attributed function. Comparison with five other sequenced microbes reveals ubiquitous as well as narrowly distributed gene families; many families of similar genes within E. coli are also evident. The largest family of paralogous proteins contains 80 ABC transporters. The genome as a whole is strikingly organized with respect to the local direction of replication; guanines, oligonucleotides possibly related to replication and recombination, and most genes are so oriented. The genome also contains insertion sequence (IS) elements, phage remnants, and many other patches of unusual composition indicating genome plasticity through

The first 1.92 Mb (13, 14), positions 2,686,777 to 4,639,221 [in base pairs (bp)], was sequenced from our overlapping set of 15- to 20-kb MG1655 lambda clones (15) by means of radioactive chemistry and was deposited in GenBank between 1992 and 1995. Subsequently, we switched to dye-terminator fluorescence sequencing (Applied Biosystems). In addition to greater
[Figure 1: circular map of the E. coli K-12 MG1655 genome (4,639,221 bp), with the origin, terminus, and both replichores labeled and an outer scale in base pairs and minutes.]
Fig. 1. The overall structure of the E. coli genome. The origin and terminus of replication are shown as green lines, with blue arrows indicating replichores 1 and 2. A scale indicates the coordinates both in base pairs and in minutes (actually centisomes, or 100 equal intervals of the DNA). The distribution of genes is depicted on two outer rings: The orange boxes are genes located on the presented strand, and the yellow boxes are genes on the opposite strand. Red arrows show the location and direction of transcription of rRNA genes, and tRNA genes are shown as green arrows. The next circle illustrates the positions of REP sequences around the genome as radial tick marks. The central orange sunburst is a histogram of inverse CAI (1 - CAI), in which long yellow rays represent clusters of low (<0.25) CAI. The CAI plot is enclosed by a ring indicating similarities between previously described bacteriophage proteins and the proteins encoded by the complete E. coli genome; the similarity is plotted as described in Fig. 3 for the complete genome comparisons.
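The caption for Fig. 1 describes the "minutes" scale as centisomes, that is, 100 equal intervals of the DNA. A minimal sketch of that conversion follows. It assumes that coordinate 0 bp coincides with 0 centisomes and that the scale is strictly linear; the function names are hypothetical and do not come from the article.

```python
GENOME_LENGTH_BP = 4_639_221  # genome length stated in the text


def bp_to_centisome(position_bp, genome_length_bp=GENOME_LENGTH_BP):
    """Map a base-pair coordinate onto a 0-100 centisome scale.

    Assumes 0 bp corresponds to 0 centisomes (an assumption, not
    something the caption states).
    """
    if not 0 <= position_bp <= genome_length_bp:
        raise ValueError("position lies outside the genome")
    return 100.0 * position_bp / genome_length_bp


def centisome_to_bp(centisome, genome_length_bp=GENOME_LENGTH_BP):
    """Inverse mapping, rounded to the nearest whole base pair."""
    if not 0 <= centisome <= 100:
        raise ValueError("centisome value must lie between 0 and 100")
    return round(centisome * genome_length_bp / 100.0)
```

Under these assumptions one centisome spans 46,392.21 bp, so the 1.92-Mb region starting at position 2,686,777 would begin at roughly centisome 57.9.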
[Figure 2 plot. Row labels, top to bottom: 1st position; 2nd position; 3rd position; all positions; leftward genes; intergenic; rightward genes; Chi; 8-mer; Rhs; REP; IRU; Box C; RSA; Ter; LDR; iap; IS1; IS2; IS3; IS4; IS5; IS150; IS186; IS30; IS600; IS911; Phage; EcoK. Horizontal axis: position (base pairs), 0 to 4,000,000, with the terminus and origin marked.]
Fig. 2. Base composition is not randomly distributed in the genome. G-C skew [(G - C)/(G + C)] is plotted as a 10-kb window average for one strand of the entire E. coli genome. Skew plots for the three codon positions are presented separately; leftward genes, rightward genes, and non-protein-coding regions are shown in lines 5, 6, and 7. The two horizontal lines below the skew plots show the distribution of two highly skewed octamer sequences, GCTGGTGG (Chi) and GCAGGGCG (8-mer). Tick marks indicate the position of each copy of a sequence in the complete genome and are vertically offset to indicate the strand containing the sequence. The next 18 horizontal lines correspond to distinct classes of repetitive elements. The penultimate line contains a histogram showing the similarity (the product of the percent of each protein in the pairwise alignment and the percent amino acid identity across the aligned region) of known phage proteins to the proteins encoded by the complete E. coli genome. The last line indicates the position and orientation of the EcoK restriction-modification site AACNNNNNNGTGC (N, any nucleotide). Two vertical lines through the plots show the location of the origin and terminus of replication.
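The G-C skew in Fig. 2 is defined as (G - C)/(G + C) on one strand, averaged over 10-kb windows. The sketch below shows one way such a profile could be computed. It assumes non-overlapping windows and a linear (not circular) sequence, neither of which the caption specifies; the function names are hypothetical.

```python
def gc_skew(seq):
    """Return (G - C)/(G + C) for one strand, or 0.0 if it has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq, window=10_000, step=None):
    """Return a list of (window_start, skew) pairs along the sequence.

    By default the windows do not overlap (step == window); this is an
    assumption made for the sketch. A sequence shorter than one window
    is treated as a single window.
    """
    step = step or window
    last_start = max(len(seq) - window + 1, 1)
    return [
        (start, gc_skew(seq[start:start + window]))
        for start in range(0, last_start, step)
    ]
```

For example, `windowed_gc_skew("GGGGCCCC", window=4)` gives a skew of 1.0 for the first window and -1.0 for the second, the kind of sign change the figure shows at the origin and terminus.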
Fig. 3 (foldout). Map of the complete E. coli sequence, its features and similarities to proteins from five other complete genome sequences, proceeding from left to right in 42 tiers. The top line shows each gene or hypothetical gene, color-coded to represent its known or predicted function as assigned on the basis of biochemical and genetic data. Genes are vertically offset to indicate their direction of transcription. Space permitting, names of previously described E. coli genes are indicated above the line. The second line contains arrows indicating documented (red) and predicted (black) operons. Documented operons encoding stable RNAs are blue. Line 3, below the operons, contains tick marks showing the position of documented (red), predicted (black), and stable RNA (blue) promoter sequences. Line 4 consists of tick marks showing the position of documented (red) and predicted (black) protein binding sites. Lines 5 to 9 are histograms showing the results of alignments between E. coli proteins and the products encoded by five other complete genomes. The height of each bar is a simple index of similarity: the product of the percent of each protein in the pairwise alignment and the percent amino acid identity across the aligned region. Line 10 indicates similarity among proteins in E. coli in the same fashion. Line 11 histograms show the logarithm of the number of proteins in the E. coli genome that match a particular protein. Line 12 in each tier is a histogram that indicates the CAI of each ORF. Genes with intermediate CAI values are shown in orange, genes with high CAI values (>90th percentile) are a darker shade of orange, genes with low CAI values (<10th percentile) are light brown, and clusters of four or more genes with low CAI values (<0.25) are yellow. The final line in each tier is a scale showing position (in base pairs).
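The similarity index used for the histogram bars in Fig. 3 is described as the product of the percent of a protein that lies in the pairwise alignment and the percent amino acid identity across the aligned region. A minimal sketch of that arithmetic follows. It assumes both factors are treated as fractions and the product is rescaled to 0-100; the caption does not state the scale, and the function and parameter names are hypothetical.

```python
def similarity_index(aligned_length, protein_length, identical_residues):
    """Return coverage x identity on a 0-100 scale.

    aligned_length      residues of the protein that fall inside the alignment
    protein_length      total residues in the protein
    identical_residues  identical positions within the aligned region

    The 0-100 scaling is an assumption made for this sketch.
    """
    if protein_length <= 0 or aligned_length <= 0:
        raise ValueError("lengths must be positive")
    if aligned_length > protein_length or identical_residues > aligned_length:
        raise ValueError("inconsistent alignment counts")
    coverage = aligned_length / protein_length
    identity = identical_residues / aligned_length
    return 100.0 * coverage * identity
```

With these definitions, a 400-residue protein aligned over 200 residues at 50 percent identity scores 25, so a short but perfect match and a long but weak match can produce the same bar height.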
[Figure 3 map panels. Each tier carries the rows genes, operons, promoters, PB sites, Haemophilus, Synechocystis, Mycoplasma, Methanococcus, Saccharomyces, Best E. coli hit, log(E. coli hits), and CAI, above a position scale in base pairs; names of previously described genes are printed above the gene row.]
Key to Fig. 3

Rows in each tier: genes (ORFs and RNAs); operons; promoters; protein binding sites; Haemophilus influenzae; Synechocystis sp.; Mycoplasma genitalium; Methanococcus jannaschii; Saccharomyces cerevisiae; best match in E. coli; log(number of E. coli matches); Codon Adaptation Index.

Gene function coding: regulatory function; putative regulatory proteins; cell structure; putative membrane proteins; putative structural proteins; phage, transposons, plasmids; transport and binding proteins; putative transport proteins; energy metabolism; putative chaperones; putative enzymes; other known genes; DNA replication, recombination, modification, and repair; transcription, RNA synthesis, metabolism, and modification; translation and posttranslational protein modification; cell processes (including adaptation and protection); biosynthesis of cofactors, prosthetic groups, and carriers; nucleotide biosynthesis and metabolism; amino acid biosynthesis and metabolism; fatty acid and phospholipid metabolism; central intermediary metabolism; carbon compound catabolism; hypothetical, unclassified, unknown; tRNAs, rRNAs, and misc. RNAs.
REFERENCES This article cites 76 articles, 21 of which you can access for free
http://science.sciencemag.org/content/277/5331/1453#BIBL
PERMISSIONS http://www.sciencemag.org/help/reprints-and-permissions
Science (print ISSN 0036-8075; online ISSN 1095-9203) is published by the American Association for the Advancement of
Science, 1200 New York Avenue NW, Washington, DC 20005. The title Science is a registered trademark of AAAS.
Copyright © 1997 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science.
No claim to original U.S. Government Works.