
Microsatellites in Small Genomes

Milo Thurston
CEH Oxford
Genomes
• What is the genetic capacity of an
organism?
• What does an organism do with this
genetic capacity?
• How fast does this genetic capacity
change over time? How?
Microsatellites

a feature found across all genomes

(eukaryotes, bacteria, plasmids,
viruses, organelles)

Create a
“Microbial Genomes Microsatellite
Database”
inside the genomemine
Microsatellites are hot-spots of
mutation
(short direct repeats of 1-6 bp)

ATCGATGCATATATATATATATATATTGCCTGG (AT)9

ATCGATGCATATATATATATATATTGCCTGG (AT)8

ATCGATGCATATATATATATATATATTGCCTGG (AT)9

ATCGATGCATATATATATATATATATATATTGCCTGG (AT)11
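
The definition above (short direct repeats of 1-6 bp) can be sketched as a small pattern search. This is only an illustrative sketch, not the method used to build the database: the function name `find_microsatellites`, the `min_units` default, and the restriction to perfect (uninterrupted) repeats are all assumptions made here.

```python
import re

def find_microsatellites(seq, min_units=3, max_motif=6):
    """Return (start, motif, units) for perfect tandem repeats of 1-6 bp motifs.

    Hypothetical helper: min_units is an arbitrary illustrative cut-off.
    """
    # Lazy quantifier: the shortest motif that repeats is reported,
    # so an AT tract is called (AT)n rather than (ATAT)n/2.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_units - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        units = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, units))
    return hits

# The example locus from the slide, with 9, 8 and 11 copies of AT.
for n in (9, 8, 11):
    locus = "ATCGATGC" + "AT" * n + "TGCCTGG"
    print(find_microsatellites(locus))
# [(8, 'AT', 9)]
# [(8, 'AT', 8)]
# [(8, 'AT', 11)]
```

The start position is 0-based. Imperfect repeats (with point substitutions inside the tract) are not detected by this pattern.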
Microsatellites
• Molecular Markers in Eukaryotes
• Triplet-Repeat Expansion Diseases
• Contingency Loci in Pathogenic Prokaryotes

Why study microsatellites in ‘small’
genomes?
• Present in significant numbers, but not in all
genomes
• Availability of Collections of Genomes
• Mutation & Selection
• Develop New Molecular Markers
• Insights into Gene and Genome Mutability
• Detect and Study Loci under Selection
• Experimental Systems
Evolutionary Potential of
Microsatellite Loci
• Rapid Rates (10^-2 to 10^-5)
• Reversible

ATATATATATATATATA

ATATATATATATATATATAT
Haemophilus influenzae
ATATATATATATATATA
lic1

“Phase Variation”
Many pathogenic bacteria have the ability
to rapidly switch the abundances and types
of molecules on their cell surface.
ON - OFF Molecular
Switches
(CAAT)40

(translational switch)

(AT)8
(transcriptional switch)
ON - OFF Molecular
Switches
(CAAT)39

(translational switch)

(AT)7
(transcriptional switch)
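
Why a single repeat unit can act as a translational switch can be shown with a little arithmetic: a tetranucleotide unit is not a multiple of the 3 bp codon length, so gaining or losing one unit shifts the downstream reading frame. The sketch below assumes the repeat tract lies inside the coding sequence; the function name `downstream_frame` is hypothetical, and which repeat number corresponds to ON is not stated on the slide and is not assumed here.

```python
def downstream_frame(motif, units):
    """Frame offset (0, 1 or 2) introduced by a repeat tract of `units` copies.

    Assumption: the tract sits within an open reading frame, so the tract
    length modulo 3 determines the frame of everything downstream.
    """
    return (len(motif) * units) % 3

for n in (40, 39):
    print("(CAAT)%d -> frame offset %d" % (n, downstream_frame("CAAT", n)))
# (CAAT)40 -> frame offset 1
# (CAAT)39 -> frame offset 0
```

Because the two offsets differ, only one of the two repeat numbers can leave the downstream codons in the original frame. A trinucleotide repeat, by contrast, would give the same offset for every repeat number.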
Origin and Maintenance of Loci Involved
in Antigenic Variation, Ecological
Tradeoffs &
‘Mutational Phenotypes’

ATGCAATCAATCAATCAATCAATCAATCAA
TCAATCAATCAATCAATCAATCAACAATCA
ATCAATCAATCAATCAAATTGTAGGATTTG
TTAAAACTTGCTACAAGCCTGAGGAAGTAT
TTCATTTTCTTCATCAGCATTCCATTCCTT
TTTCCTCCATTGGAGGAATGACCAATCAAA
ATGTTCTACTTAATATTTCTGGAGTTAAGT
TTGTATTACGGATCCCTAATGCCGTAAATT
TATCACTTATAAATCGAGA........
Genome Sequencing aids in the discovery of
microsatellite “molecular switches” in pathogenic
bacteria (“Contingency Loci”; Moxon, Rainey, Nowak & Lenski, 1994)
Haemophilus influenzae (Hood et al., 1996)
Helicobacter pylori (Tomb et al., 1997; Alm et al., 1999)
Campylobacter jejuni (Parkhill et al., 2000)
Neisseria meningitidis (Tettelin et al., 2000; Parkhill et al., 2000)

Haemophilus influenzae
Functional Group                   Rd  pool  Gene Loci
Evasins                             0     0
LPS Biosynthesis                    4     5  lic1, lic2, lic3, lgtC, lex2
Adhesins                            3     4  hmw1, hmw2, yadA, hifA/hifB
Iron Acquisition                    4     7  hgpA-C, H10635, 0661, 0712, 1565
Restriction-Modification Systems    1     2  mod, hsd

Neisseria meningitidis
Functional Group                  NmB  NmA  pool  Gene Loci
Evasins                             2    1     2  siaD, porA
LPS Biosynthesis                    2    2     4  lgtA, lgtC, lgtD, lgtG
Adhesins                           10    7    13  pilC1, pilC2, pglA, opc, opaA-G, NMB1998, yadA
Iron Acquisition                    3    2     5  hpuA, hmbR, lgpA, frpB
Restriction-Modification Systems    4    2     4  NMB0831, 1223, 1375, 1261, NMA1040, 1467
genomic survey...
Genomes (1802) with (467) and without microsatellites
(1329)
[Figure: Num. Microsatellites plotted against Log Genome Size and G+C Content]
(Thresholds: mononucleotide repeats > 13 bp; di- to hexanucleotide repeats > 6 repeat units)
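
The survey thresholds can be written as a simple filter on (motif, repeat number) pairs. This is a minimal sketch assuming the cut-offs are strict inequalities, as the ">" on the slide suggests; the function name `passes_threshold` is made up for illustration.

```python
def passes_threshold(motif, units):
    """Apply the survey cut-offs to one repeat tract.

    Mononucleotide tracts must be longer than 13 bp; di- to hexanucleotide
    tracts must have more than 6 repeat units. Motifs outside 1-6 bp are
    not counted as microsatellites.
    """
    motif_len = len(motif)
    if motif_len < 1 or motif_len > 6:
        return False
    if motif_len == 1:
        return units > 13  # for a 1 bp motif, units equals length in bp
    return units > 6

print(passes_threshold("A", 13))      # False
print(passes_threshold("A", 14))      # True
print(passes_threshold("AT", 6))      # False
print(passes_threshold("CAAT", 40))   # True
```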
Genomes with the most Microsatellites
Taxa           Species                                 Genome Size in kb   No.  Freq in bp
Mitochondrion  Saccharomyces cerevisiae                               86   171         502
Virus          Molluscum contagiosum virus subtype 1                 190    79       2,409
Bacteria       Xanthomonas axonopodis pv. citri str                5,176    61      84,845
Nucleomorph    Guillardia theta                                      174    44       3,958
Bacteria       Xanthomonas campestris pv. campestris               5,076    43     118,051
Virus          shrimp white spot syndrome virus                      305    42       7,264
Chloroplast    Marchantia polymorpha                                 121    40       3,026
Nucleomorph    Guillardia theta                                      181    40       4,523
Virus          Gallid herpesvirus 3                                  164    40       4,107
Virus          Spodoptera exigua nucleopolyhedrovirus                136    38       3,569
Bacteria       Helicobacter pylori 26695                           1,668    37      45,077
Chloroplast    Chlorella vulgaris                                    151    37       4,071
Bacteria       Xylella fastidiosa                                  2,679    35      76,552
Bacteria       Helicobacter pylori J99                             1,644    34      48,348

• Repeats extremely common in some genomes
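
The "Freq in bp" column appears to be the genome size divided by the number of microsatellites, that is, one microsatellite per so many bp. That reading is an inference from the table values, not something the slide states, and the helper name below is hypothetical. A minimal sketch:

```python
def bp_per_microsatellite(genome_size_bp, n_microsatellites):
    """Average spacing in bp between microsatellites (assumed meaning of 'Freq').

    Returns None for genomes without microsatellites.
    """
    if n_microsatellites == 0:
        return None
    return genome_size_bp / n_microsatellites

# Saccharomyces cerevisiae mitochondrion: 86 kb (rounded) and 171 microsatellites.
print(round(bp_per_microsatellite(86_000, 171)))     # 503
# Helicobacter pylori 26695: 1,668 kb (rounded) and 37 microsatellites.
print(round(bp_per_microsatellite(1_668_000, 37)))   # 45081
```

These values are close to the tabulated 502 and 45,077; the small differences are consistent with the table having been computed from exact genome sizes rather than the rounded kb figures used here.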
Genomes with the lowest G+C content
Taxa           Species                                 Genome Size in kb  G+C (%)   No.  Freq in bp
Mitochondrion  Monosiga brevicollis                                   77       14     5      15,314
Mitochondrion  Apis mellifera ligustica                               16       15     8       2,043
Mitochondrion  Saccharomyces cerevisiae                               86       17   171         502
Mitochondrion  Drosophila melanogaster                                20       17    24         813
Virus          Amsacta moorei entomopoxvirus                         232       17    11      21,127
Mitochondrion  Pichia canadensis                                      28       18    31         893
Mitochondrion  Bombyx mori                                            16       18    11       1,422
Virus          Melanoplus sanguinipes entomopoxvirus                 236       18    11      21,465
Mitochondrion  Bombyx mandarina                                       16       18    10       1,593
Mitochondrion  Schizosaccharomyces japonicus                          80       19     8      10,007
Mitochondrion  Ostrinia furnacalis                                    15       19     1      14,536
Mitochondrion  Ostrinia nubilalis                                     15       19     1      14,535
Mitochondrion  Saccharomyces castellii                                26       20    18       1,431
Mitochondrion  Tetrahymena thermophila                                48       20    11       4,325

• Repeats extremely common in some genomes
• G+C content is a factor (mutational pressure), but some
extremely G+C skewed genomes lack large numbers of
microsatellites (negative selection)
Genomes with the highest G+C content
Taxa           Species                                 Genome Size in kb  G+C (%)   No.  Freq in bp
Virus          Bovine herpesvirus 1                                  135       72    25       5,412
Virus          human herpesvirus 2                                   155       70    26       5,952
Virus          human herpesvirus 1                                   152       68    17       8,957
Bacteria       Caulobacter vibrioides                              4,017       67    29     138,515
Bacteria       Ralstonia solanacearum                              3,716       67    28     132,729
Bacteria       Deinococcus radiodurans                             2,649       67     3     882,879
Bacteria       Halobacterium sp. NRC-1                             2,014       67     3     671,413
Virus          Tupaia herpesvirus                                    196       66    16      12,241
Bacteria       Pseudomonas aeruginosa                              6,264       66     5   1,252,881
Virus          Grapevine fleck virus                                   8       66     2       3,782
Bacteria       Xanthomonas campestris pv. campestris               5,076       65    43     118,051
Bacteria       Mycobacterium tuberculosis                          4,412       65     2   2,205,765
Bacteria       Mycobacterium tuberculosis CDC1551                  4,404       65     1   4,403,836
Bacteria       Xanthomonas axonopodis pv. citri str                5,176       64    61      84,845
Plasmid        Rhodococcus equi                                       81       64     1      80,609
Virus          Molluscum contagiosum virus subtype 1                 190       63    79       2,409
Microsatellite Footprint Length versus Genome Size
[Figure: two panels plotting Footprint Length against Log Genome Size (axis 10^3 to 10^7)]
Longest repeats are
• extremely long
• hexanucleotides in Herpes viruses, vertebrate
mitochondrial genomes, VNTRs in pathogenic
prokaryotes, contingency loci
• artefact (plasmid dinucleotide)
• viral virulence factor (mononucleotide)
• Include long polymorphic repeats in Baculoviruses
(variety of repeats)
• largely unannotated
Microsatellites in Bacteria
[Figure labels: H. influenzae, pathogenic Neisseria x 2, H. pylori x 2, C. jejuni, E. coli, M. genitalium, VNTRs]
Observations
• ‘Small genomes’ have significant numbers of long
microsatellites
• A variety of factors, including genome size and G+C content,
contribute to presence/absence
• Taxonomic differences (numbers, motif types, biological
significance)
• Next step requires extensive curation (meta-data, genetic
content, homology, literature on phenotype and mutability)
genomemine/genomebank

merging evolutionary and ecological
meta-data with complete genome
sequences

“key”=“value”
information
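
As an illustration of the “key”=“value” idea, meta-data for one genome could be read into a dictionary as sketched below. The line format, the parser name `parse_record`, and the choice of fields are assumptions made for this example (the values are taken from the Haemophilus influenzae Rd report shown in this talk); the actual genomemine storage format is not described on the slide.

```python
def parse_record(text):
    """Parse lines of the assumed form "key"="value" into a dict."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        record[key.strip().strip('"')] = value.strip().strip('"')
    return record

example = '''
"genus"="Haemophilus"
"species"="influenzae"
"strain"="Rd"
"Genome Size"="1,830,138"
'''
record = parse_record(example)
print(record["genus"], record["species"], record["strain"])
# Haemophilus influenzae Rd
```

Keeping every field as a plain key and value pair means new fields can be added for one genome without changing the records of the others.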
Motivations
• Facilitate new computational studies
• Growth in the number of genomes
• Biological Patterns
• Biases
• Evaluate Prospective Data Sets
• Hypothesis Generation
• Inform ongoing computational/empirical
studies
genomemine/genomebank
• Automated retrieval of genomes (bacteria, plasmids, viruses, and organelles)

• Meta-data collected from:
– NCBI genome annotations (Genome Size, G+C, Taxonomy, Nucleic Acid type,
circular/linear, Number of Coding Regions, Percent Coding, A, C, T, G, A/C/T/G skew, etc)

– Primary genome publications (‘Why sequenced’, number of chromosomes, publication
date, number of ribosomal operons, tRNA genes, pseudogenes, megaplasmids, contingency loci,
etc).

– Ecological literature (habitat, extremophile?, host, carbon source, oxygen, shape, motile?,
etc)

– Analysis of meta-data (Description of Collections) (Total MB sequenced, subtotals
of any subset or taxonomic level, alphabetical order, publication order, ranks, etc)

– Expert input on specific taxonomic groups (naming conventions, host range,
variables not shared across genomes)

– Informatic Studies (modules) (microsatellites, low complexity regions, orphans, etc)
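
Several of the sequence-derived fields listed above (A, C, T, G counts, G+C, skew) can be computed directly from a genome sequence. The sketch below uses the conventional definitions, G+C as a percentage of A+C+G+T and skew as (G-C)/(G+C) and (A-T)/(A+T); the slide does not say which skew definition genomemine uses, so treat these formulas and the function name `base_composition` as assumptions.

```python
from collections import Counter

def base_composition(seq):
    """Base counts, G+C percentage and GC/AT skew for a DNA sequence.

    Characters other than A, C, G, T (for example N) are ignored.
    """
    counts = Counter(seq.upper())
    a, c, g, t = (counts[base] for base in "ACGT")
    total = a + c + g + t
    return {
        "A": a, "C": c, "G": g, "T": t,
        "gc_percent": 100.0 * (g + c) / total if total else 0.0,
        "gc_skew": (g - c) / (g + c) if (g + c) else 0.0,
        "at_skew": (a - t) / (a + t) if (a + t) else 0.0,
    }

stats = base_composition("GGGCAT")
print(stats["gc_skew"], stats["at_skew"])
# 0.5 0.0
```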
Finished ‘genomebank’ reports
Feature                         Info  Value
Taxonomy
genus                           ?     Haemophilus
species                         ?     influenzae
strain                          ?     Rd
Core Genome Features
Genome Size                     ?     1,830,138
ORFs                            ?     1709
G+C                             ?     39%
Orphans at Time of Publication  ?     389
Ecology
Primary Habitat                 ?     Human Host/Respiratory Tract
Interaction                     ?     Commensal with ability to cause disease
Computed Features
Microsatellites                 ?     14 (Link in MGMD)
Percent Low Complexity          ?     3.8
Future: Interactive Plotting

[Figure: Proportion Low Complexity and G+C Content; Percent Low Complexity (0 to 14) against G+C Content (0 to 100)]
[Figure: Acquisition of Orphans as genomes are sequenced; Orphans against Non-Orphans, with separate panels for Eubacteria and Archaea]
Ecology Field
Habitat
Primary habitat
Extremophile?
Optimal Temperature
Optimal pH
Environmental Breadth
Trophic Status
Interaction
Obligate?
Guild
Oxygen
Energy
Carbon
Growth
Doubling Time in vitro
Morphology
Shape
Gram stain
Median width
Median length
Volume
Surface to volume ratio
Motile?

Phylogeny
“A Tree Viewer” (ATV)
Zmasek C.M. and Eddy S.R. (2001) ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics 17, 383-384.
http://www.genetics.wustl.edu/eddy/atv
Annotated Phylogenies
NCBI Taxonomy
16S ribosomal DNA
(proteome comparisons)
Summary
• Collections of Genomes present new
opportunities
• Should merge genomes with evolutionary and
ecological meta-data to put genomic information
in an ‘organismal context’
• Biological patterns/rules (biases/artefacts)
emerge (microsatellites)
• genomemine/genomebank
http://www.genomics.ceh.ac.uk/GMINE/
Acknowledgements
Dawn Field Ali Cody
Chris Bayliss
Jennifer Hughes Derek Hood
Adrian Tett Richard Moxon
Andrew Spiers
Sarah Turner
Mark Bailey