
Association Genetics - Success and Challenge

• whole genome association or linkage studies in humans:
• > 200 diseases or traits tested
→ 5,000 unique SNP-trait associations
BUT
• contribution of individual allele to
trait variation is low for many traits
(also remember that GWAS only detects
genetic component of trait variation, not
environmental!)
• risk of false positive association
• causal mutation / polymorphism
identified in very few loci only
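
The risk of false positive associations grows with the number of markers tested. A minimal sketch of that arithmetic, assuming a Bonferroni correction and one million independent tests (both are illustrative assumptions, not figures from the slide):

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def expected_false_positives(per_test_alpha, n_tests):
    """Expected number of unassociated markers that pass the threshold by chance."""
    return per_test_alpha * n_tests


n_markers = 1_000_000  # assumed number of independent SNP tests
alpha = 0.05           # conventional significance level

uncorrected = expected_false_positives(alpha, n_markers)
threshold = bonferroni_threshold(alpha, n_markers)

print(f"false positives expected at p < {alpha}: {uncorrected:,.0f}")
print(f"Bonferroni per-marker threshold: {threshold:.1e}")
```

Without correction, tens of thousands of markers would look significant by chance alone, which is why genome-wide studies use very strict per-marker thresholds.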

GWAS catalog at EBI: https://www.ebi.ac.uk/gwas/


Association Genetics - Success and Challenge

• missed/weak associations

• contribution of individual genes may be small (especially when studying complex traits)

• available markers too far away from causal polymorphism for given population

• possible solutions:

• increase population size (difficult to change when studying rare phenotypes)

• increase # of markers (if possible to ALL polymorphisms present in population)

• multiple genotyping approaches using next generation sequencing


The 1000 Genomes Project
• Sequenced the complete genome of >2,500 individuals
• identified almost all polymorphisms present in 26 populations
• elucidated the genetic diversity within the species
• Phase three papers published in 2015 (Nature, 526:68–74 and 75-81)
• ~80 million polymorphisms identified across population
• most are rare; only ~8 million with frequency >5%
• the typical individual genome (compared to reference):
• ~3.7 million variant SNP sites (~13,000 unique)
• ~600,000 small insertions/deletions (indels)
• ~1,100 large deletions / duplications
• ~11,000 peptide changing variants, 250 to 300 loss-of-function
• ~2,000 variants associated with GWAS traits
• ~25 clinical variants related to rare disease
Stepping back: Why sequencing whole genomes?
• association genetics
• polymorphism discovery: basis for whole genome association studies
• allows jump from genetic to physical map (sequence): candidate genes
• now polymorphism analysis (true genome-wide association)
• genome structure
• frequency and distribution of genes, repetitive elements, viral integrations, ...
• whole genome duplications
• comparative genomics
• evolution of genes and gene families
• loss and gain of genes on a global scale
• medical genomics - individualized medicine
• predict / prevent disease (genetic disposition)
• disease diagnosis
• individualized treatment (e.g. pharmacogenomics)
Genome sizes vary dramatically
• mRNA size (average): 2,500 bp
• gene size (average): 20,000 bp
• genome size:
  E. coli                  5,000,000 bp    ~5,000 genes
  yeast                   12,500,000 bp    ~5,700 genes
  C. elegans             100,000,000 bp   ~19,500 genes
  Drosophila             122,600,000 bp   ~13,400 genes
  Arabidopsis            125,000,000 bp   ~28,000 genes
  Poplar                 500,000,000 bp   ~42,000 genes
  zebrafish            1,200,000,000 bp   ~15,800 genes
  human                3,300,000,000 bp   ~20,000 genes
  spruce              20,000,000,000 bp   ~28,000 genes
  lungfish           130,000,000,000 bp
  Paris japonica     149,000,000,000 bp
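
Genome size and gene number are only loosely coupled. A small sketch that turns the figures listed above into "base pairs per gene"; the numbers are copied from the slide, and the helper name `bp_per_gene` is made up for this illustration:

```python
# (genome size in bp, approximate gene count), copied from the list above
genomes = {
    "E. coli": (5_000_000, 5_000),
    "yeast": (12_500_000, 5_700),
    "C. elegans": (100_000_000, 19_500),
    "Drosophila": (122_600_000, 13_400),
    "Arabidopsis": (125_000_000, 28_000),
    "Poplar": (500_000_000, 42_000),
    "zebrafish": (1_200_000_000, 15_800),
    "human": (3_300_000_000, 20_000),
    "spruce": (20_000_000_000, 28_000),
}


def bp_per_gene(size_bp, n_genes):
    """Average amount of genome sequence per annotated gene."""
    return size_bp / n_genes


density = {name: bp_per_gene(*values) for name, values in genomes.items()}

for name, value in density.items():
    print(f"{name:12s} {value:>12,.0f} bp per gene")
```

With these figures E. coli has about one gene per 1,000 bp, while spruce has roughly one gene per 700,000 bp although both plants in the list have about the same number of genes.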
Available Sequence Data Grow Exponentially

• Sequence data deposited to NCBI (until 2013):
• exponential growth

Congressional Justification FY 2015; NLM; https://www.nlm.nih.gov/about/2015CJ.html
Cost of sequencing: $/human genome
• the original ‘human genome project’:
• Sanger sequencing
• tool development
• time needed: 13 years (1990-2003)
• overall cost: $2,700,000,000
• more than exponential decline
• resequencing an individual genome (Sanger): $100,000,000
• by 2008 (next gen sequencing): $1,000,000
• current cost: $1,000
NIH: Sequencing Costs: https://www.genome.gov/sequencingcosts/
Genome sequencing principle

• driven by sequencing technology


• max. length of a contiguous read is limited

• major challenges:
• throughput
• read length
Genome sequencing: classical Sanger sequencing

Clone by clone approach


Genome sequencing: next generation sequencing
Sanger Sequencing: the principle
Fig.18.15, partial, Brooker, Genetics, 3rd edition

• mix DNA (plasmid/PCR product) with primer


• add dNTPs (all four bases), one radioactively labeled
• add one terminator (ddNTPs) per reaction
• start (multiple rounds of) polymerization
• elongation stops randomly whenever a terminator is incorporated
• separate fragments on gel
• expose to film
Automated Sanger Sequencing

• same principle
• terminator is labeled
• for each terminator a different fluorescent label is used
• all four reactions are run on one lane of the gel
• fluorescence is measured at end of gel
(liquid capillary gel)
Sanger Sequencing – Sample Preparation
Whole Genome Sequencing Strategy – The Classics

• Sanger Sequencing
• transform E. coli
• re-isolate plasmid
• perform sequencing
reaction

• repeat this ~20,000,000 times for a human genome (3.3 billion bp, up to 1,000 bases per reaction, several-fold coverage)

throughput: up to 96 (or 384) reactions per run
up to 1,000 bases per reaction
2.5 h per run
accuracy about 99.999%
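
A back-of-the-envelope estimate of what these throughput figures mean for one human genome on a single instrument. The read length, run size and run time come from the slide; the 6-fold coverage is an assumption made only for this sketch:

```python
import math

GENOME_BP = 3_300_000_000  # human genome size
READ_LEN = 1_000           # bases per reaction (upper bound from the slide)
REACTIONS_PER_RUN = 384    # largest run format from the slide
HOURS_PER_RUN = 2.5
COVERAGE = 6               # assumed redundancy, not stated on the slide

# number of sequencing reactions needed to cover the genome COVERAGE times
reactions = math.ceil(GENOME_BP * COVERAGE / READ_LEN)
# number of instrument runs and the time they take back to back
runs = math.ceil(reactions / REACTIONS_PER_RUN)
years = runs * HOURS_PER_RUN / (24 * 365)

print(f"{reactions:,} reactions, {runs:,} runs, {years:.1f} years on one machine")
```

Under these assumptions a single capillary instrument would be busy for well over a decade, which is why the classical projects ran many machines in parallel.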
Whole Genome Sequencing Strategy – The Next Generation

• massive parallel sequencing


• immobilize DNA fragments on surface
• (amplify fragments)
• sequence millions (or even billions) of
fragments in parallel

• very little sample prep


• high-throughput

• biggest challenge:
• shorter reads
• amount of data
→ assembly
Sequencing by Synthesis – The Next Gen Principle

1. Immobilize DNA fragment (or amplicon)


2. Anneal Primer
3. Allow synthesis of one base
4. Detect which base was incorporated
5. Go back to 3., repeat cycle

• two concepts:
• stop reaction after incorporating one
base, measure result
• monitor reaction ‘on the go’
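
The five-step cycle above can be sketched as a toy simulation of the first concept (stop after one base, then measure). It assumes an error-free polymerase and a template written in the order in which it is copied; the function name is hypothetical:

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template, n_cycles):
    """Return the bases detected while copying `template` one base per cycle."""
    detected = []
    for cycle in range(min(n_cycles, len(template))):
        # step 3: the polymerase adds the base complementary to the template
        incorporated = COMPLEMENT[template[cycle]]
        # step 4: the label of the incorporated base is read out
        detected.append(incorporated)
        # step 5: the loop continues with the next template position
    return "".join(detected)


read = sequence_by_synthesis("TACGGA", n_cycles=4)
print(read)
```

The number of cycles sets the read length: four cycles on a six-base template yield a four-base read.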

Gupta (2008) Trends Biotech. 26: 573-583
Illumina Sequencing: Library production

1. random fragmentation of DNA into small pieces

2. ligate unique adapters to each end: adaptor flanked shotgun library

3. bind denatured fragments to flow cell surface

4. flow cell surface also contains oligos (primers) complementary to adapters in high density

Note: many other platforms exist that realize amplification / immobilization differently

Goodwin et al. (2016) Nat. Rev. Genet. 17: 333-351
Illumina Sequencing: The Procedure
Library production (Bridge PCR)

5. perform bridge amplification (PCR)


• oligos in the vicinity of the attached fragment act as primers
→ amplicon stays physically at position of original starting fragment

Result: amplified fragments attached to surface (> 100 million spots)

Goodwin et al. (2016) Nat. Rev. Genet. 17: 333-351
Illumina Sequencing: Sequence capture

Sequence capture:
6. add primer complementary to one of the
adapters
7. add reversible terminators (each base labeled
with a different fluorescent dye, fluorophore
prevents incorporation of another base)
8. add polymerase to add ONLY ONE base,
wash unincorporated nucleotides off
9. scan flow cell with fluorescent laser
scanner, record first base
10. cleave off fluorescent dye; this reverts the incorporated terminator to a regular base
11. repeat 7. to 10. to capture subsequent bases

Goodwin et al. (2016) Nat. Rev. Genet. 17: 333-351
Illumina Sequencing: Sequence capture (continued)
[Figure: bases recorded cycle by cycle for each cluster]
• read length up to 150
• plus up to another 20,000,000,000 reads...
Illumina Sequencing: Summary
throughput: up to 20,000,000,000 sequence reactions per run,
but only 150 (or 300) bases max per reaction
up to 6,000 GB per run high quality sequence in a few days

• short sequence length per read:


• difficult to assemble complete genomes de novo
• repetitive elements with >100 almost identical bases are fairly frequent....
• paired end reads possible (get 150 bases from each side of each PCR cluster)
• assembly “easy” if a reference genome available (or of a fairly close relative)
• de novo assembly of genomes possible only if combined with other methods that provide
longer read length
• used to re-sequence genomes of individuals (or cells from an individual)
• analysis of polymorphisms (SNP markers) for association genetics
• identification of somatic mutations, e.g. in cancer tissues
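
The yield quoted in the summary can be checked with simple arithmetic. This sketch assumes paired-end reads of 150 bases from each side of every cluster, and a 30-fold coverage target per genome, which is an assumption and not a figure from the slide:

```python
READS_PER_RUN = 20_000_000_000  # upper bound from the summary above
BASES_PER_READ = 150
GENOME_BP = 3_300_000_000       # human genome size
TARGET_COVERAGE = 30            # assumed coverage per re-sequenced genome


def run_yield(reads, read_len, paired_end=True):
    """Total number of bases produced in one run."""
    return reads * read_len * (2 if paired_end else 1)


total_bases = run_yield(READS_PER_RUN, BASES_PER_READ)
gigabases = total_bases / 1e9
genomes_per_run = total_bases // (GENOME_BP * TARGET_COVERAGE)

print(f"{gigabases:,.0f} gigabases per run")
print(f"about {genomes_per_run} human genomes at {TARGET_COVERAGE}x coverage")
```

The paired-end yield matches the 6,000 GB per run stated in the summary.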
Turning It Upside Down: PacBio Single Molecule Real Time Sequencing

• the idea: immobilize the DNA polymerase, not the DNA

Fig 1e; Metzker (2010) Nat. Rev. Genet. 11: 31-46

• the problem:
• can’t stop polymerization after one incorporation, the DNA would leave the polymerase...
• have to distinguish a base that is being incorporated from one that is just diffusing by (incorporation takes longer, but still just a few milliseconds, and you’ll have way more diffusing than incorporating bases...)
PacBio: Single Molecule Real Time Sequencing (SMRT)

• DNA polymerase is immobilized at bottom of


‘wells’ (zero-mode waveguides)
• wells are so small that light CANNOT pass through; it only enters the lowest few tens of nanometers
→ fluorescence is measured ONLY in a very small volume around the polymerase (20 zeptoliters; 1 zeptoliter = 10⁻²¹ L or 0.000000000000000000001 L)
• reduces actual concentration of nucleotides in
detection area (pico-molar) compared to overall
conc. in solution above (micro-molar)
• provides optimal conc. for polymerase, but
reduces noise in detection area

Goodwin et al. (2016) Nat. Rev. Genet. 17: 333-351
SMRT: Sequence capture
• when incorporated, a labeled nucleotide stays longer in the detection area (milliseconds) than a diffusing nucleotide
• this increases fluorescence over background
• after incorporation the pyrophosphate, and with it the label, is cleaved and diffuses away
• detection is continuous
• very fast
• can go on very long
• but with high error rate
Fig 1b; Metzker (2010) Nat. Rev. Genet. 11: 31-46
SMRT: Repeated sequence capture
• sample preparation:
• fragment DNA (~20 kb)
• ligate hairpin adapter to each end
• denature DNA
(generates circular single stranded DNA)
→ the same DNA will be sequenced multiple times (reduces error rate)
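
Why repeated passes reduce the error rate can be illustrated with a majority vote per position. This is a deliberately simplified sketch: it assumes the passes are already aligned, have equal length and contain substitution errors only, whereas real single-molecule reads also contain insertions and deletions and need proper alignment first.

```python
from collections import Counter


def consensus(passes):
    """Majority base at each position across aligned passes of the same molecule."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this sketch expects aligned passes of equal length")
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base in this column
        for column in zip(*passes)
    )


# three passes over the same circular template, each with at most one random error
passes = ["ACGTAC", "ACGTTC", "ACCTAC"]
print(consensus(passes))
```

As long as errors are random and rare, they seldom hit the same position in most passes, so the vote recovers the original sequence.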

Summary: only up to 55,000 sequence reactions per run,
but very long reads (45 kb average, up to 300 kb)
up to 20 GB per run
Moving Away from Enzymes: Nanopore Sequencing
Principle:
• membrane containing small pores (~1.5-2 nm): nanopores
• target DNA on one side
• current applied (causing DNA to move through the pores)
• movement blocks channel, leads to change in current
• induced current change differs depending on base composition of DNA in pore
• received signal compared to known signals of oligonucleotides
• new signal as new base enters pore
[Figure labels: protein pore in lipid bilayer membrane; current]

Goodwin et al. (2016) Nat. Rev. Genet. 17: 333-351
Moving Away from Enzymes: Nanopore Sequencing

• sample prep:
• fragment DNA (200 kb plus)
• add leader to one side
→ directs DNA to motor protein at pore
• add hairpin adapter to other side
→ allows sequencing the same fragment twice
• still has very high error rate (100 times higher than Illumina)
• BUT:
• very long reads (10,000 times longer than Illumina, 10 times longer than SMRT)
• high throughput (up to 40 GB)
• very fast (up to 500 bp per second, two days per run)
• small instrument footprint
Summary

• advantages of next generation sequencing
• throughput:
• hardly any sample preparation
• millions and billions of sequence reads
• problems
• millions and billions of sequence reads
• short read length (for some)
• higher error rate (compared to Sanger)
→ difficult to assemble de novo

• applications
• whole genome (re-)sequencing
• de novo sequencing
• SNP analysis
• thousand genome projects
• somatic variations (cancer)
• comparative genomics (neanderthal, black death genome..)
• metagenomics
• microbiome
• epigenomics
• DNA methylome
• transcriptomics
Contig Assembly

• classical alignment programs (e.g. CAP3)


• pairwise comparison
• identify identical region, extend search in both directions
• penalize differences and gaps
• “The CAP program is efficient in computer time and
memory; it took about 4 h to assemble a set of 1015
fragments into long contigs on a Sun workstation.”
Huang (1992) Genomics 14:18-25
→ if linear, it would take 8,997 years to assemble a single Illumina lane on that Sun Workstation….
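
The 8,997 years can be reproduced by scaling the quoted CAP run time linearly. The sketch assumes that the "single Illumina lane" corresponds to the 20,000,000,000 reads given in the Illumina summary, which is an inference from the arithmetic rather than something the slide states:

```python
CAP_FRAGMENTS = 1015             # fragments assembled in the quoted benchmark
CAP_HOURS = 4                    # run time of that benchmark
ILLUMINA_READS = 20_000_000_000  # assumed number of reads to assemble

# linear scaling: run time grows in proportion to the number of fragments
hours = ILLUMINA_READS / CAP_FRAGMENTS * CAP_HOURS
years = hours / (24 * 365)

print(f"{years:,.0f} years")
```

Linear scaling is the optimistic case: comparing every read with every other read grows with the square of the number of reads, so the real figure would be far larger.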

Huang & Madan (1999) Genome Res. 9:868-877
Contig Assembly – de novo
• substrings of a given length (k-mers)
generated from each read
• each possible k-mer represents a node in a graph
• two nodes are connected if shifting a k-mer by one
character creates an exact overlap (no fuzziness…)
• if k-mers from different reads follow the same path,
they are overlapping
• common path may diverge (polymorphisms)
• single-nucleotide differences cause ‘bubbles’ of
length k in the graph
• introns or deletions introduce shorter path in
the graph
• chains of common nodes are collapsed into a single
node
• possible paths (blue, red, yellow and green) through the graph are chosen
• isoforms are then assembled
Martin (2011) Nat. Rev. Genet. 12:671-682
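
The graph construction described above can be sketched in a few lines. This is a minimal illustration, not an assembler: it ignores reverse complements, sequencing errors and the collapsing of linear chains, and the function names and example reads are made up for the sketch.

```python
from collections import defaultdict


def kmers(read, k):
    """All substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Map each k-mer (node) to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        nodes = kmers(read, k)
        for left, right in zip(nodes, nodes[1:]):
            # consecutive k-mers overlap exactly by k-1 characters
            graph[left].add(right)
    return graph


# two reads that differ by a single nucleotide (T vs. A at the fifth position)
reads = ["ACGTTGCATC", "ACGTAGCATC"]
graph = build_graph(reads, k=4)

# nodes with more than one successor mark the start of a bubble
branch_points = sorted(node for node, nxt in graph.items() if len(nxt) > 1)
print(branch_points)
```

The single-nucleotide difference makes the path split after `ACGT` and rejoin at `GCAT`, with four k-mers (k = 4) on each arm of the bubble.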
Contig Assembly – de novo
• Advantages:
• no reference genome needed
• identifies novel regions not present in reference (if available)
• can identify genomic DNA/transcripts from exogenous sources
(bacteria, viruses, ... )
• for RNAseq:
• long introns are not a problem (simply not present in
comparisons)
• Disadvantages
• computationally intensive (needs up to a terabyte of memory)
• creates smaller contigs: many gaps in assembly
→ need much higher sequencing depth
→ for genome seq: needs complementation with longer reads
→ for RNAseq: many split transcripts...
The Result of A Whole Genome Sequencing Project
Gene Finding Algorithms – Focus on Protein Coding Genes
central dogma of gene expression
• gene structures:
• promoter
• transcriptional start site
• exons / introns
• different types
• open reading frame
• START/STOP codons
• untranslated regions
• 5’ and 3’ UTR
• poly-adenylation signal
• transcription termination signal
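
Of the features listed above, the open reading frame is the easiest to search for directly. A minimal sketch of an ORF scan on the forward strand follows; it ignores introns and the reverse strand, and the function name and the `min_codons` cut-off are assumptions made for this illustration:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) of ATG...stop stretches in the three forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first START codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop codon
                start = None
    return orfs


seq = "CCATGGCTTAAGG"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

In real genomes such a scan returns many short ORFs that occur by chance, which is one reason why gene finders combine several features instead of relying on one.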

gene features are defined in the DNA sequence
but they are short and not highly conserved….

Zhang (2002) Nat. Rev. Genet. 3:698-709
Gene Finding Algorithms – Focus on Protein Coding Genes
gene features are defined in the DNA sequence
but they are short and not highly conserved….

Zhang (2002) Nat. Rev. Genet. 3:698-709

• for example, intron definition
• 5ʹ splice site consensus: AG|GU
• intron branch site: CUNAN
• 3ʹ splice site consensus: AG|G

But in a complete gene, at least some of these features should be present in a particular order
Gene Finding Algorithms – Focus on Protein Coding Genes
• search for likelihood of gene features in the right order [using a Hidden Markov Model (HMM)]
• based on a training set of known genes from your organism
• specify characteristics of each part [or states] of a gene
• determine frequencies that a state (e.g. an exon with translational start (Einit)) comes after a particular other (e.g. a 5’UTR) [transitions]
• use these ‘learned’ characteristics to evaluate putative gene states and transitions in the complete genome
→ determine likelihood of having found a complete gene

resulting gene models are the first annotation of the genome
Zhang (2002) Nat. Rev. Genet. 3:698-709
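
The "learning" of transitions can be sketched as counting how often one state follows another in a training set. The sketch covers transitions only; a full HMM gene finder also models the sequence content of each state and searches for the best path through the genome. The state names other than Einit and the UTRs, and the two training genes, are hypothetical:

```python
from collections import Counter, defaultdict


def transition_probabilities(training_paths):
    """Estimate P(next state | current state) from labeled gene structures."""
    counts = defaultdict(Counter)
    for path in training_paths:
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


def path_likelihood(path, probs):
    """Product of the learned transition probabilities along a candidate path."""
    likelihood = 1.0
    for current, following in zip(path, path[1:]):
        # transitions never seen in training get probability 0
        likelihood *= probs.get(current, {}).get(following, 0.0)
    return likelihood


# hypothetical training set: state order of two known genes
training = [
    ["5UTR", "Einit", "Intron", "Eterm", "3UTR"],
    ["5UTR", "Einit", "Intron", "Einternal", "Intron", "Eterm", "3UTR"],
]
probs = transition_probabilities(training)

candidate = ["5UTR", "Einit", "Intron", "Eterm", "3UTR"]
print(path_likelihood(candidate, probs))
```

A candidate whose states appear in an order never seen in training, such as a 5’UTR followed directly by a 3’UTR, receives a likelihood of zero.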
Gene annotation – Similarity Based Approaches

• take advantage of existing transcript or protein sequence information


• map available transcript sequences from same species onto genome
• search genome (protein translation of all six frames) against all
known protein coding genes
• map annotated genes from closely related species onto genome
• disadvantage: you can find only what’s already known…
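
Searching a genome against known proteins first requires the protein translation of all six frames. A minimal sketch using the standard genetic code; real searches are done with dedicated similarity search tools, and the function names here are made up for the illustration:

```python
from itertools import product

# standard genetic code, codons ordered with T, C, A, G at each position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Three forward frames followed by three reverse-strand frames."""
    strands = (seq, reverse_complement(seq))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


frames = six_frame_translation("ATGGCTTAA")
for number, protein in enumerate(frames, start=1):
    print(number, protein)
```

Each of the six translations can then be compared against a database of known proteins; `*` marks a stop codon.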

• Most frequently used: a combined approach


• gene predictors that take mapped transcripts into account
Gene and genome annotation: Functional annotation and data dissemination

• retrieve data for user driven analysis
• sequences
• annotations

Rust et al (2002) Drug Disc. Today. 7:S70-S76
Graphical Display of a Genome – Genome Browser

• information linked to genome sequence


• gene models
• transcripts
• EST, full length cDNA
• also from other species
• similarity to other genomes (‘Vista plot’)
• regulatory elements
• genetic markers (microsatellites, polymorphisms, etc.)
• mutations (point mutations, transposon, insertions)
• linked to additional data available
Genome Portals

• additional information linked to gene model


• sequences
• Functional Annotations
• name(s)
• function (known or predicted)
• protein properties (domains, size, etc.)
• systematic categorization
(Gene Ontologies)
• mutant phenotypes and lines available
• references
• expression patterns
•…
