Pacific Biosciences - Christoph Koenig - Fundamentals and Applications 26032014

Fundamentals and Applications of Single Molecule, Real-Time (SMRT®) Sequencing
CAT-AgroFood Plant Research International Workshop for PacBio Sequencing

March 26, 2014 Dr. Christoph König
FIND MEANING IN COMPLEXITY

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. Celera is a trademark of Celera Corporation; and HiSeq and MiSeq are trademarks of Illumina, Inc.
© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved.
Single-Molecule, Real-Time DNA Sequencing (SMRT) Is:
• DNA Polymerase
• ZMW Confinement
• Phospholinked Nucleotides

PacBio® RS II Typical Performance
Read Definitions in RS System & SMRT® Analysis v2.0
SMRTbell™ Template
Polymerase Read
Definition:
• Formerly called “read”
• 1 pass
• With adapters
• 1 molecule, 1 pol. read
Uses:
• QC of instrument run

Subreads
Definition:
• Adapters removed
• 1 pass
• 1 molecule, 1+ subread
Uses:
• Applications such as assembly and base modification

Read (of Insert)
Definition:
• The highest quality single sequence for an insert
• 1+ passes including partial passes
• 1 molecule, 1 read
Uses:
• Insert size distribution
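
The relationship between a polymerase read (with adapters) and its subreads (adapters removed) can be sketched on a toy sequence. This is only an illustration: the adapter string, the function name, and the exact-match search are assumptions, and real adapter detection has to tolerate sequencing errors rather than rely on exact matches.

```python
from typing import List


def split_subreads(polymerase_read: str, adapter: str) -> List[str]:
    """Cut a polymerase read at every exact adapter match and keep the pieces."""
    # Simplification: exact string matching, no tolerance for sequencing errors.
    return [piece for piece in polymerase_read.split(adapter) if piece]


ADAPTER = "ATCTCTCTC"          # made-up adapter, not the real SMRTbell sequence
forward = "GATTACAGATTACA"     # toy insert, forward strand
reverse = "TGTAATCTGTAATC"     # reverse complement of the toy insert

# One full forward pass, one full reverse pass, then a partial pass.
polymerase_read = forward + ADAPTER + reverse + ADAPTER + forward[:6]

subreads = split_subreads(polymerase_read, ADAPTER)
print(subreads)
```

In this toy case one molecule yields one polymerase read and three subreads, the last of which is a partial pass; a Read of Insert would be a single consensus built from those passes.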
Blue Pippin™ System for Size Selection
[Figure: mouse lemur 20 kb libraries, size-selected vs. AMPure®; traces show input gDNA and size-selected DNA]
Most Uniform Coverage
“Pacific Biosciences coverage levels are the least biased”
• Ross et al. (2013) Characterizing and measuring bias in sequence data. Genome Biology, May 29;14(5):R51
Detection of DNA Base Modifications by SMRT Sequencing
Flusberg et al. (2010) Nature Methods 7: 461-465

Summary Sequence Performance
1. Long sequence reads
– Finish genomes, de novo assemblies
– Full-length cDNA sequencing
– Long-range haplotype phasing
2. High Consensus Accuracy
– >99.999% (QV50)
– Lack of systematic sequencing errors
3. Lack of sequence context bias
– GC content
– Low complexity sequence
4. Base modification detection
– Epigenome characterization
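
The pairing of >99.999% with QV50 follows from the standard Phred scale, QV = -10 · log10(error probability). A minimal sketch of the conversion (the function names are illustrative, not part of any PacBio tool):

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to accuracy (1 - error probability)."""
    return 1.0 - 10 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert an accuracy strictly below 1.0 to a Phred-scaled quality value."""
    return -10.0 * math.log10(1.0 - accuracy)


print(qv_to_accuracy(50))        # error probability of 1e-5, i.e. one error per 100,000 bases
print(accuracy_to_qv(0.99999))
```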
De Novo Assembly

Advantages of SMRT® Sequencing:
Impact of Long Read Lengths on De Novo Assembly
What can be achieved with infinite coverage given the read length?
Koren S. et al. (2013) Reducing assembly complexity of microbial genomes with single molecule sequencing. Genome Biology, 14:R101
Easy Bioinformatics Solution to Finish Genomes Using Only PacBio® Reads
Full push-button solution from beginning to end
• Longest reads for continuity
• All reads for high consensus accuracy
Hierarchical Genome Assembly Process (HGAP)
Watch SMRT® Analysis Tutorial: Bacterial Assembly and Epigenetic Analysis
Chin C.S. et al. (2013) Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods. Jun;10(6):563-9.
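
The "longest reads for continuity" idea can be sketched as choosing a read-length cutoff so that the reads at or above it (the seed reads) still cover the genome to some target depth. This is a simplified illustration, not the HGAP implementation; the function name and the default target coverage are assumptions.

```python
from typing import Iterable, Optional


def seed_length_cutoff(read_lengths: Iterable[int],
                       genome_size: int,
                       target_coverage: float = 30.0) -> Optional[int]:
    """Return the shortest read length that must be kept as a seed so that the
    longest reads together reach target_coverage of the genome, or None if the
    data set is too small to reach that coverage."""
    bases_needed = genome_size * target_coverage
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= bases_needed:
            return length
    return None


# Toy numbers: a 6 kb "genome" and five reads, asking for 3x seed coverage.
cutoff = seed_length_cutoff([10_000, 8_000, 6_000, 4_000, 2_000], 6_000, 3)
print(cutoff)
```

All remaining reads, including those below the cutoff, would then still contribute to the consensus step.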
SMRT® Sequencing:
Gold Standard for microbial De Novo Assembly
Progress of PacBio-Only De Novo Assembly
Timeline: 2013 to 2014
Genome            Size      Result
Bacteria          1-10 Mb   Finished genomes
Yeast             12 Mb     Resolve most chromosomes
Arabidopsis       120 Mb    Contig N50 7.1 Mb
Drosophila        170 Mb    Contig N50 4.5 Mb
Spinach           1 Gb      Contig N50 531 kb
Human (haploid)   3.2 Gb    Contig N50 4.4 Mb (max = 44 Mb)
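
Contig N50, the metric used throughout these comparisons, is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch with made-up contig lengths:

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of contig (or read) lengths."""
    if not lengths:
        raise ValueError("n50 requires at least one length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


toy_contigs = [80, 70, 50, 40, 30, 20]  # hypothetical lengths, total 290
print(n50(toy_contigs))
```

The same calculation applied to read lengths gives the read-length N50, that is, the length above which half of the sequenced bases fall.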
PacBio-Only Sequencing of Arabidopsis
• Original Col-0 strain assembly (Sanger + manual finishing)
• ~$70M, several years
• PacBio® data recently used to assemble Ler-0 strain

                         Short-read (Ler-1)*   PacBio reads (Ler-0)   Improvement
Est. Genome Size (Mb)    110.4                 124.6                  11.5%
Polished Contigs         4,662                 545                    8.5X
N50 Contig Length (Mb)   0.067                 6.36                   95X
Max Contig Length (Mb)   0.46                  13.21                  29X
*http://1001genomes.org/data/MPI/MPISchneeberger2011/releases/current/
Read Blog Entry Download Arabidopsis

SNP Discovery with PacBio® Assemblies
Discovery of single nucleotide polymorphisms by PacBio assemblies
Mapping of ILMN PE or PacBio Assembly to TAIR 10
        ILMN PE only   Shared (% of ILMN PE / % of PacBio)   PacBio Assembly only
Ler-0   27,106         509,836 (95% / 68%)                   238,637
Cvi     55,947         685,104 (92% / 72%)                   271,335
Mapping of ILMN PE to PacBio Assembly
• Ler-0 PE to Ler-0 Assembly: 885 homozygous SNPs
• Cvi PE to Cvi Assembly: 838 homozygous SNPs
• SNP frequency: 7.5 x 10^-6
• These SNPs are highly enriched in the pericentromere and are associated with aberrantly high coverage
[Figure: Called SNPs between Cvi and Col]
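
The percentage pairs on this slide (95%/68% and 92%/72%) can be reproduced if the largest count in each comparison is read as the SNPs called by both methods and the two smaller counts as the Illumina-only and PacBio-only calls. That reading is an assumption inferred from the numbers, sketched here:

```python
from typing import Tuple


def overlap_fractions(shared: int, only_a: int, only_b: int) -> Tuple[float, float]:
    """Fraction of call set A and of call set B that is shared by both."""
    return shared / (shared + only_a), shared / (shared + only_b)


# Counts from the slide; assignment to shared / Illumina-only / PacBio-only is assumed.
ler = overlap_fractions(shared=509_836, only_a=27_106, only_b=238_637)
cvi = overlap_fractions(shared=685_104, only_a=55_947, only_b=271_335)

for name, (frac_illumina, frac_pacbio) in (("Ler-0", ler), ("Cvi", cvi)):
    print(f"{name}: {frac_illumina:.0%} of Illumina calls, {frac_pacbio:.0%} of PacBio calls")
```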
Watch Arabidopsis Genome Recording Other PAG XXII Recordings

SNP Discovery with PacBio® Assemblies
PacBio assembly identifies SNPs in Illumina low-coverage (unmappable) regions
[Figure: Called SNPs between Cvi and Col; legend: Both, Illumina only, PacBio only]
Analysis by Jason Chin
Watch Arabidopsis Genome Recording Other PAG XXII Recordings

Assembling Rice Genomes
• Watch Richard McCombie's 2014 AGBT presentation
PacBio-Only Sequencing of a Spinach Genome (980 Mb)
Watch Spinach Genome Recording Other PAG XXII Recordings

Long-Read Shotgun Human Genome Data Release
• 54x coverage of CHM1 cell line
• Avg SMRT® Cell throughput: 608 Mb
• Avg DNA insert length: 7,680 bp
• Half of sequenced bases in reads greater than: 10,739 bp
• Longest DNA insert sequenced: 42,774 bp
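
As a back-of-the-envelope check on the scale of this data set, coverage, genome size, and per-cell throughput can be combined to estimate the number of SMRT Cells involved. The 3.2 Gb genome size is taken from the haploid human figure quoted earlier in this deck; the actual number of cells in the release is not stated here, so the result is only an estimate.

```python
import math


def smrt_cells_needed(coverage: float, genome_size_mb: float, mb_per_cell: float) -> int:
    """Estimate how many SMRT Cells are needed to reach a given fold coverage."""
    return math.ceil(coverage * genome_size_mb / mb_per_cell)


# 54x coverage of a 3,200 Mb genome at an average of 608 Mb per SMRT Cell.
estimated_cells = smrt_cells_needed(coverage=54, genome_size_mb=3_200, mb_per_cell=608)
print(estimated_cells)
```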
Read Blog Post
Download Dataset
Human Genome De Novo Assemblies Comparison
[Figure: bar chart of contig N50 (kb) by assembly and year]
Assembly         Year   Technology             Assembly method            # of library types   Total assembly size (Gb)   Contig N50 (kb)
HuRef (Venter)   2007   ABI 3730               Celera Assembler           4                    2.78                       107
BGI YH           2009   Illumina GA            SOAP de novo               5                    2.46                       7.4
KB1              2010   454 GS FLX Titanium    Newbler                    2                    2.79                       5.5
NA12878          2010   Illumina GA            ALLPATHS-LG                5                    2.82                       24
RP11_0.7         2013   454 GS, HiSeq, MiSeq   Newbler                    3                    2.81                       127
CHM1             2013   HiSeq, BAC clones      Reference Guided           NA                   2.83                       144
CHM1             2014   PacBio RS II           FALCON, Celera Assembler   1                    3.25                       4378
Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/early/2010/12/20/1017351108.abstract Table3); CHM1 (http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/)
Comparison of Human CHM1 Assemblies
[Figure: 2014 PacBio® de novo assembly (44 Mb contig) compared with the 2013 reference-guided short-read assembly with BACs; gaps and the MHC region are marked]
The Next Challenge: Assembling Diploid Genomes
Developing bioinformatics and visualization tools to resolve diploid genomes
Early assembly result for the Ler-0 + Col-0 “synthetic” diploid
Watch Jason Chin’s 2014 AGBT presentation “String Graph Assembly for Diploid Genomes with Long Reads”
Benefits of PacBio® Sequencing for Large Genomes
• PacBio data complements short reads to improve new and existing de novo assemblies
• Improve N50 contig length even with modest 5x coverage
• Scaffold PacBio long reads to set framework for genome completion
• Resolve troublesome gaps with low-complexity and repetitive genomic regions
• Catalog transposable elements
• Conduct gene-specific surveys
PacBio® De Novo Assembly Homepage

PacBio® Isoform Sequencing of Full-length Transcripts

Transcript Diversity
Current State of Transcript Assembly
“The way we do RNA-seq now is… you take the transcriptome, you blow it up into pieces and then you try to figure out how they all go back together again… If you think about it, it’s kind of a crazy way to do things”
Michael Snyder
Professor and Chair of Genetics
Stanford University
Tal Nawy, End to end RNA Sequencing, Nature Methods, v10, n10, Dec. 2013, p1144–1145
Ian Korf (2013) Genomics: the state of the art in RNA-seq analysis, Nature Methods, Nov 26;10(12):1165-6. doi: 10.1038/nmeth.2735.
PacBio’s Iso-Seq™ Method for High-quality, Full-length Transcripts
Experimental Pipeline
1. PolyA mRNA
2. cDNA synthesis with adapters
3. Size partitioning & PCR amplification
4. SMRTbell™ ligation
5. PacBio® RS II sequencing
SampleNet: Iso-Seq Method with Clontech cDNA Synthesis Kit
Read structure: 5’ primer, coding sequence, polyA tail, 3’ primer. Raw reads carry SMRT adapters on both ends; Reads of Insert have the adapters removed.
Informatics Pipeline
6. PacBio raw sequence reads (remove adapters, remove artifacts)
7. Clean sequence reads (reads clustering)
8. Isoform clusters (consensus calling)
9. Nonredundant transcript isoforms (quality filtering)
10. Final isoforms (map to reference genome)
Result: evidence-based gene models
DevNet: Iso-Seq wiki page
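
One step implied by the read structure in this pipeline (5’ primer, coding sequence, polyA tail, 3’ primer) is deciding whether a cleaned read spans a whole transcript. The sketch below is a simplified illustration, not the Iso-Seq implementation: the primer sequences are made up, matching is exact, and only one orientation is checked, whereas real reads can be in either orientation and contain errors.

```python
import re


def is_full_length(read: str, five_prime: str, three_prime: str,
                   min_poly_a: int = 8) -> bool:
    """Return True if the read starts with the 5' primer, ends with the
    3' primer, and has a polyA tail directly before the 3' primer."""
    if not read.startswith(five_prime) or not read.endswith(three_prime):
        return False
    # Strip both primers and look for a run of A's at the end of what remains.
    body = read[len(five_prime):len(read) - len(three_prime)]
    return re.search(r"A{%d,}$" % min_poly_a, body) is not None


FIVE_PRIME = "ACGTACGT"    # hypothetical primer sequences for illustration
THREE_PRIME = "TGCATGCA"

full_read = FIVE_PRIME + "GATTACAGGC" + "A" * 12 + THREE_PRIME
truncated_read = FIVE_PRIME + "GATTACAGGC"

print(is_full_length(full_read, FIVE_PRIME, THREE_PRIME))       # full-length candidate
print(is_full_length(truncated_read, FIVE_PRIME, THREE_PRIME))  # missing tail and 3' primer
```

Reads passing such a check are the natural input for clustering and consensus calling, since each one already represents an entire transcript and needs no assembly.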

No Assembly required
Multiple isoforms observed at a single locus
[Figure panels: Rat heart, Rat lung]
Tseng, PAG 2014, “Isoform Sequencing: Unveiling the Complex Landscape of the Eukaryotic Transcriptome on the PacBio® RS II” (poster)
“Gene Identification, Even in Well-Characterized Human Cell Lines and Tissues, is Likely Far From Complete”
8,048 RefSeq-annotated, full-length isoforms and 5,459 predicted isoforms
“Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified”
Au et al. (2013) Characterization of the human ESC transcriptome by hybrid sequencing. PNAS doi: 10.1073/pnas.1320101110.
ABRF NGS RNA-Seq Comparative Study:
Iso-Seq™ Application provides Most Uniform 5’ to 3’ Coverage
Splice Landscape of Neurexin 1α
[Figure: Nrxn1α domain structure; splice isoform abundance (2,574 full-length sequence reads); Nrxn1α mRNAs: 247 unique alternatively-spliced isoforms; 6 SMRT® Cells]
Exons
• green – present
• white – absent
Treutlein et al. (2014) Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. PNAS. doi:10.1073/pnas.1403244111
PacBio® Sequences Used for Gene Model Validation in Lettuce
[Figure: gene model confidence without PacBio reads vs. including PacBio reads; additional ~5000 gene models validated]
PAG 2014, Marilena Christopoulou “Targeted transcriptome analysis using PacBio sequencing to dissect multi-gene families encoding NBS-LRR resistance proteins in lettuce”
PacBio® Iso-Seq Data Used to Confirm Predicted Scaffolds in Norway Spruce Genome
14 SMRT® Cells of PacBio data using early chemistry & protocols
PAG 2014: Yao-Cheng Lin “PacBio cDNA sequencing of Norway spruce”

Selection of Additional Customer References/Publications

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, and Iso-Seq are trademarks of Pacific
Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.
