
Fundamentals and Applications of Single Molecule, Real-Time (SMRT®) Sequencing

CAT-AgroFood Plant Research International Workshop for PacBio Sequencing

March 26, 2014
Dr. Christoph König

FIND MEANING IN COMPLEXITY


Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. Celera is a trademark of Celera Corporation; and HiSeq and MiSeq are trademarks of Illumina, Inc.
© Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.
© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved.
Single-Molecule, Real-Time DNA Sequencing (SMRT) Is:

• DNA Polymerase
• ZMW Confinement
• Phospholinked Nucleotides


PacBio® RS II Typical Performance
Read Definitions in RS System & SMRT® Analysis v2.0

SMRTbell™ Template

Polymerase Read
Definition:
• Formerly called “read”
• With adapters
• 1 molecule, 1 pol. read
Uses:
• QC of instrument run

Subreads
Definition:
• Adapters removed
• 1 pass
• 1 molecule, 1+ subread
Uses:
• Applications such as assembly and base modification

Read (of Insert)
Definition:
• The highest quality single sequence for an insert
• 1+ passes including partial passes
• 1 molecule, 1 read
Uses:
• Insert size distribution
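
The relationship between a polymerase read and its subreads can be illustrated with a toy example. This is only a sketch: the adapter and insert sequences are hypothetical placeholders, and exact string matching stands in for the error-tolerant adapter detection that real reads would need.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def split_into_subreads(polymerase_read: str, adapter: str) -> list[str]:
    """Cut a polymerase read at each exact adapter match and drop empty pieces."""
    return [piece for piece in polymerase_read.split(adapter) if piece]


ADAPTER = "ATCTCTCTC"      # hypothetical placeholder, not a real SMRTbell adapter
INSERT = "GATTACAGATTACA"  # toy insert sequence

# One molecule read around the circular template:
# forward pass, adapter, reverse pass, adapter, partial third pass.
polymerase_read = (
    INSERT + ADAPTER + reverse_complement(INSERT) + ADAPTER + INSERT[:6]
)

subreads = split_into_subreads(polymerase_read, ADAPTER)
print(len(polymerase_read), [len(s) for s in subreads])
```

One polymerase read yields several subreads here (two full passes and one partial pass), which matches the "1 molecule, 1+ subread" definition.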
Blue Pippin™ System for Size Selection

Figure: size-selected Mouse Lemur 20 kb library compared with the 20 kb AMPure® Mouse Lemur library; traces show input gDNA and size-selected DNA.
Most Uniform Coverage

“Pacific Biosciences coverage levels are the least biased”

• Ross et al. (2013) Characterizing and measuring bias in sequence data. Genome Biology, May 29;14(5):R51
Detection of DNA Base Modifications by SMRT Sequencing

Flusberg et al. (2010) Nature Methods 7: 461-465


Summary Sequence Performance

1. Long sequence reads

– Finish genomes, de novo assemblies

– Full-length cDNA sequencing

– Long-range haplotype phasing

2. High Consensus Accuracy

– >99.999% (QV50)

– Lack of systematic sequencing errors

3. Lack of sequence context bias

– GC content

– Low complexity sequence

4. Base modification detection

– Epigenome characterization
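
The pairing of ">99.999%" with "QV50" follows from the standard Phred scale, QV = -10 · log10(error probability). A minimal sketch of the conversion (the function names are illustrative, not taken from any PacBio tool):

```python
import math


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a per-base accuracy (0 < accuracy < 1) to a Phred-scaled QV."""
    error_probability = 1.0 - accuracy
    return -10.0 * math.log10(error_probability)


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled QV back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


print(round(accuracy_to_qv(0.99999), 3))
print(qv_to_accuracy(50))
```

On this scale QV50 corresponds to about one error per 100,000 bases.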
De Novo Assembly

Advantages of SMRT® Sequencing:
Impact of Long Read Lengths on De Novo Assembly

What can be achieved with infinite coverage given the read length?

PacBio

Koren S. et al. (2013) Reducing assembly complexity of microbial genomes with single molecule sequencing. Genome Biology, 14:R101
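
The question on this slide rests on a simple idea: with unlimited coverage, a repeat stops breaking the assembly once reads are long enough to span it and anchor in unique sequence on both sides. The sketch below illustrates that idea only. The repeat lengths and the `anchor` parameter are hypothetical, and this is not the analysis from the cited paper.

```python
def unresolved_repeats(repeat_lengths, read_length, anchor=0):
    """Return the repeats that a read of the given length cannot span.

    A repeat counts as spanned when the read covers the whole repeat plus
    `anchor` bases of unique sequence on each side.
    """
    return [r for r in repeat_lengths if r + 2 * anchor >= read_length]


# Hypothetical repeat lengths (bp) in a small genome.
repeats = [300, 1500, 5500]

print(unresolved_repeats(repeats, read_length=150))
print(unresolved_repeats(repeats, read_length=7000, anchor=500))
```

In this toy case 150 bp reads leave every repeat unresolved, while 7 kb reads span all of them.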
Easy Bioinformatics Solution to Finish Genomes Using Only PacBio® Reads

Full push-button solution from beginning to end
• Longest reads for continuity
• All reads for high consensus accuracy

Hierarchical Genome Assembly Process (HGAP)

Watch SMRT® Analysis Tutorial: Bacterial Assembly and Epigenetic Analysis
Chin CS, et al. (2013) Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods. Jun;10(6):563-9.
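
"Longest reads for continuity" implies picking a subset of the longest reads as seeds. One plausible way to do that is a length cutoff chosen so the seeds alone reach a target coverage. The slide does not give the rule HGAP actually uses, so the function name, target coverage, and read lengths below are assumptions for illustration.

```python
def seed_length_cutoff(read_lengths, genome_size, seed_coverage):
    """Return the shortest read length still needed to reach the target coverage.

    Reads are taken longest-first until their summed length reaches
    genome_size * seed_coverage. If the data cannot reach the target,
    every read is a seed and the shortest length is returned.
    """
    if not read_lengths:
        raise ValueError("read_lengths must not be empty")
    target_bases = genome_size * seed_coverage
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    return min(read_lengths)


# Hypothetical read lengths (bp) for a toy 10 kb genome, with 2x seed coverage.
lengths = [9000, 8000, 7000, 3000, 2000, 1000]
print(seed_length_cutoff(lengths, genome_size=10_000, seed_coverage=2))
```

Reads at or above the cutoff would act as seeds, while all reads, including the shorter ones, still contribute to consensus accuracy.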
SMRT® Sequencing:
Gold Standard for microbial De Novo Assembly
Progress of PacBio-Only De Novo Assembly

2013 to 2014

• Bacteria (1-10 Mb): finished genomes
• Yeast (12 Mb): resolve most chromosomes
• Arabidopsis (120 Mb): contig N50 7.1 Mb
• Drosophila (170 Mb): contig N50 4.5 Mb
• Spinach (1 Gb): contig N50 531 kb
• Human, haploid (3.2 Gb): contig N50 4.4 Mb, max = 44 Mb
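
Contig N50, the metric used throughout these slides, is the contig length at which contigs of that length or longer hold at least half of the assembled bases. A minimal sketch with made-up contig lengths:

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or read) lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where the running sum reaches half the total.
        if running * 2 >= total:
            return length


# Hypothetical contig lengths in Mb.
print(n50([10, 8, 6, 4, 2]))
```

The same function applies to read lengths, where N50 is the length such that half of the sequenced bases are in reads at least that long.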
PacBio-Only Sequencing of Arabidopsis

• Original Col-0 strain assembly (Sanger + manual finishing)
  – ~$70M, several years
• PacBio® data recently used to assemble Ler-0 strain

                          Short-read (Ler 1)*   PacBio reads (Ler-0)   Improvement
Est. Genome Size (Mb)     110.4                 124.6                  11.5%
Polished Contigs          4,662                 545                    8.5X
N50 Contig Length (Mb)    0.067                 6.36                   95X
Max Contig Length (Mb)    0.46                  13.21                  29X

*http://1001genomes.org/data/MPI/MPISchneeberger2011/releases/current/



SNP Discovery with PacBio® Assemblies

Discovery of single nucleotide polymorphisms by PacBio assemblies

Mapping of ILMN PE or PacBio Assembly to TAIR 10

• Ler0 ILMN PE vs. PacBio Ler0 Assembly: 27,106 ILMN PE only; 509,836 shared (95%/68%); 238,637 PacBio Assembly only
• Cvi ILMN PE vs. PacBio Cvi Assembly (called SNPs between Cvi and Col): 55,947 ILMN PE only; 685,104 shared (92%/72%); 271,335 PacBio Assembly only

Mapping of ILMN PE to PacBio Assembly

• Ler0 PE – Ler0 Assembly: 885 homozygous SNPs
• Cvi PE – Cvi Assembly: 838 homozygous SNPs
• SNP frequency 7.5 × 10⁻⁶
• These SNPs are highly enriched in the peri-centromere and associate with aberrantly high coverage
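
The percentage pairs on this slide (95%/68% and 92%/72%) can be reproduced from the three counts given for each comparison. This assumes the smaller outer count is the Illumina-only set, the middle count is the shared set, and the larger outer count is the PacBio-only set. That reading of the flattened Venn diagram is an assumption, though the arithmetic agrees with it.

```python
def shared_percentages(only_a: int, shared: int, only_b: int) -> tuple[int, int]:
    """Return the shared calls as a rounded percentage of each call set."""
    pct_of_a = 100 * shared / (only_a + shared)
    pct_of_b = 100 * shared / (only_b + shared)
    return round(pct_of_a), round(pct_of_b)


# Counts from the slide: (Illumina PE only, shared, PacBio assembly only).
ler0 = shared_percentages(27_106, 509_836, 238_637)
cvi = shared_percentages(55_947, 685_104, 271_335)

print("Ler0:", ler0)
print("Cvi:", cvi)
```

Under that reading, the first figure in each pair is the share of Illumina-called SNPs also found in the PacBio assembly, and the second is the share of PacBio-called SNPs also found by Illumina.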


SNP Discovery with PacBio® Assemblies

PacBio assembly identifies SNPs in Illumina low-coverage (unmappable) regions

Called SNPs between Cvi and Col
• Both
• Illumina only
• PacBio only

Analysis by Jason Chin


Assembling Rice Genomes

• Watch Richard McCombie's 2014 AGBT presentation
PacBio-Only Sequencing of a Spinach Genome (980 Mb)



Long-Read Shotgun Human Genome Data Release

• 54x coverage of CHMT1 cell line
• Avg SMRT® Cell throughput: 608 Mb
• Avg DNA insert length: 7,680 bp
• Half of sequenced bases in reads greater than: 10,739 bp
• Longest DNA insert sequenced: 42,774 bp
Human Genome De Novo Assemblies Comparison

Figure: contig N50 (kb) of human genome de novo assemblies, 2007 to 2014.

Assembly        Year  Contig N50 (kb)  Technology              Assembly method           # of library types  Total assembly size (Gb)
HuRef (Venter)  2007  107              ABI 3730                Celera Assembler          4                   2.78
BGI YH          2009  7.4              Illumina GA             SOAP de novo              5                   2.46
KB1             2010  5.5              454 GS FLX Titanium     Newbler                   2                   2.79
NA12878         2010  24               Illumina GA             ALLPATHS-LG               5                   2.82
RP11_0.7        2013  127              454 GS, HiSeq, MiSeq    Newbler                   3                   2.81
CHM1            2013  144              HiSeq, BAC clones       Reference Guided          NA                  2.83
CHM1            2014  4378             PacBio RS II            FALCON, Celera Assembler  1                   3.25

Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/early/2010/12/20/1017351108.abstract Table3); CHM1 (http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2/)
Comparison of Human CHM1 Assemblies

Figure labels: 2014 PacBio® de novo (44 Mb contig); 2013 reference-guided short-read with BACs (gaps); MHC region
The Next Challenge: Assembling Diploid Genomes

Developing bioinformatics and visualization tools to resolve diploid genomes

Early assembly result for the Ler-0 + Col-0 “synthetic” diploid

Watch Jason Chin’s 2014 AGBT presentation “String Graph Assembly for Diploid Genomes with Long Reads”
Benefits of PacBio® Sequencing for Large Genomes

• PacBio data complements short reads to improve new and existing de novo assemblies

• Improve N50 contig length even with modest 5x coverage

• Scaffold PacBio long reads to set framework for genome completion

• Resolve troublesome gaps with low-complexity and repetitive genomic regions

• Catalog transposable elements

• Conduct gene-specific surveys


PacBio® Isoform Sequencing of Full-length Transcripts

Transcript Diversity
Current State of Transcript Assembly

“The way we do RNA-seq now is… you take the transcriptome, you blow it up into pieces and then you try to figure out how they all go back together again… If you think about it, it’s kind of a crazy way to do things”
Michael Snyder
Professor and Chair of Genetics
Stanford University
Tal Nawy, End to end RNA Sequencing, Nature Methods, v10, n12, Dec. 2013, p1144–1145

Ian Korf (2013) Genomics: the state of the art in RNA-seq analysis, Nature Methods, Nov 26;10(12):1165-6. doi: 10.1038/nmeth.2735.
PacBio’s Iso-Seq™ Method for High-quality, Full-length Transcripts
Experimental Pipeline
• PolyA mRNA
• cDNA synthesis with adapters
• Size partitioning & PCR amplification
• SMRTbell™ ligation
• PacBio® RS II sequencing

cDNA structure: 5’ primer, coding sequence, polyA tail, 3’ primer; SMRT adapters are added at both ends during SMRTbell ligation. Read types shown: Raw, Reads of Insert.

SampleNet: Iso-Seq Method with Clontech cDNA Synthesis Kit

Informatics Pipeline
Each product is followed by the step applied to it:
• PacBio raw sequence reads: remove adapters, remove artifacts
• Clean sequence reads: reads clustering
• Isoform clusters: consensus calling
• Nonredundant transcript isoforms: quality filtering
• Final isoforms: map to reference genome
• Evidence-based gene models
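
The cDNA layout on this slide (5’ primer, coding sequence, polyA tail, 3’ primer) suggests a simple check for whether a clean read is full length. The sketch below is a toy illustration, not the Iso-Seq software: the primer sequences are hypothetical placeholders, matching is exact, and only one strand orientation is checked.

```python
import re

FIVE_PRIME = "AAGCAGTGGT"   # hypothetical 5' primer, for illustration only
THREE_PRIME = "GTACTCTGCG"  # hypothetical 3' primer, for illustration only


def is_full_length(read: str, min_polya: int = 8) -> bool:
    """True if the read has the 5' primer, a polyA tail, and the 3' primer."""
    if not (read.startswith(FIVE_PRIME) and read.endswith(THREE_PRIME)):
        return False
    # Strip both primers; what remains should end in the polyA tail.
    body = read[len(FIVE_PRIME):len(read) - len(THREE_PRIME)]
    return re.search(r"A{%d,}$" % min_polya, body) is not None


coding = "ATGGCCTTGACCGGTTAG"  # toy coding sequence
full = FIVE_PRIME + coding + "A" * 12 + THREE_PRIME
no_tail = FIVE_PRIME + coding + THREE_PRIME
truncated = coding + "A" * 12 + THREE_PRIME

print(is_full_length(full), is_full_length(no_tail), is_full_length(truncated))
```

A production classifier would also need to tolerate sequencing errors and test the reverse-complement orientation.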

DevNet: Iso-Seq wiki page


No Assembly required

Multiple isoforms observed at a single locus

Rat heart    Rat lung

Tseng, PAG 2014, “Isoform Sequencing: Unveiling the Complex Landscape of the Eukaryotic Transcriptome on the PacBio® RS II” (poster)
“Gene Identification, Even in Well-Characterized Human
Cell Lines and Tissues, is Likely Far From Complete”

8,048 RefSeq-annotated, full-length isoforms and 5,459 predicted isoforms

“Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified”
Au et al. (2013) Characterization of the human ESC transcriptome by hybrid sequencing. PNAS doi: 10.1073/pnas.1320101110.
ABRF NGS RNA-Seq Comparative Study:
Iso-Seq™ Application provides Most Uniform 5’ to 3’ Coverage
Splice Landscape of Neurexin 1α

Figure panels:
• Nrxn1α domain structure
• Splice isoform abundance (2,574 full-length sequence reads)
• Nrxn1α mRNAs: 247 unique alternatively-spliced isoforms, 6 SMRT® Cells
• Exons: green – present, white – absent

Treutlein et al. (2014) Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. PNAS. doi:10.1073/pnas.1403244111
PacBio® Sequences Used for Gene Model Validation in Lettuce

Figure: gene model confidence without PacBio reads and including PacBio reads; an additional ~5000 gene models validated.

PAG 2014, Marilena Christopoulou “Targeted transcriptome analysis using PacBio sequencing to dissect multi-gene families encoding NBS-LRR resistance proteins in lettuce”
PacBio® Iso-Seq Data Used to Confirm Predicted Scaffolds in Norway Spruce Genome

14 SMRT® Cells of PacBio data using early chemistry & protocols

PAG 2014: Yao-Cheng Lin “PacBio cDNA sequencing of Norway spruce”


Selection of Additional Customer References/Publications



Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, and Iso-Seq are trademarks of Pacific
Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.
