Deep Sequencing: Introduction To Bioinformatics Seminar November 9th, 2009

Deep Sequencing
Introduction to Bioinformatics Seminar

November 9th, 2009
Angela Benton, Samuel Darko, Prakriti Mudvari
and Prisca Takundwa
History of Sequencing

"Sanger sequencing" developed by Fred Sanger et al.
in the mid-1970s

Uses dideoxynucleotides for "chain termination",
generating fragments of different lengths ending in
ddATP, ddGTP, ddCTP or ddTTP
http://openwetware.org/wiki/BE.109:Bio-material_engineering/Sequence_analysis
History of Sequencing Cont.
• A schematic of
Sanger sequencing
http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/
History of Sequencing Cont.

DNA fragments are separated by size by gel
electrophoresis

From the gel, the DNA sequence can be determined

Can produce reads 700–900 bp long (good),
but it's slow (bad)

Lots of other problems, including clone library
generation and low throughput

The Human Genome Project used Sanger
sequencing; completion took over 10 years
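The chain-termination read-out described on these slides can be sketched in a few lines of Python. This is a toy illustration rather than a model of the real chemistry: it assumes exactly one terminated fragment per position, and the function names and the example strand are made up for the sketch.

```python
def chain_termination_fragments(synthesized: str) -> list[tuple[int, str]]:
    """Return (fragment length, terminating base) for every prefix of the
    newly synthesized strand, as if a ddNTP had stopped synthesis there."""
    return [(i + 1, base) for i, base in enumerate(synthesized)]


def read_gel(fragments: list[tuple[int, str]]) -> str:
    """Read the sequence off the 'gel': shortest fragment first."""
    return "".join(base for _, base in sorted(fragments))


fragments = chain_termination_fragments("GATTACA")
# The fragments come out of the reaction in no particular order;
# separating them by size is what recovers the sequence.
scrambled = list(reversed(fragments))
sequence = read_gel(scrambled)
print(sequence)
```

Sorting by fragment length plays the role of the gel: the terminating base of each successively longer fragment spells out the strand.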
Next Generation Sequencers
• Next (or 2nd) generation sequencers came onto the scene in the mid-2000s
• General characteristics include:
– Amplification of genetic material by PCR
– Ligation of amplified material to a solid surface
– Sequence of the target genetic material is determined using sequencing-by-synthesis (using labelled nucleotides or pyrosequencing for detection) or sequencing-by-ligation
– Sequencing is done in a massively parallel fashion and sequence information is captured by a computer
Next Gen. Sequencers Cont.
Sequencing platform         Sequencing chemistry                                  Template amplification method       Read length   Sequencing throughput
ABI3730xl Genome Analyzer   Automated Sanger sequencing                           In vivo amplification via cloning   700–900 bp    0.03–0.07 Mb/h
Roche (454) FLX             Pyrosequencing on solid support                       Emulsion PCR                        200–300 bp    13 Mb/h
Illumina Genome Analyzer    Sequencing-by-synthesis with reversible terminators   Bridge PCR                          32–40 bp      25 Mb/h
ABI SOLiD                   Sequencing by ligation                                Emulsion PCR                        35 bp         21–28 Mb/h
HeliScope                   Sequencing-by-synthesis with virtual terminators      None (single molecule)              25–35 bp      83 Mb/h

[Figure: HeliScope sequencing-by-synthesis schematic, showing the base added in each cycle and the bases incorporated at each position of templates 1–3]
Provided to author courtesy of Helicos representative
• Sequencing-by-ligation on SOLiD
http://www.umcutrecht.nl/subsite/genetics/Research/PersonalGenomics.htm
Next Gen vs Sanger

Let’s think about the domesticated silkworm
genome

The reference genome is about 432 Mb in size

It was assembled from approximately 8.5-fold
coverage
Platform                    Sequencing speed   Time to sequence (days)
ABI3730xl Genome Analyzer   0.03–0.07 Mb/h     2185.7
Roche (454) FLX             13 Mb/h            11.8
Illumina Genome Analyzer    25 Mb/h            6.1
ABI SOLiD                   21–28 Mb/h         5.5
Helicos HeliScope           83 Mb/h            1.8
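The "time to sequence" figures can be reproduced with a short calculation. The sketch below assumes that the slide used the upper bound of each throughput range and a machine running 24 hours a day; that assumption is inferred from the fact that it reproduces the numbers, not stated on the slide. The function and variable names are illustrative.

```python
GENOME_MB = 432   # silkworm reference genome size, from the slide
COVERAGE = 8.5    # fold coverage, from the slide

# Upper-bound throughput in Mb/h for each platform (assumed to be the
# value used on the slide).
THROUGHPUT_MB_PER_H = {
    "ABI3730xl Genome Analyzer": 0.07,
    "Roche (454) FLX": 13,
    "Illumina Genome Analyzer": 25,
    "ABI SOLiD": 28,
    "Helicos HeliScope": 83,
}


def days_to_sequence(throughput_mb_per_h, genome_mb=GENOME_MB, coverage=COVERAGE):
    """Days of continuous running needed to reach the requested coverage."""
    total_mb = genome_mb * coverage          # 3672 Mb of raw sequence
    return total_mb / throughput_mb_per_h / 24


for platform, rate in THROUGHPUT_MB_PER_H.items():
    print(f"{platform}: {days_to_sequence(rate):.1f} days")
```

The calculation ignores library preparation, run set-up and failed runs, so it is a lower bound on real project time.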
Bioinformatics
• Because of the massively parallel nature of next gen sequencers, huge amounts of data are produced quickly, requiring terabytes of storage
• New bioinformatics tools were developed to utilize the huge number of much shorter reads (~35 bp vs ~800 bp)
– Bowtie - Ultrafast, memory-efficient short read aligner
– SOAPdenovo - Part of the SOAP suite, used to assemble a reference genome de novo
– TopHat - Fast splice junction mapper for RNA-Seq reads
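To give a sense of what "a huge number of much shorter reads" means, the sketch below counts how many reads are needed to reach the silkworm example's 8.5-fold coverage of a 432 Mb genome at ~35 bp versus ~800 bp. It assumes every read has exactly the stated length and contributes all of its bases; the function name is illustrative.

```python
GENOME_BP = 432_000_000   # silkworm genome size used earlier in the slides
COVERAGE = 8.5            # fold coverage used earlier in the slides


def reads_needed(read_length_bp: int,
                 genome_bp: int = GENOME_BP,
                 coverage: float = COVERAGE) -> int:
    """Number of fixed-length reads needed to reach the target coverage."""
    total_bases = round(genome_bp * coverage)
    return -(-total_bases // read_length_bp)   # ceiling division


short_reads = reads_needed(35)    # next-gen read length
long_reads = reads_needed(800)    # Sanger read length
print(f"35 bp reads:  {short_reads:,}")
print(f"800 bp reads: {long_reads:,}")
```

Roughly 105 million short reads replace about 4.6 million long ones, which is why alignment and assembly tools had to be redesigned around speed and memory use.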
Applications

Novel whole genome sequencing

The Sorcerer II Global Ocean Sampling Expedition:
Northwest Atlantic through Eastern Tropical Pacific

Whole genome resequencing

Complete Resequencing of 40 Genomes Reveals
Domestication Events and Genes in Silkworm (Bombyx)

RNA-Seq (transcriptomics)

A Global View of Gene Activity and Alternative Splicing by
Deep Sequencing of the Human Transcriptome
Prisca Takundwa
NEXT-GENERATION SEQUENCING:
APPLICATIONS
APPLICATIONS
• The range of potential applications for next-generation sequencing is enormous.
• Some examples that will be discussed include applications in:
– Cellular genomes using WGS
– Metagenomics
– Genomic medicine
– Other novel applications
Cellular Genomes
• The advent of automation in sequencing, initiated by Craig Venter et al., gave rise to sequencing beyond viruses and organelles.
• In 1995, Venter's group at TIGR reported the complete sequences of two bacteria, Haemophilus influenzae and Mycoplasma genitalium.
Cellular Genomes
• Significance:
– First glimpse of the complete instruction set for a living organism
– An approximation of the minimal set of genes required for cellular life
– Insight into the methods used to come up with these cellular genomes
Cellular Genomes
• Significance:
– Paved the way for other cellular genomes such as E. coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster
– Human Genome Project
– Next-generation appeal
Metagenomics
• Removes the need to culture organisms before sequencing
• Introduces diversity: includes all genes, and potentially all members, contributing to a given environment
• Typically uses the 16S rRNA gene to identify different species and strains
• Advantages:
– Closes the huge gap in sequence data for non-model species
– Many prokaryotes are human pathogens
Metagenomics
• Some examples
• Breitbart et al. showed that 200 liters of seawater contained >5000 different viruses. Subsequent studies found >1000 viral species in human stool, and the majority of the viruses in these studies were new species.
• Craig Venter's Global Ocean Voyage
Genomic Medicine
• Sequencing and how it lends itself to medicine
• Implications in diagnosis, treatment and prevention
• Personalized medicine
• $1000 genome
• Some examples include cancer and HIV applications
Other Novel Applications
• Resequencing
• Plants – Sugar beet and Tropical Evergreen
Fagaceae
• Junk DNA
• Drug discovery
Transcriptomics
Angela Benton
Background
• Transcriptome – the complete set of coding and non-coding RNA molecules in a cell at a particular time
– Varies between cell types
• Transcriptomics – the study of the transcripts in a cell, cell type, organism, etc.
Candidate Gene Analysis
1. Northern blot analysis
– Separation of RNA molecules by size
– Hybridization of a complementary, radioactively labeled probe
– Detection method
2. Reverse transcriptase PCR (RT-PCR)
– RNA molecules reverse transcribed into cDNA
– PCR amplified
– Quantification method
Microarray Technology
• High-throughput gene expression profiling
• Hybridization of labeled cDNAs to an array of complementary DNA probes
• Measurement of expression levels based on hybridization intensity
Sequencing-Based Approaches
1. Full-length cDNA (FLcDNA) sequencing
– Complete sequencing of cDNA clone
2. Expressed sequence tag (EST) sequencing
– Single-pass sequencing of cDNA clone
3. Serial Analysis of Gene Expression (SAGE)
– Short sequence tags at 3' end of transcript
– Tags concatenated and sequenced
RNA-Seq
• Alternative to Sanger sequencing
• RNA molecules converted into a library of cDNA fragments
– Adaptors attached to one or both ends
• Short sequence reads obtained
• Reads aligned to a reference genome and classified as:
1. Exonic reads
2. Junction reads
3. Poly(A) end reads
• Can also be used to assemble de novo sequences
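The three read classes can be illustrated with a toy classifier. The sketch assumes each read has already been aligned and is described by its aligned genomic blocks plus any unaligned 3' tail; the `AlignedRead` type, the coordinates and the `min_poly_a` threshold are all hypothetical and not taken from any particular aligner.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    blocks: list[tuple[int, int]]   # aligned genomic intervals, half-open [start, end)
    unaligned_tail: str = ""        # 3' bases that did not align to the genome


def inside_an_exon(block, exons):
    """True if the aligned block lies entirely within one annotated exon."""
    start, end = block
    return any(ex_start <= start and end <= ex_end for ex_start, ex_end in exons)


def classify(read, exons, min_poly_a=5):
    """Label a read as 'poly-A end', 'junction', 'exonic' or 'other'."""
    tail = read.unaligned_tail
    if len(tail) >= min_poly_a and set(tail) == {"A"}:
        return "poly-A end"
    if not all(inside_an_exon(b, exons) for b in read.blocks):
        return "other"
    # A read split across more than one exon spans a splice junction.
    return "junction" if len(read.blocks) > 1 else "exonic"


exons = [(100, 200), (300, 400)]            # a two-exon gene (made-up coordinates)
reads = [
    AlignedRead([(120, 155)]),              # falls within exon 1
    AlignedRead([(180, 200), (300, 315)]),  # split across exon 1 and exon 2
    AlignedRead([(370, 400)], "AAAAAAAA"),  # runs into the poly-A tail
    AlignedRead([(210, 245)]),              # intronic, not in any exon
]
labels = [classify(r, exons) for r in reads]
print(labels)
```

Counting reads per class and per gene is the starting point for expression estimates and for detecting splice variants.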
Next Generation Sequencing
Applications
Protein-coding gene annotation
– Transcriptome sequences can be aligned:
  – To the genome of the same species
  – To the genome of a related species
– Discovery of novel exons and introns
– Long read lengths – de novo analyses
– Short read lengths – novel splicing events
Applications
Gene expression profiling
– SAGE method
– 5’-RATE method (454 sequencing)
– 3’-UTR method (454 sequencing)
Applications
Noncoding RNA (ncRNA) discovery
– ncRNA not translated into protein product
– Role in regulation of development and cell fate
determination
– Three kinds:
1. Micro RNAs (miRNAs)
2. Small interfering RNAs (siRNAs)
3. Piwi-interacting RNAs (piRNAs)
Applications
Transcript rearrangement discovery
– Genome rearrangements common in human
cancers
– Includes:
1. Translocations
2. Inversions
3. Indels
4. Copy number variants
– Paired-end sequencing
– Infers presence of rearrangement
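How paired-end reads point to a rearrangement can be sketched as a simple rule set: a pair is suspicious when its two ends map to different chromosomes, in an unexpected orientation, or at an unexpected distance. The `ReadPair` type, the expected insert size and the tolerance below are hypothetical values chosen for illustration, and real pipelines require several supporting pairs before reporting an event.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str   # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair, expected_insert=300, tolerance=100):
    """Label a mapped read pair by the rearrangement it may indicate."""
    if pair.chrom1 != pair.chrom2:
        return "possible translocation"
    if pair.strand1 == pair.strand2:
        # Ends of a normal pair map to opposite strands.
        return "possible inversion"
    distance = abs(pair.pos2 - pair.pos1)
    if distance > expected_insert + tolerance:
        return "possible deletion"    # ends map further apart than the fragment size
    if distance < expected_insert - tolerance:
        return "possible insertion"   # ends map closer together than the fragment size
    return "concordant"


pairs = [
    ReadPair("chr1", 1000, "+", "chr1", 1290, "-"),
    ReadPair("chr1", 1000, "+", "chr9", 5000, "-"),
    ReadPair("chr1", 1000, "+", "chr1", 1310, "+"),
    ReadPair("chr1", 1000, "+", "chr1", 9000, "-"),
    ReadPair("chr1", 1000, "+", "chr1", 1050, "-"),
]
labels = [classify_pair(p) for p in pairs]
print(labels)
```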
Bioinformatic Implications
• Large amounts of data generated
• Tools are needed to aid in:
– Storage
– Retrieval
– Processing
– Interpretation
– Integration
Bioinformatics of Deep Sequencing
Prakriti Mudvari
Bioinformatics of Deep
Sequencing
http://www.cbcb.umd.edu/research/viewer.jpg
The Basics.
http://www.k.u-tokyo.ac.jp/pros-e/person/shinichi_morishita/shinichi_morishita.htm
Creating a Paired End Tag
http://media.wiley.com/wires/WSBM/WSBM40/nfig001.jpg
Paired End vs. Unpaired Reads
• Millions of reads are generated.
• Repetitive regions within the genome cause reads to be mapped to multiple locations.
• A polymorphism in a read can cause it to be mapped to the wrong location.
• Discarding ambiguous reads can reduce coverage.
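The ambiguity caused by repeats, and the way a paired read resolves it, can be shown with exact string matching on a made-up reference. The sketch makes two simplifying assumptions: both reads are given in reference orientation, and only exact matches are considered. The sequences, the gap bounds and the function names are illustrative.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every position at which the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def place_pair(reference, read1, read2, min_gap, max_gap):
    """Keep only placements where read2 starts min_gap..max_gap bases after read1."""
    return [
        (p1, p2)
        for p1 in find_all(reference, read1)
        for p2 in find_all(reference, read2)
        if min_gap <= p2 - p1 <= max_gap
    ]


repeat = "ACGTACGT"
reference = "TTGCA" + repeat + "GGCCTTAAGG" + repeat + "CATGA"

single_hits = find_all(reference, repeat)                 # ambiguous: two locations
paired_hits = place_pair(reference, repeat, "CATGA", 5, 12)  # mate anchors the repeat
print(single_hits, paired_hits)
```

On its own the repeat read maps to two positions; the uniquely mapping mate at the expected distance leaves a single consistent placement, so the read does not have to be discarded.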
Comparison of Output
Platform                   Read length   Sequencing throughput
ABI 3730 Genome Analyzer   700–900 bp    0.03–0.07 Mb/h
454                        200–300 bp    13 Mb/h
Illumina Genome Analyzer   32–40 bp      25 Mb/h
ABI SOLiD                  35 bp         21–28 Mb/h
HeliScope                  25–35 bp      83 Mb/h
Challenges
• Quality of data
• Storage
• Cross Platform Analysis
• Data Annotation
• Assembly
• SNP/Mutation Detection
Bioinformatics Tools
• Alignment of reads to reference genome
• Assembly of de novo sequence
• Quality Control & Base Calling
• Polymorphism detection
• Genome browsing and annotation
Alignment of reads
• Reads generated from sequencing are mapped to a reference genome.
• Conventional tools like BLAST or BLAT do not work well with short sequence reads.
• Existing alignment algorithms have been modified to handle short reads.
Alignment Tools
• Cross_match
• ELAND
• Exonerate
• MAQ
• Mosaik
• SHRiMP
• SOAP
• Zoom!
Short Oligonucleotide Alignment Program
(SOAP)
• Maps short oligonucleotides to a reference sequence in a gapped or ungapped alignment.
• Can be used for single as well as paired-end alignments.
• Allows at most two mismatches per read, or one continuous gap of size 1–3 bp, when aligning. No mismatches are allowed in the flanking region.
• The best hit is the one with the fewest mismatches or the smallest gap.
• Iteratively trims several base pairs from the 3' end, which has the highest number of sequencing errors, and realigns.
• Uses a seed and hash-lookup algorithm to accelerate alignment.
• Loads the reference sequence into memory instead of the reads.
• Written in C++.
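The seed and hash-lookup idea can be sketched as follows. This is not SOAP's implementation: it is a minimal ungapped version based on the pigeonhole principle (a read with at most two mismatches, cut into three non-overlapping seeds, must contain at least one seed that matches exactly). The function names, the seed length and the example sequences are assumptions made for the illustration.

```python
from collections import defaultdict


def build_index(reference: str, seed_len: int) -> dict[str, list[int]]:
    """Hash every seed-length substring of the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(reference) - seed_len + 1):
        index[reference[i:i + seed_len]].append(i)
    return index


def align(read, reference, index, seed_len, max_mismatches=2):
    """Return (position, mismatches) hits, best hit first (ungapped only)."""
    n_seeds = max_mismatches + 1
    if len(read) < n_seeds * seed_len:
        raise ValueError("read too short for the chosen seed length")
    hits = {}
    for k in range(n_seeds):
        offset = k * seed_len
        seed = read[offset:offset + seed_len]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    # Fewest mismatches first, mirroring the "best hit" rule above.
    return sorted(hits.items(), key=lambda hit: (hit[1], hit[0]))


reference = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCC"
index = build_index(reference, seed_len=4)
read = "TACCGATAGCGA"   # reference[8:20] with two substitutions
print(align(read, reference, index, seed_len=4))
```

Only the middle seed of the example read matches the index exactly; the full-length comparison around that seed then confirms a hit at position 8 with two mismatches.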
Assembly
• De novo sequencing involves assembling overlapping reads to form contiguous sequences (contigs) of DNA.
• Done in cases where there is no genomic information available.
Assembly
• ABySS
• ALLPATHS
• Edena
• Euler-SR
• SHARCGS
• SHRAP
• SSAKE
• Velvet
Assembly By Short Sequences
(ABySS)
• Originally developed for de novo assembly of large genomes using short reads.
• Uses a distributed representation of a de Bruijn graph, which allows the algorithm to run in parallel across a network of computers.
• Assembly is done in two steps.
• First, all possible substrings of a specific length (k-mers) are generated from the sequence reads. The k-mer data set is then processed to remove errors, and contigs are built without using paired-end information.
• Mate pair information is then used to extend the contigs.
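The first assembly step can be sketched on a single machine. This is a minimal de Bruijn illustration and not ABySS itself: it is not distributed, it ignores reverse complements, and it removes errors with a simple k-mer count threshold, which is an assumption of the sketch. The function names, `k` and the example reads are made up.

```python
from collections import Counter


def build_graph(reads, k, min_count=2):
    """Build a de Bruijn graph from k-mers seen at least min_count times."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    succ, pred = {}, {}
    for kmer, n in counts.items():
        if n >= min_count:   # rare k-mers are treated as sequencing errors
            succ.setdefault(kmer[:-1], set()).add(kmer[1:])
            pred.setdefault(kmer[1:], set()).add(kmer[:-1])
    return succ, pred


def build_contigs(succ, pred, k):
    """Walk unambiguous paths of the graph and return them as contigs."""
    def only_successor(node):
        nxt = succ.get(node, set())
        if len(nxt) != 1:
            return None
        (n,) = nxt
        return n if len(pred.get(n, set())) == 1 else None

    def continues_a_path(node):
        prv = pred.get(node, set())
        if len(prv) != 1:
            return False
        (p,) = prv
        return len(succ.get(p, set())) == 1

    contigs = []
    for node in sorted(set(succ) | set(pred)):
        if continues_a_path(node):
            continue              # interior node, reached from its path's start
        contig, current = node, only_successor(node)
        while current is not None:
            contig += current[-1]
            current = only_successor(current)
        if len(contig) >= k:      # keep contigs that contain at least one k-mer
            contigs.append(contig)
    return contigs


genome = "ATGGCGTGCAATCCGA"
reads = [genome[0:10], genome[0:10], genome[4:16], genome[4:16],
         "GGCGAGCA"]   # last read carries one sequencing error
contigs = build_contigs(*build_graph(reads, k=4), k=4)
print(contigs)
```

The k-mers introduced by the erroneous read occur only once and are dropped, so the remaining graph is a single path that spells out the original sequence. Paired-end information would be applied after this step to join and extend contigs.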
Assembly By Short Sequences
(ABySS)
• Use of paired-end reads reduces the ambiguity of repetitive regions.
• Written in C++ and uses the Message Passing Interface (MPI) to communicate between nodes.
Basecalling
Determination of the nucleotide base from the signal in
the trace file produced by a sequencer
http://stat.fsu.edu/~lilei/lilei/research/hmm/simulate.gif
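In its simplest form, base calling picks the channel with the strongest signal at each position. The sketch below is a deliberately naive version: the trace format (one tuple of A, C, G, T intensities per position) and the confidence measure are assumptions for illustration, and real base callers use platform-specific statistical models.

```python
BASES = "ACGT"


def call_bases(trace):
    """Return (base, confidence) per position from four-channel intensities.

    trace: one (A, C, G, T) intensity tuple per position.
    A tie between the two strongest channels is reported as 'N'.
    """
    calls = []
    for intensities in trace:
        ranked = sorted(zip(intensities, BASES), reverse=True)
        (top, base), (second, _) = ranked[0], ranked[1]
        total = sum(intensities)
        confidence = top / total if total else 0.0
        calls.append((base if top > second else "N", confidence))
    return calls


trace = [
    (0.90, 0.05, 0.03, 0.02),   # clear A
    (0.10, 0.10, 0.70, 0.10),   # clear G
    (0.40, 0.40, 0.10, 0.10),   # ambiguous between A and C
    (0.00, 0.10, 0.10, 0.80),   # clear T
]
called = "".join(base for base, _ in call_bases(trace))
print(called)
```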
Basecalling
• PyroBayes
• Alta-Cyclic
• BayesCall
Single Nucleotide Polymorphisms
(SNP) Detection
Sequence variation that occurs when a single
nucleotide base differs between members
of a species or between the paired
chromosomes of an individual.
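A minimal way to detect such a variant from aligned reads is to pile up the bases covering each reference position and report positions where a non-reference base is well supported. The sketch assumes ungapped alignments given as (start, read) pairs; the depth and allele-fraction thresholds are hypothetical, and the tools listed on the next slide use more careful statistical models.

```python
from collections import Counter, defaultdict


def call_snps(reference, alignments, min_depth=4, min_fraction=0.3):
    """Return (position, ref base, alt base, alt count, depth) for candidate SNPs."""
    pileup = defaultdict(Counter)
    for start, read in alignments:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    snps = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue   # too few reads to make a call
        for base, n in counts.most_common():
            if base != reference[pos] and n / depth >= min_fraction:
                snps.append((pos, reference[pos], base, n, depth))
    return snps


reference = "ACGTACGTAC"
alignments = [
    (0, "ACGTAC"),
    (1, "CGTTCG"),   # carries T at position 4
    (2, "GTACGT"),
    (3, "TTCGTA"),   # carries T at position 4
    (4, "ACGTAC"),
]
snps = call_snps(reference, alignments)
print(snps)
```

Two of the five reads covering position 4 carry T instead of the reference A, which is the pattern expected when the two chromosomes of an individual differ at that site.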
SNP Detection
• PbShort
• ssahaSNP
Other Tools
• TagDust: Program for identifying and eliminating artifacts from next generation sequencing data.
• ShortRead: Package for input, quality assessment and exploration of high-throughput sequence data.
The End
Thank you!
Questions?
References
• O. Morozova, et al. 2009. Applications of New Sequencing Technologies for Transcriptome Analysis. Annu. Rev. Genomics Hum. Genet.
• J.C. Vera, et al. 2008. Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol. Ecol. 17:1636-1647.
• N. Cloonan, et al. 2008. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Methods. 5:613-619.
• R.D. Morin, et al. 2008. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res. 18(4):610-21.
• R.K. Thomas, et al. 2007. High-throughput oncogene mutation profiling in human cancer. Nat. Genet. 39:347-51.
• Z. Wang, et al. 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 10(1):57-63.
• T.A. Brown. 2007. Genomics. Garland Science Publishing. Chapter 6.
References cont.
• Venter, C et al: The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biology, 2007.
• Sultan, M et al: A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome. Science, 2008.
• Xia, Q et al: Complete Resequencing of 40 Genomes Reveals Domestication Events and Genes in Silkworm (Bombyx). Science, 2009.
• Li, H, Ruan, J and Durbin, R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 2008, 18(11): 1851–1858. Cold Spring Harbor Laboratory Press.
• Ruan, Y and Wei, C-L: Multiplex parallel pair-end-ditag sequencing approaches in system biology. Genome Technology & Biology Group, Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672.
• SOAP: short oligonucleotide alignment program. Bioinformatics, 2008, 24(5): 713–714. doi:10.1093/bioinformatics/btn025
References cont.
• Hutchison, Clyde A: DNA sequencing: bench to bedside and beyond. Nucleic Acids Research, 2007, 35(18): 6227–6237.
• Breitbart, M, Salamon, P, Andresen, B, Mahaffy, JM, Segall, AM, Mead, D, Azam, F and Rohwer, F: Genomic analysis of uncultured marine viral communities. Proceedings of the National Academy of Sciences USA, 2002, 99: 14250–14255.
• Himmelbauer et al: Plant Genomics in the era of high throughput sequencing: The case of the sugar beet. Next Generation Sequencing, 2009.
• Kua, CS and Cannon, CH: Comparative genomics of Tropical Evergreen Fagaceae. Next Generation Sequencing, 2009.
• Liu, George: Applications and Case Studies of the Next-Generation Sequencing Technologies in Food, Nutrition and Agriculture. Recent Patents on Food, Nutrition & Agriculture, 2009, 1: 75–79.
