Deep Sequencing

Introduction to Bioinformatics Seminar
November 9th, 2009
Angela Benton, Samuel Darko, Prakriti Mudvari and Prisca Takundwa

History of Sequencing 

• "Sanger sequencing" developed by Fred Sanger et al. in the mid-1970s
• Uses dideoxynucleotides for "chain termination", generating fragments of different lengths ending in ddATP, ddGTP, ddCTP or ddTTP

History of Sequencing Cont.
• A schematic of Sanger sequencing

History of Sequencing Cont. 

• DNA fragments are separated by size by gel electrophoresis
• From the gel, the DNA sequence can be determined
• Can produce DNA fragments 700-900 bp long (good), but it's slow (bad)
• Lots of other problems including clone library generation and low throughput
• The Human Genome Project used Sanger sequencing; completion took over 10 years
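The chain-termination idea can be illustrated with a toy simulation. This is a minimal sketch, not a model of the chemistry: it assumes one terminated fragment for every position of the synthesized strand, and the function names `chain_termination_fragments` and `read_gel` are hypothetical.

```python
def chain_termination_fragments(strand):
    """Return (length, terminal_base) for every possible terminated fragment.

    Simplification: 'strand' is the strand being synthesized, so a fragment
    of length n ends in the dideoxy form of strand[n - 1].
    """
    return [(n, strand[n - 1]) for n in range(1, len(strand) + 1)]


def read_gel(fragments):
    """Read the sequence off the gel: shortest fragment first."""
    return "".join(base for _, base in sorted(fragments))


fragments = chain_termination_fragments("GATTACA")
# The reactions yield fragments in no particular order; the gel sorts them by size.
shuffled = fragments[::-1]
print(read_gel(shuffled))
```

Sorting by fragment length stands in for gel electrophoresis; reading the terminal base of each fragment in size order recovers the sequence.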

Next Generation Sequencers 

• Next (or 2nd) generation sequencers came onto the scene in the early 2000s
• General characteristics include:
- Amplification of genetic material by PCR
- Ligation of amplified material to a solid surface
- Sequence of the target genetic material is determined using sequencing-by-synthesis (using labelled nucleotides or pyrosequencing for detection) or sequencing-by-ligation
- Sequencing done in a massively parallel fashion and sequence information is captured by a computer

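Next Gen. Sequencers Cont.
• ABI3730xl Genome Analyzer
- Sequencing chemistry: Automated Sanger sequencing
- Template amplification method: In vivo amplification via cloning
- Read length: 700-900 bp
- Sequencing throughput: 0.03-0.07 Mb/h
• Roche (454) FLX
- Sequencing chemistry: Pyrosequencing on solid support
- Template amplification method: Emulsion PCR
- Read length: 200-300 bp
- Sequencing throughput: 13 Mb/h
• Illumina Genome Analyzer
- Sequencing chemistry: Sequencing-by-synthesis with reversible terminators
- Template amplification method: Bridge PCR
- Read length: 32-40 bp
- Sequencing throughput: 25 Mb/h
• ABI SOLiD
- Sequencing chemistry: Sequencing by ligation
- Template amplification method: Emulsion PCR
- Read length: 35 bp
- Sequencing throughput: 21-28 Mb/h
• HeliScope
- Sequencing chemistry: Sequencing-by-synthesis with virtual terminators
- Template amplification method: None (single molecule)
- Read length: 25-35 bp
- Sequencing throughput: 83 Mb/h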

Next Gen. Sequencers Cont.

Next Gen. Sequencers Cont.
• [Figure: single-molecule sequencing on the HeliScope, showing template positions against base-addition cycles (G, C, A, T). Provided to author courtesy of Helicos representative.]

Next Gen. Sequencers Cont.
• Sequencing-by-ligation on SOLiD

Next Gen vs Sanger
• Let's think about the domesticated silkworm genome
- The reference genome is about 432 Mb
- It was assembled from approximately 8.5 fold coverage
• Platform: sequencing speed, time to sequence (days)
- ABI3730xl Genome Analyzer: 0.03-0.07 Mb/h, 2185.7 days
- Roche (454) FLX: 13 Mb/h, 11.8 days
- Illumina Genome Analyzer: 25 Mb/h, 6.1 days
- ABI SOLiD: 21-28 Mb/h, 5.5 days
- Helicos HeliScope: 83 Mb/h, 1.8 days
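The times to sequence on this slide follow from simple arithmetic: genome size times fold coverage, divided by throughput. The sketch below reproduces that calculation from the slide's figures (432 Mb, 8.5 fold coverage). Using the upper bound where a throughput range is given is an assumption, and the function name `days_to_sequence` is hypothetical.

```python
GENOME_SIZE_MB = 432      # silkworm reference genome size, from the slide
FOLD_COVERAGE = 8.5       # coverage used for the assembly, from the slide

# Throughput in Mb/h; where the slide gives a range, the upper bound is assumed.
THROUGHPUT_MB_PER_H = {
    "ABI3730xl Genome Analyzer": 0.07,
    "Roche (454) FLX": 13,
    "Illumina Genome Analyzer": 25,
    "ABI SOLiD": 28,
    "Helicos HeliScope": 83,
}


def days_to_sequence(throughput_mb_per_h):
    """Days of continuous sequencing needed to reach the target coverage."""
    total_mb = GENOME_SIZE_MB * FOLD_COVERAGE
    return total_mb / throughput_mb_per_h / 24


for platform, rate in THROUGHPUT_MB_PER_H.items():
    print(f"{platform}: {days_to_sequence(rate):.1f} days")
```

The calculation ignores library preparation, run setup and failed runs, so it is a lower bound on wall-clock time rather than a project estimate.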

Bioinformatics
• Because of the massively parallel nature of next gen sequencers, huge amounts of data are produced quickly, requiring terabytes of storage
• New bioinformatics tools were developed to utilize the huge number of much shorter reads (~35 bp vs ~800 bp)
- Bowtie: Ultrafast, memory-efficient short read aligner
- SOAPdenovo: Part of the SOAP suite, used to build reference genome
- TopHat: TopHat is a fast splice junction mapper for RNA-Seq reads

Applications
• Novel whole genome sequencing
- The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific
• Whole genome resequencing
- Complete Resequencing of 40 Genomes Reveals Domestication Events and Genes in Silkworm (Bombyx)
• RNA-Seq (transcriptomics)
- A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome


APPLICATIONS
• The range of potential applications for next-generation sequencing is enormous.
• Some examples that will be discussed include application in
- Cellular Genomes using WGS
- Metagenomics
- Genomic Medicine
- Other novel applications

Cellular Genomes
• The advent of automation in sequencing initiated by Craig Venter et al. gave rise to sequencing beyond viruses and organelles.
• In 1995 Venter's group at TIGR reported complete sequences of two bacteria, Haemophilus influenzae and Mycoplasma genitalium.

Cellular Genomes
• Significance
- 1st glimpse of the complete instruction set for a living organism
- An approximation of the minimal set of genes required for cellular life
- Insight into the methods used to come up with these cellular genomes

Cellular Genomes
• Significance
- Paved the way for other cellular genomes such as E. coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster
- Human Genome Project
- Next-generation appeal

Metagenomics
• Getting rid of cultures
• Introduces diversity, includes all genes and potentially all members contributing to a given environment
• Typically use 16S rRNA gene to identify different species and strains
• Advantages:
- Closes the huge gap in sequence data in non-model species
- Many prokaryotes are human pathogens

Metagenomics
• Some examples
• Craig Venter's Global Ocean Voyage
• Breitbart et al. showed that 200 liters of sea water contained >5000 different viruses. Subsequent studies found >1000 viral species in human stool, and the majority of these were new species.

Genomic Medicine
• Sequencing and how it lends itself to medicine
• Implications in diagnosis, treatment and prevention
• Personalized medicine
• $1000 genome
• Some examples include cancer and HIV applications

Other Novel Applications
• Resequencing
• Plants - Sugar beet and Tropical Evergreen Fagaceae
• Junk DNA
• Drug discovery

Transcriptomics
Angela Benton

Background
• Transcriptome
- The complete set of coding and non-coding RNA molecules in a cell at a particular time
- Varies between cell types
• Transcriptomics
- The study of the transcripts in a cell, cell type, organism, etc.

Candidate Gene Analysis
1. Northern blot analysis
- Separation of RNA molecules by size
- Hybridization of a complementary radioactively labeled probe
- Detection method
2. Reverse transcriptase PCR (RT-PCR)
- RNA molecules reverse transcribed into cDNA
- PCR amplified
- Quantification method

Microarray Technology
• High-throughput gene expression profiling
• Hybridization of labeled cDNAs to an array of complementary DNA probes
• Measurement of expression levels based on hybridization intensity

Sequencing-Based Approaches
1. Full-length cDNA (FLcDNA) sequencing
- Complete sequencing of cDNA clone
2. Expressed sequence tag (EST) sequencing
- Single-pass sequencing of cDNA clone
3. Serial Analysis of Gene Expression (SAGE)
- Short sequence tags at 3' end of transcript
- Tags concatenated and sequenced

RNA-Seq
• Alternative to Sanger sequencing
• RNA molecules converted into library of cDNA fragments
- Adaptors attached to one/both ends
• Short sequence reads obtained
• Aligned to reference genome and classified as:
1. Exons
2. Junctions
3. Poly-A ends
• Can be used to assemble de novo sequences

Next Generation Sequencing Applications
Protein-coding gene annotation
- Transcriptome sequences can be aligned:
  - To genome of same species
  - To genome of related species
- Discovery of novel exons and introns
- Long read lengths: de novo analyses
- Short read lengths: novel splicing events

Next Generation Sequencing Applications
Gene expression profiling
- SAGE method
- 5'-RATE method (454 sequencing)
- 3'-UTR method (454 sequencing)

Next Generation Sequencing Applications
Noncoding RNA (ncRNA) discovery
- ncRNA not translated into protein product
- Role in regulation of development and cell fate determination
- Three kinds:
  1. Micro RNAs (miRNAs)
  2. Small interfering RNAs (siRNAs)
  3. Piwi-interacting RNAs (piRNAs)

Next Generation Sequencing Applications
Transcript rearrangement discovery
- Genome rearrangements common in human cancers
- Includes:
  1. Translocations
  2. Inversions
  3. Indels
  4. Copy number variants
- Paired-end sequencing
- Infers presence of rearrangement
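How paired-end reads hint at a rearrangement can be sketched as a comparison between where two mates map and where they would be expected to map. This is a simplified illustration under stated assumptions: the function name `classify_pair`, the expected insert size of 300 bp and the tolerance of 50 bp are all hypothetical, and copy number variants (usually inferred from read depth) are not covered.

```python
def classify_pair(mate1, mate2, expected_insert=300, tolerance=50):
    """Label a read pair; each mate is (chromosome, position, strand)."""
    chrom1, pos1, strand1 = mate1
    chrom2, pos2, strand2 = mate2
    if chrom1 != chrom2:
        # Mates from one fragment should land on the same chromosome.
        return "possible translocation"
    if strand1 == strand2:
        # Mates are expected on opposite strands, facing each other.
        return "possible inversion"
    distance = abs(pos2 - pos1)
    if distance > expected_insert + tolerance:
        # Mates map too far apart: sequence is missing in the sample.
        return "possible deletion"
    if distance < expected_insert - tolerance:
        # Mates map too close together: extra sequence in the sample.
        return "possible insertion"
    return "concordant"


print(classify_pair(("chr1", 1000, "+"), ("chr1", 1300, "-")))
print(classify_pair(("chr1", 1000, "+"), ("chr7", 5000, "-")))
```

A single discordant pair is weak evidence; in practice many pairs supporting the same event would be required before calling a rearrangement.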

Bioinformatic Implications
• Large amounts of data generated
• Tools are needed to aid in:
- Storage
- Retrieval
- Processing
- Interpretation
- Integration

Bioinformatics of Deep Sequencing
Prakriti Mudvari

Bioinformatics of Deep Sequencing
http://www.cbcb.umd.edu/research/viewer.jpg

http://www.k.htm .jp/pros-e/person/shinichi_morishita/ Basics.

Creating a Paired End Tag
http://media.wiley.com/wires/WSBM/WSBM40/nfig001.jpg

Paired End vs. Unpaired Reads
• Millions of reads are generated.
• Repetitive regions within the genome cause the reads to be mapped to multiple locations.
• Polymorphism in a read can cause it to be mapped to a wrong location.
• Discarding ambiguous reads can reduce coverage.
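The ambiguity caused by repeats, and how a mate can resolve it, can be shown with a toy example. This is a sketch under simplifying assumptions: exact matching only, both mates on the same strand, and "insert size" taken as the distance between the two start positions. The names `find_all` and `resolve_with_mate` and the sequences are invented for illustration.

```python
def find_all(reference, read):
    """Return every 0-based position where read matches reference exactly."""
    hits, start = [], reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


def resolve_with_mate(reference, read, mate, insert_size, tolerance=2):
    """Keep only read placements with a mate hit at about the expected distance."""
    mate_hits = find_all(reference, mate)
    return [
        pos for pos in find_all(reference, read)
        if any(abs((m - pos) - insert_size) <= tolerance for m in mate_hits)
    ]


repeat = "ACGTACGT"
reference = repeat + "TTGACCAGGT" + repeat + "CCATGCAAT"

print(find_all(reference, repeat))                                   # two candidate locations
print(resolve_with_mate(reference, repeat, "CCATGC", insert_size=8))  # one survives
```

On its own the read from the repeat maps to two places and would have to be discarded; the uniquely mapping mate lets it be kept, which is why paired ends help preserve coverage.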

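Comparison of Output
• ABI 3730 Genome Analyzer: read length 700-900 bp, sequencing throughput 0.03-0.07 Mb/h
• 454: read length 200-300 bp, sequencing throughput 13 Mb/h
• Illumina Genome Analyzer: read length 32-40 bp, sequencing throughput 25 Mb/h
• ABI SOLiD: read length 35 bp, sequencing throughput 21-28 Mb/h
• HeliScope: read length 25-35 bp, sequencing throughput 83 Mb/h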

Challenges
• Quality of data
• Storage
• Cross Platform Analysis
• Data Annotation
• Assembly
• SNP/Mutation Detection

Bioinformatics Tools
• Alignment of reads to reference genome
• Assembly of de novo sequence
• Quality Control & Base Calling
• Polymorphism detection
• Genome browsing and annotation

Alignment of reads
• Reads generated from sequencing are mapped to a reference genome.
• Conventional tools like BLAST or BLAT do not work well with short sequence reads.
• Existing alignment algorithms are modified to handle short reads.

Alignment Tools
• Cross_match
• ELAND
• Exonerate
• MAQ
• Mosaik
• SHRiMP
• SOAP
• Zoom!

Short Oligonucleotide Alignment Program (SOAP)
• Maps short oligonucleotides to reference sequence in a gapped or ungapped alignment.
• Can be used for single as well as paired end alignments.
• Allows at most two mismatches per read or one continuous gap of size 1-3 bp when aligning.
• No mismatches allowed in the flanking region.
• Best hit is the one with least number of mismatches or smallest gap.
• Iteratively trims the several basepairs at the 3' end that have the highest number of sequencing errors, and realigns.
• Loads reference sequence into memory instead of reads.
• Uses seed and hash-lookup algorithm to accelerate alignment.
• Written in C++.
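The seed and hash-lookup idea can be sketched in a few lines. This is a generic illustration of the technique, not SOAP's actual implementation: it handles ungapped alignment only, and the names `build_index`, `mismatches` and `align`, the seed length and the example sequences are assumptions made for the example.

```python
from collections import defaultdict


def build_index(reference, k):
    """Hash every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align(read, reference, index, k, max_mismatches=2):
    """Return (position, mismatch_count) of the best ungapped hit, or None."""
    best = None
    # Split the read into non-overlapping seeds. If that yields three seeds and
    # the read has at most two mismatches, at least one seed matches exactly.
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for hit in index.get(seed, []):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            n = mismatches(read, reference[start:start + len(read)])
            if n <= max_mismatches and (best is None or n < best[1]):
                best = (start, n)
    return best


reference = "GATTACAGGCTTAACCGTAGCATCGGAT"
index = build_index(reference, 4)           # the reference, not the reads, is held in memory
print(align("GATTAAGCGTAG", reference, index, 4))   # read with two mismatches
```

The hash lookup restricts the expensive base-by-base comparison to a handful of candidate positions, which is what makes aligning millions of short reads tractable.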

Assembly
• De novo sequencing involves assembling overlapping reads to form a contiguous sequence of DNA.
• Done in cases where there's no genomic information available.

Assembly
• ABySS
• ALLPATHS
• Edena
• Euler-SR
• SHARCGS
• SHRAP
• SSAKE
• Velvet

Assembly By Short Sequences (ABySS)
• Originally developed for de novo assembly of large genomes using short reads.
• Is a distributed representation of a de Bruijn graph that allows parallel computation of the algorithm across a network of computers.
• Assembly is done in two steps.
• First, all possible substrings of a specific length are generated from the sequence reads. The substring dataset is then processed to remove errors, and contiguous sequences are built without using paired end information.
• Mate pair information is then used to extend the contigs.

Assembly By Short Sequences (ABySS)
• Use of paired end reads reduces the ambiguity of repetitive regions.
• Written in C++ and uses Message Passing Interface (MPI) to communicate between nodes.
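The first assembly step, generating fixed-length substrings and linking them in a de Bruijn graph, can be sketched on a single machine. This is a toy illustration, not ABySS itself: it omits error removal, paired-end extension and the distributed graph, and the names `kmers`, `de_bruijn` and `walk_contig` and the example reads are hypothetical.

```python
from collections import defaultdict


def kmers(reads, k):
    """All distinct substrings of length k found in the reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for kmer in kmers(reads, k):
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_contig(graph, start):
    """Extend a contig from start while exactly one successor exists."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:          # stop instead of looping around a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]   # overlapping reads from one short sequence
graph = de_bruijn(reads, 4)
print(walk_contig(graph, "ATG"))
```

The walk stops wherever a node has more than one successor, which is where a repeat makes the path ambiguous; that is the point at which mate pair information becomes useful for extending contigs.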

Basecalling
• Determination of the nucleotide base at each position from the signal in the trace file produced by a sequencer

Basecalling
• PyroBayes
• Alta-Cyclic
• BayesCall

Single Nucleotide Polymorphisms (SNP) Detection
• Sequence variation caused when a single nucleotide base differs between different members of a species or between the two chromosomes of an individual.

SNP Detection
• PbShort
• ssahaSNP
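The basic idea behind SNP detection from aligned reads can be sketched as a pileup with simple thresholds. This is a naive illustration and not the method of any tool listed here: the function name `call_snps`, the `min_depth` and `min_fraction` thresholds and the example reads are assumptions, and base qualities and heterozygous calls are ignored.

```python
from collections import Counter


def call_snps(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Report (position, reference_base, observed_base) for well-supported differences.

    aligned_reads is a list of (start, sequence) ungapped placements.
    """
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue                      # too few reads to trust a call
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps


reference = "ACGTACGTAC"
aligned_reads = [(0, "ACGTTC"), (2, "GTTCGT"), (3, "TTCGTA"), (4, "TCGTAC")]
print(call_snps(reference, aligned_reads))
```

Requiring a minimum depth and a minimum supporting fraction is a crude way to separate real variants from sequencing errors, which is why data quality and coverage matter so much for this step.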

Other Tools
• TagDust: Program for identifying and eliminating artifacts from next generation sequencing data.
• ShortRead: Package for input, quality assessment and exploration of high-throughput sequence data.

The End
Thank you! Questions?

References
• O. Morozova, et al. Applications of New Sequencing Technologies for Transcriptome Analysis. Annu. Rev. Genomics Hum. Genet. 2009.
• R. Morin, et al. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res. 2008. 18(4):610-21.
• T.A. Brown. Genomics. Chapter 6. Garland Science Publishing. 2001.
• J.C. Vera, et al. Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol. Ecol. 2008. 17:1636-1637.
• Z. Wang, et al. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009. 10(1):57-63.
• N. Cloonan, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Methods. 2008. 5:613-619.
• R.K. Thomas, et al. High-throughput oncogene mutation profiling in human cancer. Nat. Genet. 2007. 39:347-51.

References cont.
• Venter, C et al. The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biology. 2007.
• Xia, Q et al. Complete Resequencing of 40 Genomes Reveals Domestication Events and Genes in Silkworm (Bombyx). Science. 2009.
• Sultan, M et al. A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome. Science. 2008.
• "SOAP: short oligonucleotide alignment program" (2008). Bioinformatics. Vol. 24 no. 5 2008, pages 713-714. doi:10.1093/bioinformatics/btn025
• Heng Li, Jue Ruan and Richard Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008 November. 18(11): 1851-1858. Cold Spring Harbor Laboratory Press.
• Yijun Ruan, Chia-Lin Wei. Multiplex parallel pair-end-ditag sequencing approaches in system biology. Genome Technology & Biology Group, Genome Institute of Singapore, 60 Biopolis Street, Singapore 138672.

References cont.
• Hutchison, Clyde A. DNA Sequencing: bench to bedside and beyond. Nucleic Acids Research. 2007. Vol 35, No. 18, 6227-6237.
• Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, Mead D, Azam F, Rohwer F (2002). "Genomic analysis of uncultured marine viral communities". Proceedings of the National Academy of Sciences USA 99: 14250-14255.
• Liu, George. Applications and Case Studies of the Next-Generation Sequencing Technologies in Food, Nutrition and Agriculture. Recent Patents on Food, Nutrition & Agriculture. 2009. 1, 75-79.
• Kua, CS and Cannon, CH. Comparative genomics of Tropical Evergreen Fagaceae. Next Generation Sequencing. 2009.
• Himmelbauer et al. Plant Genomics in the era of high throughput sequencing: The case of the sugar beet. Next Generation Sequencing. 2009.