
DNA SEQUENCING

Dr Z Chikwambi
Biotechnology
Objectives
• What is DNA Sequencing ?

• History of development

• Basic Methods: Chain termination and Chemical modification method
What is DNA Sequencing ?

• Determining the precise order of nucleotides in DNA.

• Sequencing reads out the order of the nucleotide bases along a strand of DNA.
The Need for DNA Sequencing
• Gene isolation
• Sequence characterization
• Forensics
• Molecular Archeology
• Gene-gene interaction
• Gene-protein interaction
• Cloning
Deoxyribonucleic Acid
• Deoxyribonucleic acid (DNA) is a nucleic acid that
functions for:

– Storage of genetic information


– Self-duplication & inheritance
– Expression of the genetic message

• DNA’s major function is to code for proteins. Information is encoded in the order of the nitrogenous bases.

Adenine Cytosine Guanine Thymine


DNA
• Deoxyribonucleic Acid
• Stores genetic information
• Four different nucleotides A,T,G,C
• DNA is a long molecule analogous to a chain; the links of the chain are called nucleotides
Historical Timeline
1869 – Miescher discovers DNA
1944 – Avery: Proposes DNA as ‘Genetic Material’
1953 – Watson & Crick: “double helical structure”
1970 – Wu: Sequences λ Cohesive End DNA
1977 – Sanger: Dideoxy Chain Termination
1977 – Gilbert: Chemical Degradation
1986 – Partial Automation
1990 – Cycle Sequencing, Improved Sequencing Enzymes, Improved fluorescent detection schemes
2002 – NGS: 454, pyrosequencing
Cost per Genome
Sequencing Methods
• To determine the order of the nucleotide
bases adenine, guanine, cytosine, and
thymine in a molecule of DNA two methods
were used
1. Maxam and Gilbert; Chemical Sequencing
2. Sanger; Chain Termination Sequencing
• These two are conventional methods
• Robotics and automated sequencing are
based on these methods
Maxam and Gilbert Method
• In 1976–1977, Allan Maxam and Walter Gilbert
developed a DNA sequencing method based on
chemical modification of DNA and subsequent
cleavage at specific bases

I. Chemical Modification of DNA; radioactive labeling at one 5' end of the DNA (typically by a kinase reaction using gamma-32P ATP)
II. Purification of the DNA fragment to be sequenced
III. Chemical treatment generates breaks in DNA
IV. Run on the gel
Chemical Modification and
Cleavage
• Polynucleotide kinase adds a radioactive label at one 5' end of the DNA using gamma-32P ATP
5′ G A C G T G C A A C G A A 3′

32P-5′ G A C G T G C A A C G A A 3′
Chemical Modification and
Cleavage
• Base modification using dimethyl sulphate (DMS) and formic acid
– Purine
• Adenine
• Guanine
– DMS only: G
– Formic acid: G+A

• Cleavage of the sugar-phosphate backbone using piperidine
Chemical Modification and
Cleavage
• Base modification using hydrazine
– Pyrimidine
• Cytosine
• Thymine
– Hydrazine: C+T
– Hydrazine + NaCl: C

• Cleavage of the sugar-phosphate backbone using piperidine
Maxam Gilbert Sequencing

[Figure: cleavage products in the four reaction lanes: DMS (G), FA (G+A), H (C+T), H+S (C)]

32P-5′ G A C G T G C A A C G A 3′
Maxam-Gilbert Sequencing
[Figure: autoradiograph with lanes G, G+A, T+C and C; shortest fragments at the bottom (5′), longer fragments at the top (3′)]

Sequencing gels are read from bottom to top (5′ to 3′).

32P-5′ G A C G T G C A A C G A 3′
Maxam Gilbert Sequencing: Process Summarized

1. Label 5’- end of DNA
2. Aliquot DNA sample in 4 tubes
3. Perform base modification reaction
4. Perform Cleavage reaction
5. Perform Gel Electrophoresis
6. Perform Autoradiography
7. Interpret results
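The steps above can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: every reaction cleaves perfectly at its target bases, and a fragment is represented just by its length. The function names `cleavage_lanes` and `read_gel` are made up for this sketch.

```python
# Toy model of Maxam-Gilbert sequencing (illustrative names, not a lab protocol).
# Each reaction lane cleaves at the listed bases.
REACTIONS = {"G": "G", "G+A": "GA", "T+C": "TC", "C": "C"}


def cleavage_lanes(seq):
    """For each lane, return the lengths of the 5'-labelled fragments.

    Cleaving at 0-based position i destroys that base and leaves a labelled
    fragment made of the i bases in front of it.
    """
    return {
        lane: {i for i, base in enumerate(seq) if base in targets}
        for lane, targets in REACTIONS.items()
    }


def read_gel(lanes, n_bases):
    """Read bands from the shortest fragment upward (bottom to top, 5' to 3')."""
    called = []
    for length in range(n_bases):
        if length in lanes["G"]:
            called.append("G")      # band in G (and also in G+A)
        elif length in lanes["G+A"]:
            called.append("A")      # band in G+A only
        elif length in lanes["C"]:
            called.append("C")      # band in C (and also in T+C)
        elif length in lanes["T+C"]:
            called.append("T")      # band in T+C only
        else:
            raise ValueError(f"no band for fragment length {length}")
    return "".join(called)


sequence = "GACGTGCAACGAA"  # the labelled strand shown on the earlier slide
lanes = cleavage_lanes(sequence)
print(read_gel(lanes, len(sequence)))
```

A band that appears in both the G and G+A lanes is called G, while a band in G+A alone is called A; the same logic separates C from T.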
Sanger; Chain Termination Sequencing

• It is a polymerase-based method (the modern cycle-sequencing form uses PCR-style thermal cycling)
• A modified DNA replication reaction
• Growing chains are terminated by dideoxynucleotides
The 3′-OH group necessary for formation of the phosphodiester bond is missing in ddNTPs
Sanger; Chain Termination
Sequencing
Sequence being synthesized: A G C T G C C C G

A reaction (ddATP + four dNTPs):
ddA
dAdGdCdTdGdCdCdCdG

C reaction (ddCTP + four dNTPs):
dAdGddC
dAdGdCdTdGddC
dAdGdCdTdGdCddC
dAdGdCdTdGdCdCddC

G reaction (ddGTP + four dNTPs):
dAddG
dAdGdCdTddG
dAdGdCdTdGdCdCdCddG

T reaction (ddTTP + four dNTPs):
dAdGdCddT
dAdGdCdTdGdCdCdCdG
Chain Termination Sequencing

[Figure: sequencing gel with lanes G, A, T and C; shorter fragments at the bottom (5′), longer fragments at the top (3′); each band ends in a dideoxynucleotide, e.g. ddG]

Sequencing gels are read from bottom to top (5′ to 3′)


Sanger Sequencing: An Example
5’-TACACGATCGA-3’
3’-ATGTGCTAGCT-5’

Denature the sequence


Use only forward primer i.e. using 3’-5
Template in each reaction: 3’-ATGTGCTAGCT-5’

Amplification in ddTTP
5’-T-3’
5’-TACACGAT-3’

Amplification in ddATP
5’-TA-3’
5’-TACA-3’
5’-TACACGA-3’
5’-TACACGATCGA-3’

Amplification in ddGTP
5’-TACACG-3’
5’-TACACGATCG-3’

Amplification in ddCTP
5’-TAC-3’
5’-TACAC-3’
5’-TACACGATC-3’
Reading Sequence
[Figure: gel with lanes ddATP, ddTTP, ddGTP and ddCTP; bands are sized from 1 bp at the bottom (5’) to 12 bp at the top (3’), and the sequence is read upward]
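The worked example can be reproduced with a short sketch. It assumes ideal chemistry, where every possible termination product appears exactly once, and the helper names are invented for illustration.

```python
# Toy model of Sanger chain termination for the 3'-ATGTGCTAGCT-5' template.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def synthesized_strand(template_3_to_5):
    """Complement a template written 3'->5' to get the new strand 5'->3'."""
    return template_3_to_5.translate(COMPLEMENT)


def termination_fragments(new_strand):
    """Group every prefix of the new strand by the ddNTP that ended it."""
    fragments = {base: [] for base in "ATGC"}
    for end in range(1, len(new_strand) + 1):
        fragments[new_strand[end - 1]].append(new_strand[:end])
    return fragments


def read_gel(fragments):
    """Order bands by fragment length and call the terminal base of each."""
    bands = sorted(
        (len(frag), base) for base, frags in fragments.items() for frag in frags
    )
    return "".join(base for _, base in bands)


strand = synthesized_strand("ATGTGCTAGCT")
frags = termination_fragments(strand)
for base in "TAGC":
    print(f"dd{base}TP:", frags[base])
print("read from gel:", read_gel(frags))
```

Reading the bands from the shortest fragment to the longest gives the synthesized strand in the 5’ to 3’ direction.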
Sanger Sequencing: Process Summarized

1. Get enough quantity of DNA (Run PCR)
2. Aliquot DNA into four different tubes
3. Prepare PCR reaction mix as below:
• Primer, Taq polymerase, template (ssDNA), all four dNTPs, and one ddNTP per tube (ddATP, ddGTP, ddCTP and ddTTP respectively)
4. Run PCR
5. Perform Gel Electrophoresis
6. Interpret results
Principles of DNA Sequencing

[Figure: DNA fragment with a primer site cloned into pBR322 (Amp, Tet, Ori)]

Denature with heat to produce ssDNA, then add Klenow + ddNTP + dNTP + primers
COMPARISON
Sanger Method | Maxam Gilbert Method
Enzymatic | Chemical
Requires DNA synthesis | Requires DNA
Termination of chain elongation | Breaks DNA at different nucleotides
Automation | Automation is not available
Single-stranded DNA | Double-stranded or single-stranded DNA
Dye Sequencing
• Four different labels
– Chains terminated by each of the four ddNTPs carry a different dye
– Individual dyes fluoresce at unique wavelengths
• Used in the vast majority of sequencing projects
– easier
– cheaper
Sample: Dye Sequencing Output
Automated procedure for DNA
sequencing

A computer read-out of the gel generates a “false color” image where each color corresponds to a base. Then the intensities are translated into peaks that represent the sequence.
High-throughput sequencing:
Capillary electrophoresis
The human genome project has spurred an effort to develop faster, higher throughput, and less expensive technologies for DNA sequencing. Capillary electrophoresis (CE) separation has many advantages over slab gel separations. CE separations are faster and are capable of producing greater resolution. CE instruments can use tens and even hundreds of capillaries simultaneously. The figure shows a simple CE setup where the fluorescently-labeled DNA is detected as it exits the capillary.

[Figure: CE detection setup with laser, focusing lens, sheath flow cuvette, sheath flow, beam block, collection lens, filter and PMT]
Sieving matrix for CE
• It is not easy to analyze DNA in capillaries filled only with buffer.
• That is because DNA fragments of different lengths have the same
charge to mass ratio.
• To separate DNA fragments of different sizes the capillary needs to
be filled with sieving matrix, such as linear polyacrylamide
(acrylamide polymerized without bis-acrylamide).
• This material is not rigid like a cross-linked gel but looks much like
glycerol. With a little bit of effort it can be pumped in and out of
the capillaries.
• To simulate the separation characteristics of an agarose gel one can
use hydroxyethylcellulose. It is not much more viscous than water
and can easily be pumped into the capillaries.
Fluorescent end labeling of DNA
The Basics of NGS Chemistry
• NGS technology is similar to CE sequencing.
• DNA polymerase catalyzes the incorporation of fluorescently
labeled deoxyribonucleotide triphosphates (dNTPs) into a
DNA template strand during sequential cycles of DNA
synthesis.
• During each cycle, at the point of incorporation, the
nucleotides are identified by fluorophore excitation.
• The critical difference is that, instead of sequencing a single DNA fragment, NGS extends this process across millions of fragments in a massively parallel fashion, using sequencing by synthesis (SBS) chemistry.
The Shotgun Sequencing Principle

Isolate chromosome → Shear DNA into fragments → Clone into seq. vectors → Sequence & reassemble contigs → Reconstruct chromosome
Shotgun Sequencing

Sequence chromatogram → Send to computer → Assembled sequence
The Finished Product: DNA
Sequence
GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGA
TTAGAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGAT
Chromatogram Editing
Sequence Analysis
Sequence Alignment
Sequence Assembly
• Used to assemble DNA contigs
» Fragmented data from DNA sequencers
• Usually based on a reference sequence/ genome
– Unknown regions mapped onto annotated homologous
region on the reference.
» Reference mapping based on alignment and BLAST algorithms
• Includes detection of statistically significant overlaps
» Used for merging neighbouring contigs (assembly)
• Small contigs ultimately assembled into a continuous
sequence

5' 3'
Sequence assembly

(next generation sequencing)


Genome/ Contig Alignment: Process

Different clone contigs/genome fragments of DNA sequences:
ATCGATGCGTAGC
TAGCAGACTACCGTT
GTTACGATGCCTT
TGCTACGCATCG (reverse complement: CGATGCGTAGCA)

Aligned fragments:
ATCGATGCGTAGC
  CGATGCGTAGCA
         TAGCAGACTACCGTT
                     GTTACGATGCCTT

Assembled sequence:
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
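The overlap-and-merge idea in this example can be sketched with a small greedy assembler. This is a teaching sketch, not how production assemblers work: it assumes error-free reads, a hypothetical minimum overlap of 3 bases, and it repeatedly merges the pair of reads with the longest suffix-prefix overlap.

```python
# Greedy overlap assembly of the four fragments shown on this slide.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining contigs cannot be joined
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads) if n not in (best_i, best_j)]
        reads.append(merged)
    return reads


fragments = [
    "ATCGATGCGTAGC",
    "TAGCAGACTACCGTT",
    "GTTACGATGCCTT",
    reverse_complement("TGCTACGCATCG"),  # read from the opposite strand
]
print(greedy_assemble(fragments))
```

Greedy merging can be misled by repeats, which is one reason real assemblers rely on quality scores and paired-end information.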
High Throughput DNA
Sequencing:
Next Generation Sequencing
(NGS) Technologies
Available Next-generation Sequencing
Platforms

• Solexa, Illumina
• SOLiD, Applied Biosystems (ABI)
• 454, Roche
• Polonator
• HeliScope
• …
7x Illumina GA-II 2x Roche 454 1x Illumina HiSeq 2000
DNA Sequencing Capability Has Grown
Exponentially

DNA sequences in GenBank

Doubling time = 18 months


Next Generation Sequencing
• 454 Life Sciences/Roche
– Genome Sequencer FLX: currently produces 400-600 million
bases per day per machine
– Published 1 million bases of Neanderthal DNA in 2006
– May 2007 published complete genome of James Watson (3.2
billion bases ~20x coverage)
• Solexa/Illumina - sequencing by synthesis
– 10 GB per machine/week
– May 2008 published complete genomes for 3 hapmap subjects
(14x coverage)
• ABI/ SOLiD
– SOLiD = (Sequencing by Oligonucleotide Ligation and Detection)
– 20 GB per machine/week
“Paradigm Shift”

• Standard ABI “Sanger” sequencing


– 96 samples/day
– Read length ~650 bp
– Total = 450,000 bases of sequence data
• 454 was the game changer!
– ~400,000 different templates (reads)/day
– Read length ~250 bp
– Total = 100,000,000 bases of sequence data!!!
Solexa Steps Up The Game

• Solexa (Illumina GA)


– 60,000,000 different sequence templates
(yes that is an insane 60 million reads)
– 36 bp read length
– 4 billion bases of DNA per run (3 days)
Principles of NGS Technology:
Nanotechnology
• Each system works differently, but they are all
based on similar principles:
– Shear target DNA into small pieces
– Bind individual DNA molecules to a solid surface,
– Amplify each molecule into a cluster
– Copy one base at a time and detect different signals for
A, C, T, & G bases
– Requires very precise high-resolution imaging of tiny
features
• (Solexa has 800 images @ 4 megapixels each)
Huge Amount of Image Data
• The raw image data is truly huge:
• 1-2 TB for the Solexa, more for ABI-SOLID, less for 454

• The images are immediately processed into intensity data (spots w/ location and brightness)
• Intensity data is then processed into base-calls
(A, C, T, or G plus a quality score for each)
• Base-call data is on the order of 5-10 GB per
run (or a week of runs for 454).
Comparison of Existing 2nd generation
DNA sequencing technologies

Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)
Applications of Next-generation
Sequencing
Third Generation Technologies

• Nanopore sequencing technology
– Nucleic acids driven through a nanopore.
– Differences in conductance of pore provide readout.

• Real-time monitoring of DNA polymerase activity
– Read-out by fluorescence resonance energy transfer between polymerase and nucleotides or
– Waveguides allow direct observation of polymerase and fluorescently labeled nucleotides
NGS Technology Primary Tools

1. Image processing (unique to each manufacturer)


2. Basecalling – techniques for identifying the
nucleotide sequence (unique to each manufacturer)
3. Align sequence reads to reference genome
4. Assemble contigs and whole genomes using quality
scores and/or paired-end information
5. Annotation, SNP calling/genotyping
– In the case of RNA sequence, transcript profiling can be
done thereafter: Measurement of gene expression,
identifying alternative splicing, etc.
NGS Principles: Cyclic-array DNA
Sequencing Methods
1. DNA is fragmented
2. Adaptors ligated to fragments
3. Several possible protocols yield
array of PCR colonies.
4. Enzymatic extension with
fluorescently tagged nucleotides.
5. Cyclic readout by imaging the
array.
Clonal Amplification Features of Second-
Generation Sequencing Technologies

(a) The Roche 454 and ABI SOLiD platforms rely on emulsion PCR to amplify clonal sequencing features. An adaptor-flanked shotgun library is PCR amplified in a water-in-oil emulsion. One of the PCR primers is 5'-anchored to the surface of micron-scale beads. PCR amplicons are held on the surface of the bead as clonal amplicons. 454 detects the light signal produced by luciferase activity on dNTP incorporation.

(b) The Solexa technology relies on bridge PCR (aka 'cluster PCR') to amplify clonal DNAs. An adaptor-flanked shotgun library is PCR amplified using both primers, which densely coat the surface of a solid substrate, attached at their 5' ends by a flexible linker. Amplification products originating from any given member of the template library form a clonal cluster (approx. 1,000 copies).
NGS Principles: Emulsion PCR

• DNA fragments, with adaptors, are PCR amplified within a water drop in oil (emulsion).
• One primer is attached to the surface of a bead.
• Used by:
» 454 (Roche)
» Polonator and
» SOLiD Applied Biosystems (ABI)
454 “Pyrosequencing”

• First high-throughput DNA sequencer
• 1st commercially available in 2004
• Uses pyrosequencing, beads, and a microtiter plate
• Signal detection based on light emitted by luciferase activity when a dNTP is incorporated (the released pyrophosphate drives the light reaction)
NGS Principles: Bridge PCR

1. DNA fragments are flanked with adaptors (A & B).


2. A flat surface coated with two types of primers (F & R)
– Each primer respectively corresponds to the adaptors (A & B)
» I.e. complementary to a respective adaptor.
3. Amplification proceeds in cycles, with one end of each bridge
anchored to the surface.
4. Used by Solexa (Illumina).
Illumina
Genome Analyzer

• Originally developed by Solexa, now a subsidiary of Illumina.
• Commercially available in 2006
• Now produces 8-12 million reads per sample of 36
bp length = 10 GB/week.
• Run takes 3 days for 7 samples.
• Low error rate, mostly base changes, few indels
Illumina Genome Analyzer

Richard K. Wilson
Example: Illumina/Solexa
Sequencing
Example: Illumina/Solexa Sequencing

Stages 7-9 (Base calling)
Image the 1st and 2nd bases, then repeat the imaging over multiple cycles to build up the sequence reads
Stage 10 (Align data)
More than 50 million clusters per flow cell, each containing about 1,000 copies of the same template, produce 1 billion bases per run.
Imaging One (of 800) Tiles on
Solexa Sequencer
Strategies for cyclic array sequencing
(a)454 platform: Clonally amplified beads generated by
emulsion PCR serve as sequencing features deposited
into tiny wells. With pyrosequencing, each cycle
consists of the introduction of a single nucleotide
species, followed by addition of substrate (luciferin,
adenosine 5'-phosphosulphate) to drive light production
at wells where polymerase-driven incorporation of that
nucleotide took place. This is followed by an apyrase
wash to remove unincorporated nucleotide.

(b) Solexa technology: A dense array of clonally amplified sequencing products is generated directly on a surface by bridge PCR. Each sequencing cycle includes the simultaneous addition of a mixture of four modified dNTPs, each with a fluorescent label. A modified DNA polymerase drives extension of the primed sequencing features, followed by imaging in 4 channels and then cleavage of both the fluorescent labels and the terminating moiety.

(c) SOLiD platform: Clonally amplified beads are used to generate a disordered, dense array of sequencing features. Sequencing is performed with a ligase, rather than a polymerase. After ligation and imaging in four channels, the labeled portion of the octamer is cleaved.
(d) HeliScope platform (N/A)
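The pyrosequencing cycle in (a) can be sketched as a flowgram. This is a simplified model under stated assumptions: one nucleotide species is flowed at a time in a fixed cyclic order (the order "TACG" is chosen here only for illustration), and the light signal is taken to be exactly proportional to the number of bases incorporated.

```python
# Toy pyrosequencing flowgram: signal per nucleotide flow.
def flowgram(new_strand, flow_order="TACG"):
    """Return (nucleotide, signal) for each flow until the strand is complete.

    The signal is the length of the homopolymer run incorporated in that flow;
    a flow that matches nothing gives signal 0.
    """
    if set(new_strand) - set(flow_order):
        raise ValueError("strand contains a base that is never flowed")
    signals = []
    pos = 0
    flow = 0
    while pos < len(new_strand):
        nucleotide = flow_order[flow % len(flow_order)]
        run = 0
        while pos < len(new_strand) and new_strand[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append((nucleotide, run))
        flow += 1
    return signals


def decode(signals):
    """Rebuild the sequence from the flowgram."""
    return "".join(nucleotide * run for nucleotide, run in signals)


grams = flowgram("GATTACA")
print(grams)
print(decode(grams))
```

The "TT" run shows up as a single flow with a doubled signal, which is why long homopolymers are harder to call accurately with this chemistry.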
Analysis Tasks
• Base calling
– Depends on the detection method/technology used
» Which nucleotide was incorporated
» Where in the sequence

• Polymorphism detection
» Is it a mutation or a sequencing error?
» E.g. SNPs

• Sequence & sequencing statistics
» Coverage: how many times was the genome covered
» Quality scores: which sequences are of good/bad quality

• Mapping to a reference genome
» Sequence assembly and annotation using a reference genome

• De novo (from scratch) or assisted genome assembly
» Sequence assembly based on statistical overlapping
NGS Data Analysis: Sequence
Statistics

• General characteristics of genome/ regions determined
• E.g. average contig length, %GC content, etc.
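Both statistics named above are simple to compute. A minimal sketch follows; the function names are illustrative and the input is assumed to contain only A, C, G and T.

```python
# Basic sequence statistics: %GC content and average contig length.
def gc_content(seq):
    """Percentage of bases that are G or C."""
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq.upper() if base in "GC")
    return 100.0 * gc / len(seq)


def average_contig_length(contigs):
    """Mean length of a list of contigs."""
    if not contigs:
        raise ValueError("no contigs")
    return sum(len(c) for c in contigs) / len(contigs)


contigs = ["ATCGATGCGTAGC", "TAGCAGACTACCGTT", "GTTACGATGCCTT"]
print(round(gc_content("".join(contigs)), 1), "% GC")
print(round(average_contig_length(contigs), 1), "bp average contig length")
```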
NGS Data Analysis: Coverage Statistics

Coverage of genome also important
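Average coverage is commonly estimated as (number of reads × read length) / genome size. The sketch below applies that formula; the 32 bp read length is an assumption chosen because it reproduces the deck's own figure of 750 million fragments for 8x coverage of a 3 Gb genome.

```python
# Average coverage from read count, read length and genome size.
def coverage(n_reads, read_length, genome_size):
    """Average number of times each base of the genome is sequenced."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, genome_size, read_length):
    """Smallest whole number of reads that reaches the target coverage."""
    total_bases = target_coverage * genome_size
    return -(-total_bases // read_length)  # integer ceiling division


GENOME = 3_000_000_000   # 3 Gb genome
READ_LENGTH = 32         # assumed short-read length in bp

print(reads_needed(8, GENOME, READ_LENGTH))
print(coverage(750_000_000, READ_LENGTH, GENOME))
```

This is an average only; real coverage varies along the genome, so some regions are covered more deeply than others.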


NGS Data Statistical Analysis: Quality
scores (Q Scores)

Base-call quality deteriorates with cycle number. Sequences can be selected on that basis using the Phred scale.
NGS Data Analysis Principles
Sequence Quality: Phred Scores
• Phred score = -10 log10(probability of error)
• Measures the quality of short fragment NGS sequences based on probability of error.
» Poor sequences are first removed before analysis.
– Phred value of 10: 10% error probability
– Phred value of 20: 1% error probability
– Phred value of 30: 0.1% error probability (one in 1,000)
» Phred value of 20 usually a standard measure of good quality
» Used as a cut off to determine good sequence quality
» Poor quality sequences discarded
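The Phred relationship Q = -10 log10(P) can be checked numerically. The sketch below converts in both directions and applies a Phred 20 cut-off; filtering on the mean quality of a read is one possible rule, assumed here for illustration, and the read data are made up.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def passes_cutoff(qualities, cutoff=20):
    """Keep a read if its mean per-base Phred score reaches the cut-off."""
    return sum(qualities) / len(qualities) >= cutoff


for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))

# Hypothetical reads: sequence plus per-base Phred scores.
reads = {
    "ATCGAT": [32, 30, 31, 28, 25, 22],
    "GATTAC": [18, 15, 12, 10, 8, 5],
}
kept = [seq for seq, quals in reads.items() if passes_cutoff(quals)]
print(kept)
```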
Mapping of Sequence Reads

• First processing step after sequencing:


– Read mapping (using a reference genome)
– Assembly (no reference sequence; specialized analyses)

• Quality of mapping determines downstream results


– Accessible genome
– Biases (ref vs. variant)
– Sensitivity (divergent reference; SNPs, indels, SV)
– Specificity (calibration of mapping quality)
Sequence Assembly &/or Read Mapping
• NGS data processing requires web based programming tools for quality
sequence selection, alignment, assembly, mapping & annotation.
• Good quality sequences first selected for assembly (using the Phred scale)
• Alignment with a reference sequence/ genome (web based programming)
• Annotation based on homology/alignments & statistical analysis of overlaps
Sequence Assembly &/or Mapping
• de novo assembly = from scratch
– Relies more on statistical overlaps between contigs
• Can sequence the entire genome of a microbe in a
single run
• Important for new species/strains
• Challenge of assembly with short reads
– Produces from a few bp up to about 250 bp good quality
compared to about 750 bp using Sanger method.
– 8x coverage of 3 GB genome = 750 million fragments
• Big problem with repeat sequences
– Assembly of contigs used to fill gaps
• Paired-end reads are essential
• NB: Assembly can also be based on a reference
sequence/genome instead of de novo
– Will require independent annotation using programming
Sequence Assembly Programs
• Phred - base calling program that does
detailed statistical analysis (UNIX)
» http://www.phrap.org/

• Phrap - Sequence assembly program (UNIX)


» http://www.phrap.org/

• TIGR Assembler - Microbial genomes (UNIX)


» http://www.tigr.org/softlab/assembler/

• The Staden Package – UNIX program


» http://www.mrc-lmb.cam.ac.uk/pubseq/

• GeneTool/ChromaTool/Sequencher
(PC/Mac)
Assembly: Read Length & Pairing
• Short reads are problematic, because short
sequences do not map uniquely to the genome.
– Solution #1:
» Get longer reads.
– Solution #2:
» Get paired-end reads.
• The term 'paired ends' refers to the two ends
of the same DNA molecule.
– Can sequence one end, then turn it around and sequence the
other end.
– The two sequences that result are 'paired end reads'.
» Sometimes they're called 'mate pairs'
Paired End Reads are Important!
[Figure: Read 1 and Read 2 separated by a known distance; one read falls in repetitive DNA and the other in unique DNA]

Paired read maps uniquely

Single read maps to multiple positions
ABI-SOLiD: Paired End Reads
• First commercially available in late 2007
• Currently capable of producing 20 GB of data per run (week)
• Most users generate 6 GB/run
• Reads ~30 bp long
• Uses a unique sequence-by-ligation method
• “Color-space” data coding
• Very low error rate
• Allows reading both the forward & reverse template
strands
• Both reads contain long range positional
information, allowing for highly precise
alignment of reads
• Relies on streptavidin, a protein which binds
very tightly to the small molecule, biotin.
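The “color-space” coding mentioned above can be illustrated with a short sketch of two-base encoding, in which each color describes a pair of adjacent bases rather than a single base. The sketch assumes the usual scheme where the bases are numbered A=0, C=1, G=2, T=3 and the color is the XOR of the two numbers; the function names are illustrative.

```python
# Two-base ("color-space") encoding sketch.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def to_color_space(seq):
    """Encode as the first base followed by one color (0-3) per base pair."""
    colors = [
        str(BASE_TO_CODE[a] ^ BASE_TO_CODE[b]) for a, b in zip(seq, seq[1:])
    ]
    return seq[0] + "".join(colors)


def from_color_space(encoded):
    """Decode by walking from the known first base through each color."""
    code = BASE_TO_CODE[encoded[0]]
    bases = [encoded[0]]
    for color in encoded[1:]:
        code ^= int(color)
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


encoded = to_color_space("ATGGC")
print(encoded)
print(from_color_space(encoded))
```

Because every base contributes to two neighbouring colors, a true variant changes two adjacent colors while an isolated measurement error changes only one, which helps explain the low error rate.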
Genome Browsers
• A genome browser is a graphical interface for
display of information from a biological database
for genomic data.
• Differ from ordinary biological databases in that
they display data in a graphical format
• Genome browsers enable researchers to
visualize and browse entire genomes.
• Genome data is annotated
• Includes gene prediction and structure, proteins,
expression, regulation, variation, comparative
analysis, etc.
Genome Browsers: Artemis
• Annotation platform used by Sanger Institute Pathogen group
• Can be used as a visualization tool
• Capable of reading EMBL/GenBank and GFF files
• Built in database entry fetching facility
• Designed for bacterial genomes
• Used for small eukaryotes
• Limited use for larger, heavily spliced genomes

http://www.sanger.ac.uk/Software/Artemis/v10/
Genome Browsers: GBrowse
• Developed by Lincoln Stein
(CSHL/Toronto)
• Very flexible
• Works with flat files, GFF
databases, CHADO RDBs
and via adaptors from
EMBL/GenBank files
• Users include PlasmoDB,
FlyBase and WormBase.

http://gmod.org/wiki/index.php/Gbrowse
Genome Browsers: Ensembl
• Bespoke browser from
the Ensembl group
• Many different display
pages for a variety of
data types
• Users include
Ensembl, VectorBase,
Gramene

http://www.ensembl.org/index.html
Genome Browsers: Apollo
• Annotation platform used by FlyBase and TAIR groups.
• Developed through Berkeley/EBI/GMOD.
• Can be used as a visualization tool
• Capable of reading CHADO-XML and GFF files
• Adaptors for CHADO RDB and Ensembl databases

http://gmod.org/wiki/index.php/Apollo
DNA sequenced;

DNA/ Genome assembled:

So what;

Is that good enough?


Other Issues

• Sequence deposition.
– Into appropriate biological databases.

• Annotation.
– Giving meaning to the sequences.
– Assigning function to the DNA/ genome
sequence.
Some computational problems
• De novo assembly

• Read mapping , SNP calling, quantification.

• Downstream association studies


Assembly as a software engineering
problem
• A single sequencing experiment can generate
100’s of millions of reads, 10’s to 100’s gigabytes
of data.

• Primary concerns are to minimize time and memory requirements.

• No guarantee on optimality of assembly quality and in fact no optimality criterion at all.
Computational complexity view
• Formulate the assembly problem as a combinatorial
optimization problem:
– Shortest common superstring (Kececioglu-Myers 95)
– Maximum likelihood (Medvedev-Brudno 09)
– Hamiltonian path on overlap graph (Nagarajan-Pop 09)

• Typically NP-hard and even hard to approximate.

• Does not address the question of when the solution reconstructs the ground truth.
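To make the shortest common superstring formulation concrete, here is a brute-force sketch that tries every ordering of the reads. It is only usable for a handful of reads, since the number of orderings grows factorially, and it assumes error-free reads with none contained in another. The names are illustrative.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(reads):
    """Concatenate reads in the given order, collapsing each overlap."""
    superstring = reads[0]
    for read in reads[1:]:
        superstring += read[overlap(superstring, read):]
    return superstring


def shortest_superstring(reads):
    """Try all orderings and keep the shortest merged string found."""
    return min((merge_in_order(p) for p in permutations(reads)), key=len)


print(shortest_superstring(["ATC", "TCG", "CGA"]))
```

Even when the shortest superstring is found, repeats longer than the reads can make it differ from the true genome, which is the gap the last bullet above points to.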
Information theoretic view
Basic question:

What is the quality and quantity of read data needed to reliably reconstruct?
Information theoretic approach
to assembly design
I. DNA assembly: a de novo DNA assembler from long, noisy reads
II. RNA assembly: a de novo RNA-Seq assembler from short reads
Challenges
• Noisy reads
• Long repeats

[Figures: Illumina read error profile; Human Chr 22 repeat length histogram, plotting log(# of ℓ-repeats) against repeat length ℓ]
Multiple sequence alignment
• Use flanking region as anchor to align reads close to boundary of approximate
repeats
• Average across reads to correct errors
• Bootstrap to extend further into the interior of repeat.

A C G T
Part II:

RNA Assembly
Central dogma of molecular
biology
DNA → (transcription) → RNA → (translation) → Protein

RNA transcripts and their abundances capture the state of a cell at a given time.
Alternative splicing

[Figure: a DNA region (1000’s to 10,000’s symbols long) made up of exons (ATC, GAT, CAT, TCG) separated by introns (AC, TGAA, AGC)]

RNA Transcript 1: ATC CAT TCG
RNA Transcript 2: GAT TCG

Alternative splicing yields different isoforms.
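The two isoforms on this slide can be reproduced by choosing which exons to join. A minimal sketch, using the exon sequences from the figure; the exon numbering and the function name are assumptions made for illustration.

```python
# Alternative splicing as exon selection.
EXONS = ["ATC", "GAT", "CAT", "TCG"]  # exons in genomic order


def splice(exons, keep):
    """Join the exons at the given indices, in genomic order."""
    return "".join(exons[i] for i in sorted(keep))


transcript_1 = splice(EXONS, [0, 2, 3])
transcript_2 = splice(EXONS, [1, 3])
print(transcript_1)
print(transcript_2)
```

Both transcripts end in the same exon (TCG), so a short read taken from that exon alone cannot be assigned to one isoform with certainty.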


Transcriptome

ATC CAT TCG 20 copies in cell

GAT TCG 30 copies in cell

• Different transcripts are present at different abundances.
• Transcriptome is the mixture of transcripts from all the genes.
• Human transcriptome has 10,000’s of transcripts from 20,000 genes.
RNA-Seq (Mortazavi et al, Nature Methods 08)

Reads
ATC CAT TCG

ATC CAT TCG TTC

GAT TCG
GAT
GAT TCG

GAT TCG TCG


Ambiguity due to inter-transcript repeats

s1 s3 s4 transcript 1

L-1

s1 s3 s5 transcript 2

L-1
Ambiguity due to inter-transcript repeats

s1 s3 s5 transcript 1

L-1

s1 s3 s4 transcript 2

L-1
Abundance diversity

lymphoblastoid cell line


Geuvadis dataset
Assembly algorithm architecture

1. Multi-bridging to resolve intra-transcript repeats → transcript graph
2. Min-cost network flow to estimate aggregate abundance at exons → abundance estimates
3. Sparsest decomposition to extract transcripts → transcriptome
Applications of DNA Sequencing
• Forensics: to help identify individuals because each individual has a different genetic sequence

• Medicine: can be used to help detect the genes which are linked to various genetic disorders such as muscular dystrophy.

• Agriculture: The mapping and sequencing of the genomes of microorganisms has helped to make them useful for crops and food plants.
• Advantages
• Improved diagnosis of disease
• Bio pesticides
• Identifying crime suspects

• Disadvantages
• Whole genome cannot be sequenced at once
• Very slow and time consuming
The Human Genome Project
• The biggest challenge for the life sciences

• 15-year project (NIH, DOE of USA)

• Primary goal: sequence the base pairs that make up human DNA

• Identifying & mapping approx. 20K-25K genes

• Significance: physical & functional standpoint
Thank You
