Assembly
1
Genome
• Each nucleotide in DNA has a fixed position.
• There are coding and non-coding regions.
• Some regions of DNA contain repeated
sequences of nucleotides.
• Almost 45% of the human genome sequence
consists of repeats.
• The Alu sequence is approx. 300 bp long and is
repeated many times throughout the genome.
2
Assembly Challenges
• Repetitive regions
• Segmental duplications
• Sequencing errors
• Gaps.
3
Reads
• DNA is fragmented randomly and amplified
• The total ordering of the fragments is unknown
• It is difficult to know where in the genome each fragment originated
We don't know where the pieces (reads) come from
5
We need to computationally infer where in the
genome each read has come from.
6
• The de Bruijn graph does not use the actual sequence reads for assembly; instead, each read is broken
down into smaller sequences called k-mers. Consecutive k-mers are linked through their (k−1)-base
overlaps. The actual size of k depends on sequence coverage, read length, etc., but is usually not less than
half of the read length. For example, a 106-base read can be divided into 49 overlapping 58-mers (number
of k-mers = read length − k + 1; hence, 106 − 58 + 1 = 49). Because breaking one read into k-mers increases
the number of short sequences (just one 106-base read generates 49 k-mers, each 58 bases long), the
k-mers generated from all reads are likely to represent nearly all k-mers in the genome for sufficiently
small k. This partly compensates for missing reads, that is, reads that could not be generated through
sequencing for a variety of technical reasons [5]. The de Bruijn graph therefore helps alleviate many
problems of de novo sequence assembly, but it is still not a foolproof process.
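The k-mer arithmetic above can be checked with a small sketch. The function name `kmers` and the 106-base read (built by repeating a short motif) are illustrative assumptions, not part of the slides.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every overlapping k-mer of `read`, from left to right."""
    if k <= 0 or k > len(read):
        return []
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Hypothetical 106-base read, made by repeating a short motif.
read = ("ACGTTGCA" * 14)[:106]
pieces = kmers(read, 58)

print(len(read), len(pieces))           # 106 49
# Neighbouring k-mers share k-1 = 57 bases.
print(pieces[0][1:] == pieces[1][:-1])  # True
```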
7
Basic Assembly Principle
• Reads from the same region of the genome must
have similar sequences
GCATCTAGGGTTCAG
AGCTGCATCTAGGGT
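A minimal sketch of this principle, using the two reads shown above: find the longest suffix of one read that equals a prefix of the other, then merge them. The function name `overlap` and the `min_len` parameter are assumptions for illustration; real assemblers also tolerate mismatches.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


left, right = "AGCTGCATCTAGGGT", "GCATCTAGGGTTCAG"
n = overlap(left, right)
merged = left + right[n:]   # keep the shared part only once

print(n)       # 11
print(merged)  # AGCTGCATCTAGGGTTCAG
```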
8
Coverage
• Average number of reads covering a particular
position in the genome. (Why did we explode a stack of newspapers?)
9
• Coverage = (no. of reads × read length) / reference or
final genome size
• Good coverage spans the genome entirely and gives
confidence that a read belongs to a particular
place in the genome.
• Typically, 50x coverage gives confidence in the
assembly and makes it possible to walk across the genome
• The coverage needed depends on the size of the genome, how
repetitive it is, and the sequencing technique (why should
nanopore reads have more coverage?)
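The coverage formula can be tried out as follows. The genome size (5 Mb) and read length (150 bp) are made-up example values, and `reads_needed` is a hypothetical helper that simply inverts the formula.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage = (number of reads x read length) / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target * genome_size / read_length)


# Assumed example: 5 Mb genome sequenced with 150 bp reads.
print(coverage(1_000_000, 150, 5_000_000))  # 30.0
print(reads_needed(50, 150, 5_000_000))     # 1666667
```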
10
Assembly Approaches
• Long Reads (PacBio, Nanopore)
- 100 – 150 kb
- high error rate (5 – 15%)
- Key computational challenge: overcoming the higher error rate
• Short Reads (Illumina)
- high throughput, short reads
- short read length limits the ability to resolve repeats
- Key computational challenge: efficiently assembling
a very large number of short reads
https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-read_sequence_alignment
11
Long Read Assembly Pipeline
• Reads – taken as input
• Overlap – build the overlap graph
• Layout – bundle stretches of overlap into contigs
• Consensus – pick the most likely nucleotide
for each position in the contigs
• Contigs – generate the longest possible sequences from the
overlapping reads.
12
Overlap Layout Consensus (OLC)
Assembler
• Edena is an OLC-based assembler
• The OLC approach was developed to assemble reads from Sanger
sequencing
• Works well with Nanopore and PacBio reads
13
Overlaps
• Reads that share a significant
overlapping sequence are
paired
• A pair is selected if the suffix
of one read matches
the prefix of another
• Read pairs must have
significant overlapping
sequences (50% or more)
• Some mismatches,
insertions and/or deletions may
be present between the reads of a pair
14
Overlap Graph
The overlap graph is constructed by connecting
reads that have a significant overlap
through edges
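As a rough sketch of this step, the code below compares every ordered pair of reads and adds an edge when the suffix of one read matches the prefix of the other over at least `min_len` bases. The three reads and all function names are hypothetical; exact matching is assumed for simplicity, whereas real overlappers allow mismatches and indels.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads: list[str], min_len: int) -> dict[str, dict[str, int]]:
    """Map each read to the reads it overlaps, with the overlap length."""
    graph: dict[str, dict[str, int]] = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            graph[a][b] = n
    return graph


# Hypothetical reads; min_len=5 is at least half of the shortest read.
reads = ["AGCTGCATCT", "GCATCTAGGG", "CTAGGGTTCAG"]
for read, targets in overlap_graph(reads, min_len=5).items():
    print(read, "->", targets)
```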
15
Redundant (transitive) edges are removed from the graph. This
converts a complex overlap graph into a
simple and meaningful long stretch of sequence
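One simplified way to picture this: an edge r1 -> r3 is redundant when r1 -> r2 and r2 -> r3 already exist, because the longer route carries the same information. The sketch below assumes a small acyclic graph with made-up read labels and overlap lengths; it is not the reduction procedure of any particular assembler.

```python
def remove_transitive_edges(
    graph: dict[str, dict[str, int]]
) -> dict[str, dict[str, int]]:
    """Drop every edge a -> c for which some b gives a -> b and b -> c."""
    reduced = {node: dict(targets) for node, targets in graph.items()}
    for a, targets in graph.items():
        for b in targets:
            for c in graph.get(b, {}):
                if c in targets:
                    reduced[a].pop(c, None)
    return reduced


# Hypothetical overlap graph: values are overlap lengths in bases.
graph = {
    "r1": {"r2": 9, "r3": 6},  # r1 -> r3 is implied by r1 -> r2 -> r3
    "r2": {"r3": 9},
    "r3": {},
}
print(remove_transitive_edges(graph))
# {'r1': {'r2': 9}, 'r2': {'r3': 9}, 'r3': {}}
```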
16
Unambiguous, unbranched segments of the
graph are bundled together into contigs
17
Consensus
• Read errors are reflected in the contigs
• All the reads forming a contig are aligned
• Each base is compared
• If a base is misrepresented, or a
substitution/deletion is observed (an error in
sequencing), the correction is made based on
frequency of appearance.
• A consensus contig is generated
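A minimal sketch of frequency-based consensus, assuming the reads have already been aligned into rows of equal length. The conventions here are assumptions for illustration: `.` marks positions a read does not cover and `-` marks a deleted base. The third row carries one made-up sequencing error.

```python
from collections import Counter


def consensus(aligned_rows: list[str]) -> str:
    """Majority vote per column; '.' = not covered, '-' = deletion."""
    length = max(len(row) for row in aligned_rows)
    result = []
    for i in range(length):
        column = [row[i] for row in aligned_rows if i < len(row) and row[i] != "."]
        if not column:
            continue  # no read covers this position
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":
            result.append(base)
    return "".join(result)


rows = [
    "AGCTGCATCTAGGGT....",
    "....GCATCTAGGGTTCAG",
    "..CTGCATGTAGGGTTC..",  # hypothetical error: G instead of C in column 8
]
print(consensus(rows))  # AGCTGCATCTAGGGTTCAG
```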
18
19
Short Read Assembly Pipeline
• Error Correction
• Graph Construction
• Graph Cleaning
• Contig Assembly
• Scaffolding
• Gap Filling
20
Error Correction
• Consider the read
GTATCGACGTGGAGATACTTGGTATC
• Count the number of times each k-mer of the
read is present in all reads
count (GATACTTGATATC) = 40
count (GATACTTGGTATC) = 1
• Correct sequence – GATACTTGATATC
• k-mer based correctors include Quake,
SGA, BFC, etc.
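The idea can be sketched as follows: count k-mers across all reads, and replace a rare k-mer with a frequent one that differs by a single base. The counts are taken from the example above; the function names and the `min_count` threshold are assumptions, and tools such as Quake, SGA and BFC use considerably more elaborate models.

```python
from collections import Counter


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_kmer(kmer: str, counts: Counter, min_count: int = 3) -> str:
    """Replace a rare k-mer by its most frequent one-substitution neighbour."""
    if counts[kmer] >= min_count:
        return kmer  # already trusted
    best, best_count = kmer, counts[kmer]
    for i, original in enumerate(kmer):
        for base in "ACGT":
            if base == original:
                continue
            candidate = kmer[:i] + base + kmer[i + 1:]
            if counts[candidate] > best_count:
                best, best_count = candidate, counts[candidate]
    return best if best_count >= min_count else kmer


# Counts from the slide example, entered directly for brevity.
counts = Counter({"GATACTTGATATC": 40, "GATACTTGGTATC": 1})
print(correct_kmer("GATACTTGGTATC", counts))  # GATACTTGATATC
```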
21
de Bruijn graph construction
• Does not use the actual sequence reads; uses smaller sequences
called k-mers obtained from the reads
• Size of k depends on sequence coverage and read length (not less than
half of the read length)
• No. of k-mers = sequence read length − k-mer length + 1
• k-mers are joined using (k−1)-base sequence overlaps.
• Generating k-mers increases the number of short sequences. For example, a
106-base read can be divided into 49 overlapping 58-mers
• The k-mers generated from all sequence reads represent nearly all
k-mers in the genome, which increases the chance that regions with
missing reads are also represented.
22
• de Bruijn graph assemblers break reads into k-mers
and link adjacent k-mers with edges
• Every Eulerian path in the de Bruijn graph thus
constructed spells out a string, i.e. some stretch of the
genome
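A compact sketch of both steps, under stated assumptions: nodes are (k−1)-mers, each distinct k-mer contributes one edge, and the path is found with Hierholzer's algorithm. The two reads are the ones from the earlier overlap slide; k = 5 is chosen arbitrarily for this toy case, and repeats and sequencing errors are ignored.

```python
from collections import defaultdict


def de_bruijn(reads: list[str], k: int) -> dict[str, list[str]]:
    """One edge per distinct k-mer, from its (k-1)-prefix to its (k-1)-suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph: dict[str, list[str]] = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    if not graph:
        return []
    in_deg: dict[str, int] = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {n: list(t) for n, t in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path: list[str]) -> str:
    """First node plus the last base of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["AGCTGCATCTAGGGT", "GCATCTAGGGTTCAG"]
print(spell(eulerian_path(de_bruijn(reads, k=5))))  # AGCTGCATCTAGGGTTCAG
```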
23
Graph Cleaning
• There can be spurs in the
graph arising from
reads with sequencing
errors, which are random
and appear towards the ends
of the graph
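A simplified sketch of spur (tip) removal: walk back from each dead-end node, and if a branching node is reached within `max_len` steps, delete the short side branch. The node labels and the threshold are hypothetical, and the sketch assumes the genuine path past each branch point is longer than `max_len`; real assemblers also weigh k-mer coverage before clipping.

```python
def remove_tips(graph: dict[str, set[str]], max_len: int) -> dict[str, set[str]]:
    """Drop dead-end side branches made of at most `max_len` nodes."""
    cleaned = {node: set(succs) for node, succs in graph.items()}
    for succs in graph.values():  # make sure every node is a key
        for s in succs:
            cleaned.setdefault(s, set())
    preds: dict[str, set[str]] = {node: set() for node in cleaned}
    for node, succs in cleaned.items():
        for s in succs:
            preds[s].add(node)

    dead_ends = [n for n, succs in cleaned.items() if not succs]
    for end in dead_ends:
        chain, node = [end], end
        while len(chain) <= max_len and len(preds[node]) == 1:
            parent = next(iter(preds[node]))
            if len(cleaned[parent]) > 1:
                # The parent branches, so `chain` is a short spur: clip it.
                cleaned[parent].discard(node)
                for n in chain:
                    del cleaned[n]
                break
            chain.append(parent)
            node = parent
    return cleaned


# Hypothetical graph: main path A..G with a one-node spur X hanging off C.
graph = {"A": {"B"}, "B": {"C"}, "C": {"D", "X"}, "D": {"E"}, "E": {"F"}, "F": {"G"}}
cleaned = remove_tips(graph, max_len=2)
print(sorted(cleaned))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```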
25
Graph cleaning
The cleaner graph
represents the contig
assembly, with some
unresolved areas
27
Scaffolding
Paired-end read information
helps in assembling the
contigs into scaffolds.
29
https://www.youtube.com/watch?v=5wvGapmA5zM&t=1217s
30