
Fundamentals of Genome Assembly

1
Genome
• Each nucleotide in DNA has a fixed position.
• There are coding and non-coding regions.
• There are regions in DNA with repeated sequences of nucleotides.
• Almost 45% of the human genome sequence consists of repeats.
• The Alu sequence is approx. 300 bp long and is repeated many times across the entire genome.

2
Assembly Challenges
• Repetitive regions
• Segmental duplications
• Sequencing errors
• Gaps.

3
Reads
• DNA is fragmented randomly and amplified
• The overall ordering of the fragments is unknown
• It's difficult to know where in the genome the reads come from
We don't know where the pieces (reads) come from

5
We need to computationally infer where in the
genome each read has come from.

6
• The de Bruijn graph does not use the actual sequence reads for assembly, but breaks each sequence read
down to smaller sequences called k-mers. These k-mers are aligned using (k−1) sequence overlaps. The
actual size of k depends on sequence coverage, read length, etc., but usually is not less than half of the
actual read length. For example, a 106-base read can be divided into 49 overlapping 58-mers (sequence
read length−k-mer length+1=# of k-mers; hence, 106−58+1=49). Because breaking one sequence read
into k-mers increases the number of short sequence reads (e.g. just one 106-base read generates 49 k-mers,
each one 58 bases long), it is likely that the resulting k-mers generated from all sequence reads will
represent nearly all k-mers from the genome for sufficiently small k. This process seemingly compensates
for missing sequence reads, that is, the sequence reads that could not be generated through sequencing
for a variety of technical reasons [5]. Therefore, computational application of the de Bruijn graph helps
alleviate many problems of de novo sequence assembly, but it is still not a foolproof process.
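
The k-mer arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `kmers` and the 106-base read (built by repeating an arbitrary motif) are assumptions made for illustration, not part of any assembler.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return all overlapping k-mers of a read, in left-to-right order."""
    if k <= 0 or k > len(read):
        return []
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Hypothetical 106-base read, built by repeating a short motif.
read = ("ACGTTGCA" * 14)[:106]
pieces = kmers(read, 58)

print(len(pieces))  # 106 - 58 + 1 = 49
# Neighbouring k-mers share k - 1 = 57 bases.
print(pieces[0][1:] == pieces[1][:-1])  # True
```

Each k-mer starts one base after the previous one, which is why a read of length L yields L − k + 1 of them.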

7
Basic Assembly Principle
• Reads from same region in genome must have
similarity in sequence
GCATCTAGGGTTCAG
AGCTGCATCTAGGGT

• Reads where the suffix of one matches the prefix of another give some idea about the genome sequence
GCATCTAGGGTTCAG
AGCTGCATCTAGGGT
GTCTGGCCAGCTGCATCTAGGGTTCAGGACTG
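
The suffix/prefix idea can be sketched in Python using the two reads from this slide. The helper names `overlap` and `merge` and the `min_len` threshold are assumptions made for illustration; real assemblers also tolerate mismatches, which this exact-match sketch does not.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no exact overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def merge(a: str, b: str, min_len: int = 3):
    """Join a and b across their overlap, or return None if they do not overlap."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None


left = "AGCTGCATCTAGGGT"
right = "GCATCTAGGGTTCAG"
genome = "GTCTGGCCAGCTGCATCTAGGGTTCAGGACTG"

print(overlap(left, right))        # 11
print(merge(left, right))          # AGCTGCATCTAGGGTTCAG
print(merge(left, right) in genome)  # True
```

The merged string is longer than either read and is a true substring of the genome shown on the slide.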

8
Coverage
• Average number of reads covering a particular
position in the genome. (Why did we explode a stack of newspapers?)

9
• Coverage = (no. of reads × read length) / reference or
final genome size
• Good coverage spans the genome entirely and gives
confidence that a read belongs to a particular
place in the genome.
• Typically, 50x coverage gives confidence in the
assembly and makes it possible to walk across the genome
• The coverage needed depends on the size of the genome, how
repetitive it is and the sequencing technique (why should
nanopore reads have more coverage?)
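
As a quick numerical check of the coverage formula, here is a minimal Python sketch. The read count, read length and genome size are hypothetical values chosen only so the arithmetic comes out at 50x; the function names are likewise illustrative.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical experiment: 1,000,000 reads of 150 bp on a 3 Mb genome.
print(coverage(1_000_000, 150, 3_000_000))  # 50.0
print(reads_needed(50, 150, 3_000_000))     # 1000000
```

Note that this is an average; individual positions can still be covered far less often.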

10
Assembly Approaches
• Long Reads (PacBio, Nanopore)
- 100 – 150 kb
- high error rate (5 – 15%)
- Key computational challenge: overcoming the higher error rate
• Short Reads (Illumina)
- high throughput, short reads
- short read length limits the ability to resolve repeats
- Key computational challenge: efficiently assembling
a large number of short reads
https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Short-read_sequence_alignment
11
Long Read Assembly Pipeline
• Reads – the input
• Overlap – build the overlap graph
• Layout – bundle stretches of overlap into contigs
• Consensus – pick the most likely nucleotide
for each position in the contigs
• Contigs – generate the longest sequences from the
overlapping reads.

12
Overlap Layout Consensus Assembler
(OLC)
• Edena is an OLC-based assembler
• OLC was developed to assemble reads from Sanger
sequencing
• Works well with Nanopore and PacBio reads

13
Overlaps
• Reads having significant overlapping sequence are paired
• A pair is selected if the suffix of one read matches the prefix of another
• Read pairs must have significant overlapping sequences (50% or more)
• Some mismatches, insertions and/or deletions may be present in read pairs
14
Overlap Graph
The overlap graph is constructed by drawing an edge between
every pair of reads that have a significant overlap
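
A minimal Python sketch of overlap graph construction follows. It assumes exact suffix/prefix matches only (no mismatches or indels) and compares every ordered pair of reads, which is far too slow for real data; the three 10-base reads, the function names and the 5-base threshold (50% of the read length) are illustrative assumptions.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int) -> int:
    """Longest exact suffix of a matching a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads: list[str], min_len: int) -> dict[tuple[str, str], int]:
    """Directed graph as {(read_a, read_b): overlap_length}."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            graph[(a, b)] = n
    return graph


# Hypothetical 10-base reads; min_len=5 requires at least 50% overlap.
reads = ["AGCTGCATCT", "GCATCTAGGG", "CTAGGGTTCA"]
for (a, b), n in overlap_graph(reads, min_len=5).items():
    print(a, "->", b, n)
# AGCTGCATCT -> GCATCTAGGG 6
# GCATCTAGGG -> CTAGGGTTCA 6
```

The edge direction records which read comes first, so following edges walks along the genome.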

15
Redundant edges are removed from the graph. This
approach converts a complex overlap graph into
simple and meaningful long stretches of sequence
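
One simple way to picture redundant-edge removal: if read r1 overlaps r2, r2 overlaps r3, and r1 also overlaps r3, the direct r1 → r3 edge adds no information. The sketch below removes only such two-step shortcuts and assumes the graph has no cycles; the read labels and function name are hypothetical, and production assemblers use more careful reductions that also consider overlap lengths.

```python
def remove_redundant_edges(edges: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Drop every edge (a, c) for which a two-step path a -> b -> c exists."""
    successors: dict[str, set[str]] = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    redundant = {
        (src, far)
        for src, mid in edges
        for far in successors.get(mid, ())
        if (src, far) in edges
    }
    return edges - redundant


# Hypothetical read labels; r1 -> r3 is implied by r1 -> r2 -> r3.
edges = {("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("r3", "r4")}
print(sorted(remove_redundant_edges(edges)))
# [('r1', 'r2'), ('r2', 'r3'), ('r3', 'r4')]
```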

16
Unambiguous segments, i.e. the unbranched parts of the
graph, are bundled together into contigs

Contig 1 Unresolved area Contig 2

17
Consensus
• Read errors are reflected in the contigs
• All the reads forming a contig are aligned
• Each base is compared
• If a base is misrepresented, or a
substitution/deletion is observed (an error in
sequencing), the correction is made based on
frequency of appearance.
• A consensus contig is generated
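
The frequency-based correction can be sketched as a column-wise majority vote. The sketch assumes the reads are already aligned to the same coordinates, with "." marking positions a read does not cover and "-" marking a deletion; these conventions, the function name and the three reads are illustrative assumptions rather than any tool's format.

```python
from collections import Counter


def consensus(aligned_reads: list[str]) -> str:
    """Majority vote per column over pre-aligned, equal-length rows."""
    result = []
    for column in zip(*aligned_reads):
        votes = Counter(base for base in column if base != ".")
        if not votes:
            continue  # no read covers this position
        winner = votes.most_common(1)[0][0]
        if winner != "-":  # a winning gap means the base is dropped
            result.append(winner)
    return "".join(result)


# Hypothetical alignment; the second read carries one substitution (A -> C).
aligned = [
    "GCATCTAGGGTT....",
    "..ATCTCGGGTTCA..",
    "....CTAGGGTTCAGG",
]
print(consensus(aligned))  # GCATCTAGGGTTCAGG
```

The erroneous C is outvoted two to one, which is why higher coverage makes the consensus more reliable.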

18
19
Short Read Assembly Pipeline
• Error Correction
• Graph Construction
• Graph Cleaning
• Contig Assembly
• Scaffolding
• Gap Filling

20
Error Correction
• Consider the read
GTATCGACGTGGAGATACTTGGTATC
• Count the number of times each k-mer in the
read is present across all reads
count (GATACTTGATATC) = 40
count (GATACTTGGTATC) = 1
• Correct sequence – GATACTTGATATC
• k-mer-based correctors include Quake,
SGA, BFC, etc.
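
The counting idea can be sketched in Python using the two 13-mers from this slide. This is a simplified illustration and not how Quake, SGA or BFC actually work: it assumes a single substitution error, and the `min_count` threshold, the function names and the simulated read set (40 copies of one k-mer, 1 of the other) are assumptions chosen to reproduce the counts shown above.

```python
from collections import Counter


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_kmer(kmer: str, counts: Counter, min_count: int = 2) -> str:
    """Replace a rare k-mer by its most frequent one-substitution neighbour."""
    if counts[kmer] >= min_count:
        return kmer  # seen often enough to be trusted
    best = kmer
    for i, original in enumerate(kmer):
        for base in "ACGT":
            if base == original:
                continue
            candidate = kmer[:i] + base + kmer[i + 1:]
            if counts[candidate] > counts[best]:
                best = candidate
    return best


# Simulated data reproducing the counts on the slide.
reads = ["GATACTTGATATC"] * 40 + ["GATACTTGGTATC"]
counts = count_kmers(reads, k=13)

print(counts["GATACTTGATATC"], counts["GATACTTGGTATC"])  # 40 1
print(correct_kmer("GATACTTGGTATC", counts))              # GATACTTGATATC
```

A k-mer seen once among 40 near-identical copies is much more likely to be a sequencing error than a real variant.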

21
de Bruijn graph construction
• Does not use the actual sequence reads; uses smaller sequences
called k-mers obtained from the reads
• Size of k depends on sequence coverage and read length (usually not
less than half of the read length)
• No. of k-mers = sequence read length − k-mer length + 1
• k-mers are aligned using (k−1) sequence overlaps.
• Generating k-mers increases the number of short sequences. For example, a
106-base read can be divided into 49 overlapping 58-mers
• The k-mers generated from all sequence reads represent nearly all
k-mers from the genome, which helps compensate for missing reads.

22
• de Bruijn graph assemblers break reads into k-mers
and link adjacent k-mers with edges
• Every Eulerian path in the de Bruijn graph thus
constructed spells out a string, i.e. some stretch of
the genome
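
A small Python sketch of this idea, using the two reads from the Basic Assembly Principle slide. It assumes the common formulation in which each k-mer is an edge between its (k−1)-base prefix and suffix, that every distinct k-mer is used once, and that an Eulerian path exists; the function names and k = 5 are illustrative choices. When repeats create branches, more than one Eulerian path may exist, so the spelled string is not always unique.

```python
from collections import defaultdict


def distinct_kmers(reads: list[str], k: int) -> list[str]:
    """All distinct k-mers from the reads, in first-seen order."""
    seen = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            seen[read[i:i + k]] = None
    return list(seen)


def de_bruijn(kmers: list[str]) -> dict[str, list[str]]:
    """Edge per k-mer: (k-1)-mer prefix -> (k-1)-mer suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, targets in graph.items():
        balance[node] += len(targets)
        for target in targets:
            balance[target] -= 1
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))

    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path: list[str]) -> str:
    """First node plus the last base of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["AGCTGCATCTAGGGT", "GCATCTAGGGTTCAG"]
graph = de_bruijn(distinct_kmers(reads, k=5))
print(spell(eulerian_path(graph)))  # AGCTGCATCTAGGGTTCAG
```

Here every 4-base node is unique, so the graph is a single unbranched path and the result matches the merged reads.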

23
Graph Cleaning
• There can be spurs in the graph arising from reads with sequencing
errors, which are random and appear towards the end of the graph
• Bubbles may appear if a diploid sample is used. The assembler may
select either branch or take the one with the highest coverage
24
Limitations
• de Bruijn graph-based assemblers, like OLC tools, give up on unresolvable
repeats and yield fragmented assemblies.
• The sequencing process may not generate all reads representing the genome
in spite of high coverage. Breaking reads into k-mers addresses this problem
but leads to a more tangled de Bruijn graph,
making it difficult to infer the genome from this graph.
• Gaps in k-mer coverage lead to a de Bruijn graph with missing edges, and
so the search for an Eulerian path fails.
• The assembler outputs contigs (long, contiguous segments of the
genome) rather than entire chromosomes.
• For most genomes, the order of these contigs along the genome remains
unknown

25
Graph cleaning

Spurs are identified and removed

One of the two alleles is selected to
represent the genome
26
Contig Assembly

The cleaned graph
represents the contig
assembly, with some
unresolved areas

27
Scaffolding
Paired-end read information
helps in assembling the
contigs into scaffolds.

Scaffolds contain gaps represented as NNN……
Each vertex is a contig and
each edge is a read pair
connecting a pair of contigs
Unresolved areas are either
repetitive sequences or areas
where the coverage is low
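
The N-filled gaps can be illustrated with a short Python sketch. It assumes the contig order and the gap sizes have already been estimated from read pairs, which is the hard part and is not shown; the contigs, gap sizes and function name are hypothetical.

```python
def build_scaffold(contigs: list[str], gap_sizes: list[int]) -> str:
    """Join ordered contigs, writing each estimated gap as a run of N."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)
        parts.append(contig)
    return "".join(parts)


# Hypothetical contigs with gaps of 5 and 3 bases between them.
contigs = ["AGCTGCATCT", "GGGTTCAG", "GACTG"]
gap_sizes = [5, 3]
print(build_scaffold(contigs, gap_sizes))
# AGCTGCATCTNNNNNGGGTTCAGNNNGACTG
```

The scaffold preserves the distance between contigs even though the bases inside each gap remain unknown.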
28
Gap Filling

• Can use local assembly to fill gaps
• Programs such as sga gapfill and GapCloser
• Can fill gaps using data from another sequencer

29
https://www.youtube.com/watch?v=5wvGapmA5zM&t=1217s
30
