©Kurt Stüber
31.8.2016
Bioinformatics Lectures 02 Kurt Stueber
AAGCTTGAGCCgcttGAGA
2
AAGCT CGCTT GCTTG
2
AGCCG CTTGA TGAGA
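The decomposition of the sequence into k-mers can be sketched in a few lines. This is only an illustration, assuming k = 5 (the length of the node labels shown above); the function name `kmers` is made up for this example.

```python
def kmers(sequence, k):
    """Return all k-mers of the sequence in order (case is ignored)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


reads = kmers("AAGCTTGAGCCgcttGAGA", 5)

# Neighbouring k-mers share k-1 bases: the suffix of one is the prefix of the next.
for left, right in zip(reads, reads[1:]):
    assert left[1:] == right[:-1]

print(len(reads), "k-mers,", len(set(reads)), "distinct")
```

The sequence yields 15 k-mers of which 12 are distinct, because GCTTG, CTTGA and TTGAG each occur twice.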
Simplifications:
1. remove k-mers
CGCT CTTG
CCGC
Overlaps
GCCG TTGA
AGCC GAGC
TGAG
GAGA
Simplifications:
2. fuse unambiguous overlaps
CGCT CTTG
CCGC
Overlaps
GCCG TTGA
AGCC GAGC
TGAG
GAGA
Simplifications:
2. fuse unambiguous overlaps
AAGCT GCTT
CGCT CTTG
CCGC
GCCG TTGA
AGCC GAGC
TGAG
GAGA
Simplifications:
2. fuse unambiguous overlaps
AAGCT GCTT
CCGCT CTTG
GCCG TTGA
AGCC GAGC
TGAG
GAGA
Simplifications:
2. fuse unambiguous overlaps
AAGCT GCTT
GCCGCT CTTG
TTGA
AGCC GAGC
TGAG
GAGA
Simplifications:
2. fuse unambiguous overlaps
AAGCT GCTT
AGCCGCT CTTG
TTGA
GAGC
TGAG
GAGA
Simplifications:
2. fuse unambiguous overlaps
AAGCT GCTT
GAGCCGCT CTTG
TTGA
TGAG
GAGA
Simplifications:
2. fuse unambiguous overlaps
AAGCT GCTT
GAGCCGCT CTTGA
TGAG
GAGA
Simplifications:
3. multiple branches linking the same nodes can be replaced by
single weighted branches.
AAGCT GCTT
2
GAGCCGCT CTTGA
TGAG
GAGA
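Replacing parallel branches by one weighted branch amounts to counting how often each link occurs. The sketch below assumes k = 5 with the (k-1)-base overlaps as nodes; `weighted_edges` is a name chosen for this example, not something defined in the lecture.

```python
from collections import Counter


def weighted_edges(sequence, k):
    """Map each (prefix, suffix) link to the number of k-mers that produce it."""
    seq = sequence.upper()
    edges = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Every k-mer links its first k-1 bases to its last k-1 bases.
        edges[(kmer[:-1], kmer[1:])] += 1
    return edges


edges = weighted_edges("AAGCTTGAGCCgcttGAGA", 5)
for (source, target), weight in sorted(edges.items()):
    if weight > 1:
        print(source, "->", target, "weight", weight)
```

For this sequence only the links GCTT -> CTTG, CTTG -> TTGA and TTGA -> TGAG receive weight 2; all other links have weight 1.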
Simplifications:
2. fuse unambiguous overlaps
1
AAGCT GCTT
2
Start
5 6
GAGCCGCT CTTGA
4 7 3
Stop
8 TGAG
GAGA
Simplifications:
2. fuse unambiguous overlaps
AAGCT GCTTGA
Start
GAGCCGCT
Stop
TGAG
GAGA
Simplifications:
2. fuse unambiguous overlaps
1
AAGCT GCTTGAG
Start
3 2
4
GAGCCGCT
GAGA
Stop
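The fusing of unambiguous overlaps shown step by step above can be sketched as follows. It is a minimal illustration, assuming k = 5 with the 4-base overlaps as nodes: two nodes are fused when the first has exactly one successor and the second exactly one predecessor. The names `build_graph` and `fuse_unambiguous` are hypothetical.

```python
def build_graph(sequence, k):
    """Nodes are (k-1)-mers; every k-mer links its prefix to its suffix."""
    seq = sequence.upper()
    succ, pred = {}, {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        succ.setdefault(kmer[:-1], set()).add(kmer[1:])
        pred.setdefault(kmer[1:], set()).add(kmer[:-1])
    return succ, pred


def fuse_unambiguous(succ, pred):
    """Merge every unbranched chain of nodes into a single label."""
    nodes = set(succ) | set(pred)

    def continues_path(node):
        # True if the node has one predecessor that has no other successor.
        preds = pred.get(node, set())
        return len(preds) == 1 and len(succ[next(iter(preds))]) == 1

    fused = []
    for node in nodes:
        if continues_path(node):
            continue  # this node is absorbed into a chain that starts earlier
        label, current = node, node
        while len(succ.get(current, ())) == 1:
            (nxt,) = succ[current]
            if len(pred[nxt]) != 1:
                break  # several branches merge here, so stop fusing
            label += nxt[-1]
            current = nxt
        fused.append(label)
    # Note: a closed cycle without any branch point would not be reported.
    return sorted(fused)


succ, pred = build_graph("AAGCTTGAGCCgcttGAGA", 5)
print(fuse_unambiguous(succ, pred))
```

For the example sequence this yields the four labels AAGCT, GAGA, GAGCCGCT and GCTTGAG, the same nodes that remain in the fully simplified graph.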
GAGCCGCT
GAGCCGCT
=
AAGCTTGAGCCgcttGAGA
AAGCTTGAGCCgcttGAGACTTGCTTGAGTAA
GAGACTTGCT
GAGCCGCT
AAGCTTGAGCCgcttGAGACTTGCTTGAGTAA
GAGACTTGCT
2 3
1 6
AAGCT GCTTGAG GAGTAA
5 4
GAGCCGCT
GAGACTTGCT
2 3
1 6
AAGCT GCTTGAG GAGTAA
5 4
GAGCCGCT
=
AAGCTTGAGCCgcttGAGACTTGCTTGAGTAA
or
AAGCTTGAGACTTGCTTGAGCCGCTTGAGTAA
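The two possible reconstructions can be reproduced by walking the simplified graph so that every branch is used exactly once. The sketch below makes two assumptions: the branch list is transcribed by hand from the graph above, and neighbouring node labels overlap by 3 bases (read off the labels, e.g. AAGCT and GCTTGAG share GCT). The function name `eulerian_spellings` is made up for this example.

```python
def eulerian_spellings(edges, start, overlap):
    """Return every sequence spelled by a walk that uses each branch exactly once."""
    results = set()

    def walk(node, remaining, spelled):
        if not remaining:
            results.add(spelled)
            return
        for i, (source, target) in enumerate(remaining):
            if source == node:
                # Use this branch, then append the non-overlapping part of the target.
                walk(target, remaining[:i] + remaining[i + 1:], spelled + target[overlap:])

    walk(start, list(edges), start)
    return sorted(results)


# Branches of the simplified graph (hand-transcribed assumption).
edges = [
    ("AAGCT", "GCTTGAG"),
    ("GCTTGAG", "GAGACTTGCT"),
    ("GAGACTTGCT", "GCTTGAG"),
    ("GCTTGAG", "GAGCCGCT"),
    ("GAGCCGCT", "GCTTGAG"),
    ("GCTTGAG", "GAGTAA"),
]

for sequence in eulerian_spellings(edges, "AAGCT", overlap=3):
    print(sequence)
```

Exactly two sequences come out, differing only in the order in which the two loops through GCTTGAG are taken; the graph alone cannot decide between them.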
AAGCTTGAGCCgcttGAGA
AAGCTTGAGC
AGCTTGAGCC
GCTTGAGCCG
CTTGAGCCGC
TTGAGCCGCT
TGAGCCGCTT
GAGCCGCTTG
AGCCGCTTGA
GCCGCTTGAG
CCGCTTGAGA
Now all oligomers are unique, since they are longer than the
repeats, and the reconstruction of the original sequence is
straightforward.
CCGCTTGAG CGCTTGAGA
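Why the reconstruction becomes straightforward can be seen in a small sketch: when every overlap is unique, each k-mer has exactly one possible successor, so the k-mers can simply be chained. The example assumes error-free 10-mers of the sequence above in arbitrary order; `reconstruct` is a hypothetical helper name.

```python
import random


def reconstruct(kmers):
    """Chain k-mers by their (k-1)-base overlaps; requires all overlaps to be unique."""
    by_prefix = {kmer[:-1]: kmer for kmer in kmers}
    suffixes = {kmer[1:] for kmer in kmers}
    # The first k-mer is the only one whose prefix is not the suffix of another k-mer.
    (current,) = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    sequence = current
    while current[1:] in by_prefix:
        current = by_prefix[current[1:]]
        sequence += current[-1]
    return sequence


sequence = "AAGCTTGAGCCgcttGAGA".upper()
kmers = [sequence[i:i + 10] for i in range(len(sequence) - 9)]
random.Random(0).shuffle(kmers)  # the order of the k-mers carries no information

print(reconstruct(kmers) == sequence)
```

With 5-mers the same approach would fail, because the repeated overlaps leave more than one candidate successor.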
Error correction:
The reads produced during sequencing often contain errors, i.e.
bases that have been called incorrectly.
TTGAGCCGCA
TGAGCCGCAG
These errors often show up as short dead-end side branches of the
de Bruijn graph and can be cut off.
TTGAGCCGCA
TGAGCCGCAG
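Cutting off such a dead-end branch can be sketched as follows. This is a simplified illustration, not a complete error-correction method: nodes are the 10-mers, a branch is removed only if it ends in a dead end and is at most `max_tip_len` nodes long, and both `remove_tips` and `max_tip_len` are names invented for this example. Practical assemblers usually also take the coverage of a branch into account.

```python
def kmers_of(sequence, k):
    """Return all k-mers of the sequence in order (case is ignored)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def remove_tips(kmers, max_tip_len):
    """Drop short dead-end side branches from a graph whose nodes are k-mers."""
    nodes = set(kmers)

    def successors(node):
        return [node[1:] + b for b in "ACGT" if node[1:] + b in nodes]

    def predecessors(node):
        return [b + node[:-1] for b in "ACGT" if b + node[:-1] in nodes]

    dead_ends = [node for node in nodes if not successors(node)]
    for end in dead_ends:
        tip = [end]
        while True:
            preds = predecessors(tip[-1])
            if len(preds) != 1:
                break  # start of the graph or a merge point: leave it alone
            if len(successors(preds[0])) > 1:
                # Reached the fork; cut the branch only if it is short.
                if len(tip) <= max_tip_len:
                    nodes -= set(tip)
                break
            tip.append(preds[0])
    return nodes


correct = kmers_of("AAGCTTGAGCCgcttGAGA", 10)
erroneous = ["TTGAGCCGCA", "TGAGCCGCAG"]  # the two faulty 10-mers shown above

cleaned = remove_tips(correct + erroneous, max_tip_len=3)
print(cleaned == set(correct))
```

The true end of the sequence is also a dead end, but the path leading to it from the fork is six nodes long, so the length limit keeps it in place.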
Summary
Using longer k-mers helps with the assembly, but the k-mer length is
limited by the read length the sequencing technology can deliver.
The graph also helps to find and eliminate errors made during
sequencing.
Exercise 1
Complete the non-simplified de Bruijn graph of the sequence given.
The k-mer length is 4 bases with 3 bases overlap.
1   5    10
AGGTCGTCGAGG
AGG
AGG
TGGCTAAT