
Bioinformatics Lectures 02

DNA de Bruijn Graph

©Kurt Stüber
31.8.2016
Bioinformatics Lectures 02 Kurt Stueber

A given DNA sequence contains the repeat GCTTGAG twice:

AAGCTTGAGCCgcttGAGA

The DNA sequence is dissected into oligomers of length 5:
AAGCTTGAGCCgcttGAGA
AAGCT GCCGC
AGCTT CCGCT
GCTTG CGCTT
CTTGA GCTTG
TTGAG CTTGA
TGAGC TTGAG
GAGCC TGAGA
AGCCG

There are 12 different oligomers, and 3 of them are found twice (marked "2"):

AAGCT    CGCTT      GCTTG  2
AGCCG    CTTGA  2   TGAGA
AGCTT    GAGCC      TGAGC
CCGCT    GCCGC      TTGAG  2
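
The dissection and the counts above can be checked with a few lines of Python. This is only a minimal sketch; the helper name `kmers` is chosen here for illustration and does not come from the lecture.

```python
from collections import Counter

def kmers(sequence, k):
    """Return all overlapping k-mers of the sequence, left to right."""
    sequence = sequence.upper()  # the slide mixes upper and lower case
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequence = "AAGCTTGAGCCgcttGAGA"
counts = Counter(kmers(sequence, 5))

print(len(counts))  # 12 distinct 5-mers
print(sorted(kmer for kmer, n in counts.items() if n == 2))
# ['CTTGA', 'GCTTG', 'TTGAG']
```

A sequence of length 19 yields 19 - 5 + 1 = 15 oligomers, of which 12 are distinct.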

The oligomers can be linked by 4 base overlaps, resulting in a looped graph, the de Bruijn graph:

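[Figure: de Bruijn graph of the sequence. The nodes are the 4 base overlaps (AAGC, AGCT, GCTT, CTTG, TTGA, TGAG, GAGC, AGCC, GCCG, CCGC, CGCT, GAGA). Each branch is labelled with the k-mer that joins two overlaps, e.g. AAGCT joins AAGC and AGCT. The branches labelled CTTGA and TTGAG are drawn twice.]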
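
The linking step can be sketched in Python as follows. Each 5-mer becomes one branch from its first 4 bases to its last 4 bases. The function name `de_bruijn_edges` and the adjacency-list layout are assumptions made for this illustration.

```python
from collections import defaultdict

def de_bruijn_edges(sequence, k):
    """Return one (left overlap, right overlap) pair per k-mer occurrence."""
    sequence = sequence.upper()
    edges = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        edges.append((kmer[:-1], kmer[1:]))  # overlaps have length k - 1
    return edges

edges = de_bruijn_edges("AAGCTTGAGCCgcttGAGA", 5)

# Adjacency list: a k-mer that occurs twice contributes two parallel branches.
successors = defaultdict(list)
for left, right in edges:
    successors[left].append(right)

for node in sorted(successors):
    print(node, "->", successors[node])
```

In this sketch TGAG is the only node with two different successors (GAGC and GAGA), which is where the loop leaves and re-enters the main path.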

Simplifications:
1. remove the k-mer labels, keeping only the overlaps as nodes

AAGC AGCT GCTT

CGCT CTTG
CCGC

Overlaps
GCCG TTGA
AGCC GAGC
TGAG
GAGA

Simplifications:
2. fuse unambiguous overlaps

AAGC AGCT GCTT

CGCT CTTG
CCGC

Overlaps
GCCG TTGA
AGCC GAGC
TGAG
GAGA

Simplifications:
2. fuse unambiguous overlaps

AAGCT GCTT

CGCT CTTG
CCGC

GCCG TTGA
AGCC GAGC
TGAG
GAGA

Simplifications:
2. fuse unambiguous overlaps

AAGCT GCTT

CCGCT CTTG

GCCG TTGA
AGCC GAGC
TGAG
GAGA

Simplifications:
2. fuse unambiguous overlaps

AAGCT GCTT

GCCGCT CTTG

TTGA
AGCC GAGC
TGAG
GAGA

Simplifications:
2. fuse unambiguous overlaps

AAGCT GCTT

AGCCGCT CTTG

TTGA
GAGC
TGAG
GAGA

Simplifications:
2. fuse unambiguous overlaps

AAGCT GCTT

GAGCCGCT CTTG

TTGA

TGAG
GAGA

Simplifications:
2. fuse unambiguous overlaps

AAGCT GCTT

GAGCCGCT CTTGA

TGAG
GAGA

Simplifications:
3. multiple branches linking the same nodes can be replaced by single weighted branches.

AAGCT GCTT
2

GAGCCGCT CTTGA

TGAG
GAGA
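
Replacing parallel branches by weighted ones amounts to counting how often each pair of overlaps is linked. The sketch below does this on the unfused graph with 4 base nodes; the function name `weighted_edges` is chosen here for illustration.

```python
from collections import Counter

def weighted_edges(sequence, k):
    """Map each (left overlap, right overlap) branch to its multiplicity."""
    sequence = sequence.upper()
    return Counter(
        (sequence[i:i + k - 1], sequence[i + 1:i + k])
        for i in range(len(sequence) - k + 1)
    )

weights = weighted_edges("AAGCTTGAGCCgcttGAGA", 5)
for (left, right), weight in sorted(weights.items()):
    print(left, "->", right, "weight", weight)
```

Under these assumptions the 15 branches collapse to 12 weighted branches, and the three branches along GCTT, CTTG, TTGA, TGAG carry weight 2.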

Simplifications:
2. fuse unambiguous overlaps
1
AAGCT GCTT
2
Start
5 6
GAGCCGCT CTTGA

4 7 3

Stop
8 TGAG
GAGA

Simplifications:
2. fuse unambiguous overlaps

AAGCT GCTTGA
Start

GAGCCGCT

Stop
TGAG
GAGA

Simplifications:
2. fuse unambiguous overlaps
1
AAGCT GCTTGAG
Start
3 2
4

GAGCCGCT

GAGA
Stop
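
The fusing steps can be sketched in Python. The rule assumed here is that a branch is unambiguous when it is the only distinct branch leaving its left node and the only distinct branch entering its right node, so parallel branches count once. The function name `fuse_unambiguous` is hypothetical, and a graph consisting only of a closed circle is not handled.

```python
from collections import defaultdict

def fuse_unambiguous(sequence, k):
    """Fuse chains of overlap nodes that are joined by unambiguous branches."""
    sequence = sequence.upper()
    succ, pred = defaultdict(set), defaultdict(set)
    for i in range(len(sequence) - k + 1):
        left, right = sequence[i:i + k - 1], sequence[i + 1:i + k]
        succ[left].add(right)
        pred[right].add(left)
    nodes = set(succ) | set(pred)

    def fusible(left, right):
        # The branch is the only way out of `left` and the only way into `right`.
        return len(succ[left]) == 1 and len(pred[right]) == 1

    fused = []
    for node in sorted(nodes):
        # Skip nodes that are absorbed into a chain starting further left.
        if len(pred[node]) == 1 and fusible(next(iter(pred[node])), node):
            continue
        label, current = node, node
        while len(succ[current]) == 1:
            following = next(iter(succ[current]))
            if not fusible(current, following):
                break
            label += following[-1]  # each fused node adds one base
            current = following
        fused.append(label)
    return fused

print(fuse_unambiguous("AAGCTTGAGCCgcttGAGA", 5))
# ['AAGCT', 'GAGA', 'GAGCCGCT', 'GCTTGAG']
```

The four labels printed are the four nodes left in the fully fused graph.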

To reconstruct the original sequence, find the Eulerian path:
[Figure: AAGCT (Start) -1-> GCTTGAG -4-> GAGA (Stop), with the loop GCTTGAG -2-> GAGCCGCT -3-> GCTTGAG]

An Eulerian path passes exactly once through every edge (branch) of the graph.
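
One common way to find such a path is Hierholzer's algorithm. The lecture does not name a method, so this choice, like the function names `eulerian_path` and `reconstruct`, is an assumption made for the sketch below. It works on the unfused graph, where every k-mer occurrence is one branch.

```python
from collections import defaultdict

def eulerian_path(kmers):
    """Return the overlap nodes visited by a path that uses every k-mer once."""
    graph = defaultdict(list)
    indegree, outdegree = defaultdict(int), defaultdict(int)
    for kmer in kmers:
        left, right = kmer[:-1], kmer[1:]
        graph[left].append(right)
        outdegree[left] += 1
        indegree[right] += 1
    nodes = set(indegree) | set(outdegree)
    # The start node has one more outgoing than incoming branch.
    start = next((n for n in nodes if outdegree[n] - indegree[n] == 1), None)
    if start is None:  # closed circuit: any node with a branch will do
        start = next(iter(graph))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused branch
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    path.reverse()
    return path

def reconstruct(sequence, k):
    sequence = sequence.upper()
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    path = eulerian_path(kmers)
    return path[0] + "".join(node[-1] for node in path[1:])

print(reconstruct("AAGCTTGAGCCgcttGAGA", 5))  # AAGCTTGAGCCGCTTGAGA
```

For this sequence the Eulerian path is unique, so the sketch returns the original sequence.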

Following branches 1, 2, 3 and 4 from Start to Stop yields the original sequence:
AAGCTTGAGCCgcttGAGA

Similarly, it is possible to build the de Bruijn graph for a sequence with three repeats.

AAGCTTGAGCCgcttGAGACTTGCTTGAGTAA

GAGACTTGCT

AAGCT GCTTGAG GAGTAA

GAGCCGCT

Two solutions are possible for the reconstruction of the original sequence: 1-2-3-4-5-6 or 1-4-5-2-3-6, depending on which loop in the graph is followed first.

AAGCTTGAGCCgcttGAGACTTGCTTGAGTAA

[Figure: AAGCT -1-> GCTTGAG -6-> GAGTAA, with two loops at GCTTGAG: branches 2 and 3 run through GAGACTTGCT, branches 4 and 5 run through GAGCCGCT]

The two paths spell out:
AAGCTTGAGCCgcttGAGACTTGCTTGAGTAA
or
AAGCTTGAGACTTGCTTGAGCCGCTTGAGTAA
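
Why the graph cannot tell these two solutions apart can be checked directly: both sequences consist of exactly the same 5-mers with the same counts. A minimal sketch, with the helper name `kmer_counts` chosen for illustration:

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping k-mer of the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

first = "AAGCTTGAGCCgcttGAGACTTGCTTGAGTAA"
second = "AAGCTTGAGACTTGCTTGAGCCGCTTGAGTAA"

print(first.upper() != second)                           # True
print(kmer_counts(first, 5) == kmer_counts(second, 5))   # True
```

Since the graph is built from the k-mers alone, any two sequences with identical k-mer counts give the same graph.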

The DNA sequence could also be dissected into longer oligomers, for instance of length 10 (k-mer length 10):

AAGCTTGAGCCgcttGAGA

AAGCTTGAGC AGCCGCTTGA
AGCTTGAGCC GCCGCTTGAG
GCTTGAGCCG CCGCTTGAGA
CTTGAGCCGC
TTGAGCCGCT
TGAGCCGCTT
GAGCCGCTTG

Now all oligomers are unique, since they are longer than the repeats, and the reconstruction of the original sequence is straightforward.

AAGCTTGAG AGCTTGAGC GCTTGAGCC

CTTGAGCCG TTGAGCCGC TGAGCCGCT

GAGCCGCTT AGCCGCTTG GCCGCTTGA

CCGCTTGAG CGCTTGAGA
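
The effect of the k-mer length can be checked by listing the oligomers that occur more than once for different values of k. This is a small sketch; `repeated_kmers` is a name chosen here.

```python
from collections import Counter

def repeated_kmers(sequence, k):
    """Return the k-mers that occur more than once, sorted."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return sorted(kmer for kmer, n in counts.items() if n > 1)

sequence = "AAGCTTGAGCCgcttGAGA"
for k in (5, 9, 10):
    print(k, repeated_kmers(sequence, k))
# 5 ['CTTGA', 'GCTTG', 'TTGAG']
# 9 []
# 10 []
```

With k = 10 both the 10-mers and their 9 base overlaps are unique, so every node has at most one successor and the graph is a single chain.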

Error correction:
Reads produced during sequencing often contain errors, i.e. bases that have been called incorrectly.

CTTGAGCCGC TTGAGCCGCT TGAGCCGCTT

TTGAGCCGCA

TGAGCCGCAG

These errors often show up as dead-end side branches of the de Bruijn graph, which can be cut off.

CTTGAGCCGC TTGAGCCGCT TGAGCCGCTT

TTGAGCCGCA

TGAGCCGCAG
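
Cutting off a dead-end side branch could look like the sketch below. Several things are assumptions made for illustration: the 10 base labels shown above are treated as nodes (so the branches are 11-mers), the erroneous read `bad_read` is invented to produce exactly the side branch TTGAGCCGCA, TGAGCCGCAG, and the threshold `max_tip_nodes` is a hypothetical parameter. Real assemblers typically also use coverage to decide which branch is the error.

```python
from collections import defaultdict

def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def clip_tips(kmer_set, max_tip_nodes):
    """Drop k-mers on dead-end side branches of at most max_tip_nodes nodes."""
    succ, pred = defaultdict(set), defaultdict(set)
    for kmer in kmer_set:
        succ[kmer[:-1]].add(kmer[1:])
        pred[kmer[1:]].add(kmer[:-1])
    dead_ends = [node for node in pred if node not in succ]
    doomed = set()
    for end in dead_ends:
        tip, current = [end], end
        # Walk backwards until a branching node is reached or the tip is too long.
        while len(tip) <= max_tip_nodes and len(pred[current]) == 1:
            parent = next(iter(pred[current]))
            if len(succ[parent]) > 1:  # the side branch starts here
                doomed.update(tip)
                break
            tip.append(parent)
            current = parent
    return {kmer for kmer in kmer_set
            if kmer[:-1] not in doomed and kmer[1:] not in doomed}

reference = "AAGCTTGAGCCGCTTGAGA"
bad_read = "CTTGAGCCGCAG"  # hypothetical read with wrong bases at the end
all_kmers = set(kmers(reference, 11)) | set(kmers(bad_read, 11))

cleaned = clip_tips(all_kmers, max_tip_nodes=2)
print(cleaned == set(kmers(reference, 11)))  # True
```

The correct end of the sequence is also a dead end, but its branch is longer than the threshold, so it is kept.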

Summary

De Bruijn graphs can be used to reconstruct (assemble) a DNA sequence from smaller fragments.

The reconstruction will not always be unique if the sequence contains repeats longer than the oligomer (or k-mer) length used.

Using longer k-mers helps with the assembly, but the k-mer length is limited by the sequencing technology, since a k-mer cannot be longer than the reads.

The graph also helps to find and eliminate errors made during
sequencing.

Links

Wikipedia entry for "de Bruijn graph":
https://en.wikipedia.org/wiki/De_Bruijn_graph

Python implementation:
https://gist.github.com/BenLangmead/5298132

De Bruijn Graph assembly written by Ben Langmead:
http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_dbg.pdf

Exercise 1
Complete the non-simplified de Bruijn graph of the sequence given. The k-mer length is 4 bases with 3 bases overlap.

1   5    10
AGGTCGTCGAGG

AGG

AGG

Exercise 2

Reconstruct the original sequence from this simplified de Bruijn graph with 4 bases overlap.

TAATCGT

AGGTCGT TCGTAAT TAATGGC TGGCTTT

TGGCTAAT
