MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Dinghua Li, Chi-Man Liu, Ruibang Luo et al.

Seminar By

Khushboo Akhlaq

Outline

1 Metagenomics

2 Challenges

3 Motivation

4 Background Methodology

5 Methodology

6 Results

7 Summary

8 Conclusion

Metagenomics

Metagenomics data processing steps:

Challenges

Complex and large metagenomics data

Assembly time
Cost
Memory requirements

Motivation

What is needed?

An assembler with:
Minimal assembly time
Better assembly quality
Low memory requirements

Background Methodology

Older assembly algorithms: overlap graphs.

Problem: vast amounts of genomic data
Solution: de Bruijn graph

De Bruijn Graph
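
As a minimal sketch of the idea, assuming the common convention that vertices are k-mers and edges are the (k+1)-mers found in the reads, a de Bruijn graph can be built as shown below. Reverse complements and error filtering are ignored, and the function name `build_de_bruijn` and the toy reads are illustrative only.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer vertex to the set of k-mer vertices it points to."""
    graph = defaultdict(set)
    for read in reads:
        # Every (k+1)-mer in a read is one edge: prefix k-mer -> suffix k-mer.
        for i in range(len(read) - k):
            edge = read[i:i + k + 1]
            graph[edge[:-1]].add(edge[1:])
    return dict(graph)


graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
for vertex in sorted(graph):
    print(vertex, "->", sorted(graph[vertex]))
```

The two overlapping toy reads collapse into a single path, which is what lets the graph scale better than pairwise read overlaps.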

Methodology

Succinct de Bruijn Graph

Cross-check the edges obtained from the dBG

Edge present = 1
Edge absent = 0
Efficient edge removal
Auxiliary vector of 2kt bits (k: k-mer size; t: number of zero-indegree vertices)
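
The presence flags can be pictured as one bit per edge: clearing a bit removes the edge without rebuilding the structure. The sketch below illustrates only that idea on a plain sorted edge list; it is not the succinct encoding itself, and the names `EdgeValidity` and `aux_vector_bits` are hypothetical. The 2kt figure is read here as k nucleotides at 2 bits each for t zero-indegree vertices, which is an assumption about the slide's notation.

```python
class EdgeValidity:
    """One validity flag per edge of a toy sorted edge list."""

    def __init__(self, edges):
        self.edges = sorted(set(edges))
        self.index = {edge: i for i, edge in enumerate(self.edges)}
        # One byte per flag for simplicity; a real bit-vector packs 8 per byte.
        self.valid = bytearray([1] * len(self.edges))  # 1 = edge present

    def remove(self, edge):
        self.valid[self.index[edge]] = 0  # 0 = edge absent

    def has(self, edge):
        i = self.index.get(edge)
        return i is not None and self.valid[i] == 1

    def live_edges(self):
        return [e for e, flag in zip(self.edges, self.valid) if flag]


def aux_vector_bits(k, t):
    """Bits needed to store t sequences of k nucleotides at 2 bits each."""
    return 2 * k * t


ev = EdgeValidity(["ACGT", "CGTA", "GTAC"])
ev.remove("CGTA")
print(ev.live_edges())
print(aux_vector_bits(21, 1000))
```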

BWT-construction Algorithm

BWT (Burrows-Wheeler transform) - block-sorting compression

Aligning the k-mers
Input of length n
Rotate each row by one character
Sort the rotations lexicographically
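
The rotate-and-sort steps can be sketched as follows. This is the naive textbook construction, shown only to make the steps concrete; it is not the GPU-accelerated construction used by MEGAHIT. The `$` sentinel is an assumed end-of-string marker that sorts before the nucleotide letters.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform via sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # All n rotations of the input, each shifted by one more character.
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort()  # lexicographic order
    # The transform is the last column of the sorted rotation matrix.
    return "".join(row[-1] for row in rotations)


print(bwt("banana"))  # annb$aa
```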

How does BWT work?

Back Transformation
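
A minimal sketch of the back transformation, assuming the transformed string ends its original text with a unique `$` sentinel: the sorted rotation table is rebuilt one column at a time by prepending the BWT column and re-sorting. This is the simple textbook method, not an efficient one.

```python
def inverse_bwt(last_column, sentinel="$"):
    """Recover the original text from its Burrows-Wheeler transform."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the BWT column to every row, then sort the rows again.
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original text.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(inverse_bwt("annb$aa"))  # banana
```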



CX1-GPU

CX1 - Splitters P0, P1, ..., Pu


P0 - Minimum size
Pu - Maximum size
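
The general idea behind splitters can be sketched as below: items are routed into buckets delimited by the splitters, and each bucket can then be sorted independently (for example, one batch at a time on a GPU). This illustrates splitter-based partitioning in general and is not the CX1 implementation; the function names and splitter values are hypothetical.

```python
from bisect import bisect_right


def partition_by_splitters(items, splitters):
    """Route each item to the bucket delimited by sorted splitters."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for item in items:
        buckets[bisect_right(splitters, item)].append(item)
    return buckets


def bucket_sort(items, splitters):
    """Sort each bucket on its own, then concatenate in splitter order."""
    result = []
    for bucket in partition_by_splitters(items, sorted(splitters)):
        result.extend(sorted(bucket))
    return result


kmers = ["GTA", "ACG", "TTT", "CGA", "AAA"]
print(partition_by_splitters(kmers, ["C", "G"]))
print(bucket_sort(kmers, ["C", "G"]))
```

Because every item in one bucket precedes every item in the next, concatenating the sorted buckets gives a fully sorted result.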

What does MEGAHIT do?

BWT algorithm - sort in reverse lexicographical order

Succinct data structures - zero-indegree vertices
GPU - CX1; multiple k-mer strategy
Low-depth regions - mercy k-mer strategy
Soil metagenomic data set - 252 billion bp

Mercy k-mer Strategy

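To recover low-depth edges.

Take two solid (k+1)-mers x and y from the same read, where x has no outgoing edge and y has no incoming edge. If none of the (k+1)-mers between x and y in that read are solid, they are added to the graph as mercy k-mers.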
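
A simplified sketch of this rule follows. It assumes that a (k+1)-mer is "solid" when it occurs at least `min_count` times, that the non-solid (k+1)-mers lying between x and y in the read are the ones rescued, and it ignores reverse complements. The function names and the toy reads are illustrative only.

```python
from collections import Counter


def kmers(seq, size):
    """All substrings of the given size, in read order."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def find_mercy_kmers(reads, k, min_count=2):
    counts = Counter(e for read in reads for e in kmers(read, k + 1))
    solid = {e for e, c in counts.items() if c >= min_count}
    has_out = {e[:-1] for e in solid}  # vertices with an outgoing solid edge
    has_in = {e[1:] for e in solid}    # vertices with an incoming solid edge
    mercy = set()
    for read in reads:
        edges = kmers(read, k + 1)
        i = 0
        while i < len(edges):
            x = edges[i]
            if x in solid and x[1:] not in has_out:
                # x is a dead end: skip over the non-solid edges after it.
                j = i + 1
                while j < len(edges) and edges[j] not in solid:
                    j += 1
                # y = edges[j] must be solid and have no incoming edge.
                if i + 1 < j < len(edges) and edges[j][:-1] not in has_in:
                    mercy.update(edges[i + 1:j])
                i = j
            else:
                i += 1
    return mercy


reads = ["ACGTA", "ACGTA", "GGCTC", "GGCTC", "ACGTAGGCTC"]
print(sorted(find_mercy_kmers(reads, k=3)))
```

In the toy input, the single read spanning both well-covered regions supplies the three low-depth (k+1)-mers that reconnect them.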

Workflow

Results

                          MEGAHIT   MEGAHIT   MEGAHIT   SPAdes
                          100x      20x       10x       10x
N50 (bp)                  73 736    52 352    9067      18 264
Largest alignment (bp)    221k      178k      31k       62k
bp in contigs >=1 kbp     4.55M     4.55M     4.52M     4.55M
Genome fraction           98.0%     98.1%     97.4%     97.9%
Misassemblies (bp)        2k        41k       81k       64k
Wall time (s)             185       82        47        318
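
For readers unfamiliar with the N50 row, a minimal sketch of the usual definition is given below: the contig length L such that contigs of length L or longer hold at least half of the total assembled bases. The contig lengths in the example are made up and are not taken from the results.

```python
def n50(lengths):
    """Length L such that contigs >= L cover at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("at least one contig is required")


print(n50([2, 3, 4, 5, 6]))  # 5
```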

Results

                          MEGAHIT      Howe et al.   Minia
Wall time (h)             44.1         >488          331.4
Peak memory (GB)          345          287           29
Total size (Mbp)          4902         1503          1490
Average length (bp)       633          485           505
N50 (bp)                  657          471           488
Longest (bp)              184210       9397          32679
No. of contigs            7 749 211    3 096 464     2 951 575
No. of contigs >=1 kbp    841 257      129 513       158 402

Results

                                     MEGAHIT   Howe et al.   Minia
Total reads                          3 252 369 195
Reads aligned overall (%)            55.81     10.72         13.03
Total SE reads                       356 742 333
SE aligned 1 time (%)                37.00     8.72          12.38
SE aligned >1 time (%)               14.68     0.32          0.02
Total PE reads                       1 447 813 431
PE properly aligned 1 time (%)       36.78     7.41          9.48
PE properly aligned >1 time (%)      8.90      0.20          0.01
PE improperly aligned (%)            2.67      0.54          0.82

Summary

Generation of large NGS data


No reference genome - De novo assembly approach
Assembly - SdBG
Alignment - BWT CX1
Contigs coverage

Conclusion

MEGAHIT: three times larger assembly

Longer N50 contigs
More than four times as many reads mapped
Time: 44.1 h (about 7 times faster)
Time-efficient with low memory usage
Available in both CPU and GPU-accelerated versions
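
The ratios quoted above can be checked against the figures on the Results slides. The short calculation below uses the MEGAHIT and Minia columns; the variable names are illustrative, and the comparison with Howe et al. is left out of the speed ratio because only a lower bound (>488 h) is reported for it.

```python
# Values copied from the Results tables (soil data set).
megahit = {"total_mbp": 4902, "aligned_pct": 55.81, "wall_h": 44.1}
minia = {"total_mbp": 1490, "aligned_pct": 13.03, "wall_h": 331.4}

size_ratio = megahit["total_mbp"] / minia["total_mbp"]
mapped_ratio = megahit["aligned_pct"] / minia["aligned_pct"]
speedup = minia["wall_h"] / megahit["wall_h"]

print(f"assembly size ratio: {size_ratio:.2f}")
print(f"mapped reads ratio:  {mapped_ratio:.2f}")
print(f"speed-up:            {speedup:.2f}")
```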

References

Li, Dinghua, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. "MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph." Bioinformatics 31.10 (2015): 1674-1676. doi: 10.1093/bioinformatics/btv033
Li, Heng, and Richard Durbin. "Fast and accurate long-read alignment with Burrows–Wheeler transform." Bioinformatics 26.5 (2010): 589-595.
Liu, Chi-Man, Ruibang Luo, and Tak-Wah Lam. "GPU-accelerated BWT construction for large collection of short reads." arXiv preprint arXiv:1401.7457 (2014).

Thank You

THANK YOU!!!
