KrakenHLL: Confident and Fast Metagenomics Classification Using Unique K-Mer Counts
Muhammad Hamid
08.10.2018
Outline
1 Challenges
2 Motivation
3 Methodology
4 Results
5 Conclusion
Taxonomic profiling
Assembly-based method
False positive reads
Contamination
Less than 0.1% of the DNA sequenced
HyperLogLog algorithm
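Position of first 1-bit (k)    Bit pattern       Probability (2^-k)    1 / probability (2^k)
3                              001xxxxxxxx..x    0.125                 8
4                              0001xxxxxxx..x    0.0625                16
5                              00001xxxxxx..x    0.03125               32
Table 1: (Supplementary) Probabilities of observing the first 1-bit at a given position in a random bit string.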
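The rows of Table 1 follow from a simple rule: in a uniformly random bit string, the first 1-bit sits at position k with probability 2^-k. A minimal Python sketch of that rule is shown here; the function names are chosen for illustration and do not come from KrakenHLL.

```python
def rank(bits: str) -> int:
    """Return the 1-based position of the first '1' in a bit string.

    Assumes the string contains at least one '1'.
    """
    return bits.index("1") + 1


def first_one_bit_probability(k: int) -> float:
    """Probability that the first 1-bit of a random bit string is at position k."""
    return 2.0 ** -k


if __name__ == "__main__":
    # Reproduce the three rows shown in Table 1.
    for k in (3, 4, 5):
        p = first_one_bit_probability(k)
        print(k, p, int(1 / p))
```

Observing a rank of k therefore suggests that on the order of 2^k distinct values have been hashed, which is the intuition HyperLogLog builds on.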
Cardinality estimation with HLL algorithm
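Hash: MurmurHash3
Hash (H), 64 bits, split into p bits and 64 - p = q bits:
  p bits: index i into the registers M
  q bits: define the rank based on the position of the first 1-bit
Registers M: 32 registers in the example, indexed 00000 (register 1) to 11111 (register 32)
Sparse representation: n << m
Standard representation
Sketch: the register array M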
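The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the KrakenHLL implementation: it uses `hashlib.blake2b` as a stand-in 64-bit hash because MurmurHash3 is not in the Python standard library, it implements only the standard (dense) representation, and it uses the original HyperLogLog estimator with the small-range linear counting correction. The class and method names are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified dense HyperLogLog sketch with m = 2**p registers."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item: str) -> int:
        # Stand-in for MurmurHash3: any well-mixed 64-bit hash works here.
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, item: str) -> None:
        h = self._hash64(item)
        q = 64 - self.p
        index = h >> q                   # first p bits: register index i
        rest = h & ((1 << q) - 1)        # remaining q bits
        # Rank = position of the first 1-bit in the q-bit remainder
        # (q + 1 if the remainder is all zeros).
        rank = q - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog(p=12)
    for i in range(50000):
        hll.add(f"kmer-{i}")
        hll.add(f"kmer-{i}")  # duplicates do not change the sketch
    print(round(hll.estimate()))
```

With p = 12 the sketch holds 4096 small registers regardless of how many items are added, and the expected relative error is roughly 1.04 / sqrt(m), about 1.6%.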
Figure 5: Using unique k-mers as thresholds instead of reads can give higher F1 scores.
Results on simulated and biological data
Figure 6: Unique k-mer counts separate true identifications better from false ones.
Infectious disease diagnosis
HyperLogLog algorithm:
F1 score:
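The F1 score is the harmonic mean of precision and recall. A minimal Python sketch of how it could be computed for a set of identified taxa, after filtering by a count threshold such as reads or unique k-mers, is given here. The function names and the dictionary layout are assumptions made for illustration and are not taken from KrakenHLL.

```python
def taxa_above_threshold(counts: dict, threshold: int) -> set:
    """Keep taxa whose count (e.g. reads or unique k-mers) meets the threshold."""
    return {taxon for taxon, count in counts.items() if count >= threshold}


def f1_score(true_taxa: set, predicted_taxa: set) -> float:
    """Harmonic mean of precision and recall over sets of taxa."""
    true_positives = len(true_taxa & predicted_taxa)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_taxa)
    recall = true_positives / len(true_taxa)
    return 2 * precision * recall / (precision + recall)
```

Sweeping the threshold and recomputing the F1 score at each value is one way to compare read counts and unique k-mer counts as filtering criteria.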
1. Breitwieser, F.P., Lu, J. and Salzberg, S.L. A review of methods and databases for metagenomic classification and
assembly. Brief Bioinform 2017.
2. Brown, J.R., Bharucha, T. and Breuer, J. Encephalitis diagnosis using metagenomics: application of next
generation sequencing for undiagnosed cases. Journal of Infection 2018.
3. Dadi TH, Renard BY, Wieler LH, Semmler T, Reinert K: SLIMM: species level identification of microorganisms from
metagenomes. PeerJ 2017, 5:e3138.
4. Demo of metagenomic classification using KRAKEN - FROM READS TO RESULTS | Coursera. (n.d.). Retrieved from
https://www.coursera.org/lecture/metagenomics/demo-of-metagenomic-classification-using-kraken-7GIn6
5. Ertl O: New Cardinality Estimation Methods for HyperLogLog Sketches. arXiv:1706.07290 2017.
6. Flajolet P, Fusy É, Gandouet O, Meunier F: HyperLogLog: the analysis of a near optimal cardinality estimation
algorithm. In AofA: Analysis of Algorithms; 2007-06-17; Juan les Pins, France. Discrete Mathematics and
Theoretical Computer Science; 2007: 137-156.
7. Heule S, Nunkesser M, Hall A: HyperLogLog in practice. 2013:683.
8. Kraken 2 Manual. (n.d.). Retrieved from http://ccb.jhu.edu/software/kraken/MANUAL.html#classification
9. Mukherjee S, Huntemann M, Ivanova N, Kyrpides NC, Pati A: Large-scale contamination of microbial isolate
genomes by Illumina PhiX control. Stand Genomic Sci 2015, 10:18.
10. Quince C, Walker AW, Simpson JT, Loman NJ, Segata N: Shotgun metagenomics, from sampling to analysis. Nat
Biotechnol 2017, 35:833-844.
11. Salter, S.J., et al. Reagent and laboratory contamination can critically impact sequence-based microbiome
analyses. BMC Biol. 2014;12:87.
12. Wood, D. E., & Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact
alignments. Genome Biology, 15(3), R46. doi:10.1186/gb-2014-15-3-r46