KrakenHLL: Confident and fast metagenomics classification using unique k-mer counts

Breitwieser FP and Salzberg SL


Seminar by

Muhammad Hamid

08.10.2018
Outline

1 Challenges
2 Motivation
3 Methodology
4 Results
5 Conclusion

KrakenHLL: Confident and fast metagenomics classification using unique k-mer counts | Muhammad Hamid
Challenges

• Taxonomic profiling
• Assembly-based method
• False positive reads
• Contamination
• Microbes of interest may make up less than 0.1% of the DNA sequenced

Motivation

• Identify and filter false positive reads with:
– Minimal runtime
– Low memory
– Better quality

KrakenHLL

Classification + Cardinality Estimation

Improvements:
1. Searches can be run against multiple databases
2. The taxonomy can be extended with nodes for strains and plasmids
3. The database build script can add over 100 thousand viral strains from the NCBI Viral Genome Resource

KrakenHLL is a superset of Kraken

Methodology

Kraken: Assigns taxonomic labels to metagenomic DNA sequences.
HyperLogLog: Identifies unique k-mer counts using a probabilistic cardinality estimator.
KrakenHLL: Metagenomic classifier that combines Kraken and HLL.

Kraken Database

Figure 1: Kraken database (columns: Genomes, K-mers, Taxonomic IDs).


Kraken Classification

Figure 2: The Kraken sequence classification algorithm.
Source: Wood and Salzberg (2014), Kraken: ultrafast metagenomic sequence classification using exact alignments.
Kraken algorithm summary

• Chop genomes into k-mers and link each k-mer to a taxonomic ID
• Chop each read into k-mers and search for exact hits in the database
• Find the highest-weighted root-to-leaf (RTL) path and assign the taxonomic ID of its lowest node to the read
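
The three steps above can be sketched in a few lines of Python. This is a toy illustration, not Kraken's implementation: the taxonomy `PARENT`, the two short "genomes", and the function names are all made up for the example. A k-mer shared by several genomes is linked to their lowest common ancestor, and ties between equally weighted paths are not handled here (Kraken resolves them with the LCA of the tied leaves).

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root has parent None).
PARENT = {"root": None, "Bacteria": "root",
          "E. coli": "Bacteria", "Salmonella": "Bacteria"}

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def path_to_root(taxon):
    """Taxon first, root last."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path

def lca(a, b):
    """Lowest common ancestor of two taxa."""
    ancestors = set(path_to_root(a))
    return next(t for t in path_to_root(b) if t in ancestors)

def build_database(genomes, k):
    """Step 1: link every k-mer to a taxonomic ID (LCA if shared)."""
    db = {}
    for taxon, seq in genomes.items():
        for km in kmers(seq, k):
            db[km] = lca(db[km], taxon) if km in db else taxon
    return db

def classify(read, db, k):
    """Steps 2 and 3: exact k-mer hits, then best root-to-leaf path."""
    hits = Counter(db[km] for km in kmers(read, k) if km in db)
    if not hits:
        return None  # unclassified
    # Weight of the path ending at t = hits on t and on all its ancestors.
    def path_weight(t):
        return sum(hits[a] for a in path_to_root(t))
    return max(hits, key=path_weight)

# Made-up sequences, only for illustration.
GENOMES = {"E. coli": "ACGTACGGA", "Salmonella": "ACGTTTGCA"}
DB = build_database(GENOMES, k=4)

print(classify("CGTACGG", DB, 4))  # all k-mers are specific to one genome
print(classify("ACGTT", DB, 4))    # one shared k-mer plus one specific k-mer
print(classify("ACGT", DB, 4))     # only the shared k-mer
```

The shared k-mer `ACGT` is stored at the common ancestor, so a read that contains nothing else can only be classified at that level.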

KrakenHLL: Confident and fast metagenomics classification using


Muhammad
unique k-mer
Hamid
counts 16
Kraken Commands

# The commands after the build assume the environment variable KRAKEN_DEFAULT_DB
# points to standardDB; otherwise pass --db standardDB to each of them.

# Build the database
kraken-build --db standardDB --threads 10 --standard

# Classify each read individually
kraken --threads 4 --paired R1.fq R2.fq > OUT.kra

# Add labels to the classification
kraken-translate OUT.kra > OUT.kraken.txt

# Generate an aggregate report
kraken-report OUT.kra > OUT.report.txt

# Multi-sample report
kraken-mpa-report --header *.kra > project_report.txt

HyperLogLog

What it can do:
• Count unique elements
• Calculate an approximate number
• Typical error less than 2%

What it can’t do:
• Give an exact count
• Track frequency of occurrence
• Confirm whether a certain element was seen

HyperLogLog algorithm

General idea: Count leading zeros in a randomly generated binary number.

k    Pattern_k                 P_k                              E_k
1    1xxxxxxxxxx..x            0.5                              2
2    01xxxxxxxxx..x            0.25                             4
3    001xxxxxxxx..x            0.125                            8
4    0001xxxxxxx..x            0.0625                           16
5    00001xxxxxx..x            0.03125                          32
l    $0^{l-1}\,1\,x^{n-l}$     $2^{-l} = 0.5^{l} = 1/2^{l}$     $2^{l} = 1/P_l$

Table 1: (Supplementary) Probability of observing the first 1-bit at position k in a random bit string.
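
The probabilities in Table 1 can be checked by brute force. The sketch below enumerates every bit string of a fixed length (8 bits here, an arbitrary choice for the example) and counts where the first 1-bit occurs; the function names are illustrative only.

```python
from fractions import Fraction
from itertools import product

def first_one_position(bits):
    """1-based position of the first 1-bit, or None if all bits are 0."""
    for pos, bit in enumerate(bits, start=1):
        if bit == 1:
            return pos
    return None

def position_probabilities(n):
    """Exact probability of each first-1-bit position over all n-bit strings."""
    counts = {}
    for bits in product((0, 1), repeat=n):
        pos = first_one_position(bits)
        counts[pos] = counts.get(pos, 0) + 1
    return {pos: Fraction(c, 2 ** n) for pos, c in counts.items()}

PROBS = position_probabilities(8)
for k in range(1, 6):
    # P_k = 2^-k; E_k = 1 / P_k is the expected number of random
    # strings needed before this pattern shows up once.
    print(k, float(PROBS[k]), 1 / PROBS[k])
```

A long run of leading zeros is therefore evidence that many distinct values have been hashed, which is what the estimator exploits.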
Cardinality estimation with HLL algorithm

Example with p = 5: the 32 registers are addressed by the 5-bit prefixes 00000 (register 1), 00001 (2), 00010 (3), 00011 (4), 00100 (5), ..., 11111 (32).

• $m = 2^{p}$ one-byte registers
• Relative error $\approx 1.04/\sqrt{2^{p}} = 1.04 \cdot 2^{-p/2}$

P     M         Space (kB)    Rel. Error
10    1024      1             3.25%
11    2048      2             2.23%
12    4096      4             1.63%
13    8192      8             1.15%
14    16384     16            0.81%
15    32768     32            0.57%
16    65536     64            0.41%
17    131072    128           0.29%
18    262144    256           0.20%
25                            0.02%

Table 2: Cardinality estimation on randomly sampled microbial k-mers using HyperLogLog.
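
The columns of Table 2 follow from the precision p alone. The sketch below assumes one byte per register and the standard HyperLogLog error bound of 1.04/√m from Flajolet et al.; `hll_parameters` is a name chosen for this example.

```python
import math

def hll_parameters(p):
    """Register count, memory in kB and expected relative error for precision p."""
    m = 2 ** p                       # number of registers
    space_kb = m / 1024              # one byte per register
    rel_error = 1.04 / math.sqrt(m)  # standard HyperLogLog error bound
    return m, space_kb, rel_error

for p in (10, 12, 14, 16, 18):
    m, kb, err = hll_parameters(p)
    print(f"p={p:2d}  m={m:7d}  space={kb:4.0f} kB  rel. error={err:.2%}")
```

Each extra bit of precision doubles the memory but only reduces the error by a factor of √2.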


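Generating the Sketch

1. Hash each k-mer to a 64-bit value H with MurmurHash3
2. Split H into the first p bits and the remaining q = 64 - p bits
   – The first p bits give the index i into the registers M
   – The remaining q bits define the rank, based on the position of the first 1-bit
3. Store the result in the sketch
   – Sparse representation while n << m
   – Standard representation otherwise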
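
A minimal sketch of the standard (dense) representation is shown below. It is not KrakenHLL's code: a 64-bit BLAKE2b digest from Python's standard library stands in for MurmurHash3, the sparse representation is omitted, and the estimator is the classic one from Flajolet et al. with its small-range correction. The class and method names are made up for the example.

```python
import hashlib
import math

class HyperLogLog:
    """Dense HyperLogLog sketch with 2^p one-byte registers (p >= 7)."""

    def __init__(self, p=12):
        self.p = p
        self.q = 64 - p            # bits left for the rank
        self.m = 1 << p            # number of registers
        self.registers = bytearray(self.m)

    @staticmethod
    def _hash64(item):
        # Assumption: BLAKE2b replaces MurmurHash3 to stay in the stdlib.
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, item):
        h = self._hash64(item)
        index = h >> self.q                    # first p bits
        rest = h & ((1 << self.q) - 1)         # remaining q bits
        rank = self.q - rest.bit_length() + 1  # position of the first 1-bit
        if rank > self.registers[index]:
            self.registers[index] = rank       # keep the maximum rank seen

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)       # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)     # small-range correction
        return raw

sketch = HyperLogLog(p=12)
for i in range(50_000):
    sketch.add(f"kmer-{i}")
    sketch.add(f"kmer-{i}")  # duplicates do not change the sketch
ESTIMATE = sketch.estimate()
print(round(ESTIMATE))
```

With p = 12 the sketch uses 4 kB regardless of how many items are added, and the estimate should land within a few percent of the 50,000 distinct items.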

Results

Figure 3: KrakenHLL algorithm and report.


Performance and Memory Usage

Figure 4: Cardinality estimation on randomly sampled microbial k-mers using HyperLogLog. The precision settings are those listed in Table 2 (P, M, space in kB, relative error in %).


Results on simulated and biological data

• Simulated datasets do not represent biological data
• KrakenHLL was tested on 10 biological and 21 simulated datasets
• Unique k-mer counts gave better results than read counts

Main measure for comparison is the max F1 score
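
One way to read "max F1 score" is: try every possible threshold on a metric, call a taxon present if its value reaches the threshold, and keep the best F1 obtained. The sketch below follows that reading. The taxa, counts and truth set are invented for illustration and are not taken from the study.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def max_f1(scores, truth):
    """Best F1 over all thresholds; returns (f1, threshold)."""
    best = (0.0, None)
    for threshold in sorted(set(scores.values())):
        called = {t for t, s in scores.items() if s >= threshold}
        f1 = f1_score(len(called & truth),
                      len(called - truth),
                      len(truth - called))
        if f1 > best[0]:
            best = (f1, threshold)
    return best

# Hypothetical per-taxon metrics for one sample.
READS = {"A": 500, "B": 40, "C": 300, "D": 20}
KMERS = {"A": 9000, "B": 2500, "C": 12, "D": 900}
TRUTH = {"A", "B", "D"}  # taxon C is a false positive

print("reads :", max_f1(READS, TRUTH))
print("k-mers:", max_f1(KMERS, TRUTH))
```

In this toy case taxon C has many reads but almost no unique k-mers, so no read threshold removes it without also losing true taxa, while a k-mer threshold does.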

Results on simulated and biological data

Figure 5: Using unique k-mers as thresholds instead of reads can give higher F1 scores.
Results on simulated and biological data

Figure 6: Unique k-mer counts separate true identifications better from false ones.
Infectious disease diagnosis

• Unique k-mer count metric can be used to rank and identify pathogens.

Sample    Name                           Reads    K-mers    Bases
PT5       Human polyomavirus 2           9650     7129*     5130
PT7       Elizabethkingia genomosp. 3    403      20724     52921
PT8       Mycobacterium tuberculosis     15       1570      2201
PT10      Human gammaherpesvirus 4       20       2084      2780

Table 3: Pathogen identifications in patients with suspected neurological infections.


Infectious disease diagnosis

• Dubious identifications have few k-mers

Sample    Name                                                                   Reads    K-mers
PT3       Clostridioides difficile                                               122      126
PT4       Hepatitis C virus                                                      101      3
          (JF343788.1 Recombinant Hepatitis C virus)
PT5       Akkermansia muciniphila                                                936      136
PT10      Human betaherpesvirus 5                                                63       5
          (JN379815.1 UNVERIFIED: Human herpesvirus 5 strain U04, partial genome)

Table 4: Doubtful identifications
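
The contrast between Tables 3 and 4 can be made concrete with a small filter. The rows below are copied from the two tables; the cutoff `MIN_UNIQUE_KMERS = 1000` is a hypothetical value picked only because it separates these eight rows, not a threshold recommended in the talk.

```python
# (sample, name, reads, unique k-mers) as listed in Tables 3 and 4.
IDENTIFICATIONS = [
    ("PT5", "Human polyomavirus 2", 9650, 7129),
    ("PT7", "Elizabethkingia genomosp. 3", 403, 20724),
    ("PT8", "Mycobacterium tuberculosis", 15, 1570),
    ("PT10", "Human gammaherpesvirus 4", 20, 2084),
    ("PT3", "Clostridioides difficile", 122, 126),
    ("PT4", "Hepatitis C virus", 101, 3),
    ("PT5", "Akkermansia muciniphila", 936, 136),
    ("PT10", "Human betaherpesvirus 5", 63, 5),
]

MIN_UNIQUE_KMERS = 1000  # hypothetical cutoff, for illustration only

def split_by_kmers(rows, min_kmers):
    """Separate rows with enough unique k-mers from dubious ones."""
    kept = [r for r in rows if r[3] >= min_kmers]
    flagged = [r for r in rows if r[3] < min_kmers]
    return kept, flagged

KEPT, FLAGGED = split_by_kmers(IDENTIFICATIONS, MIN_UNIQUE_KMERS)
BY_READS = sorted(IDENTIFICATIONS, key=lambda r: r[2], reverse=True)

print("kept   :", [name for _, name, _, _ in KEPT])
print("flagged:", [name for _, name, _, _ in FLAGGED])
print("second by read count:", BY_READS[1][1])
```

Ranking by reads alone would place one of the doubtful identifications second, which is the case the unique k-mer count is meant to catch.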


Storing strain genomes with assembly project

• In 2014, the NCBI Taxonomy project stopped assigning new taxonomy IDs to strains
• The new way is to use BioProject, BioSample and Assembly accession codes

Timing and memory requirements

• Average additional memory was less than 1 GB

                    Kraken     KrakenHLL
Patient datasets    118 GB     118.35 GB
Test datasets       39.5 GB    40 GB

• 50% faster on average when most reads come from one species

                    Kraken         KrakenHLL
Patient datasets    467 Mbp/min    733 Mbp/min
Test datasets       410 Mbp/min    421 Mbp/min

Table 5: Memory usage (top) and classification speed (bottom)
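
The differences behind Table 5 are simple arithmetic on the listed values. The short sketch below works them out, reading the speed unit as Mbp per minute (an assumption); the helper names are illustrative.

```python
# (Kraken, KrakenHLL) values as listed in Table 5.
MEMORY_GB = {"Patient datasets": (118.0, 118.35), "Test datasets": (39.5, 40.0)}
SPEED_MBP = {"Patient datasets": (467, 733), "Test datasets": (410, 421)}

def extra_memory_gb(kraken, krakenhll):
    """Additional memory used by KrakenHLL."""
    return krakenhll - kraken

def speedup_percent(kraken, krakenhll):
    """Relative gain in throughput, in percent."""
    return (krakenhll / kraken - 1) * 100

for name in MEMORY_GB:
    mem = extra_memory_gb(*MEMORY_GB[name])
    gain = speedup_percent(*SPEED_MBP[name])
    print(f"{name}: +{mem:.2f} GB memory, {gain:.1f}% faster")
```

On these numbers the extra memory stays well under 1 GB for both groups, and the speed gain is concentrated in the patient datasets.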


Conclusion

• K-mer based classification with cardinality estimation
• Unique k-mer counts can help discard false results
• Choice of appropriate threshold depends on application

KrakenHLL gives more confident identifications by reporting unique k-mer count and coverage without any runtime penalty.

Thank you!
Appendix

 HyperLogLog algorithm:
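
The formula for this slide did not survive extraction. As a stand-in, the classic HyperLogLog estimator from Flajolet et al. (reference 6) is given below; it may differ from the variant shown in the talk. Here $M[j]$ is the largest rank stored in register $j$ and $m = 2^{p}$ is the number of registers.

```latex
E = \alpha_m \, m^{2} \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1},
\qquad
\alpha_m \approx \frac{0.7213}{1 + 1.079/m} \quad (m \ge 128),
\qquad
\text{relative error} \approx \frac{1.04}{\sqrt{m}}
```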

Appendix

 F1 score:
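
The formula for this slide did not survive extraction. The standard definition is given below, with TP, FP and FN the numbers of true positive, false positive and false negative identifications.

```latex
\text{precision} = \frac{TP}{TP + FP},
\qquad
\text{recall} = \frac{TP}{TP + FN},
\qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
```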

References

1. Breitwieser, F.P., Lu, J. and Salzberg, S.L. A review of methods and databases for metagenomic classification and assembly. Brief Bioinform 2017.
2. Brown, J.R., Bharucha, T. and Breuer, J. Encephalitis diagnosis using metagenomics: application of next generation sequencing for undiagnosed cases. Journal of Infection 2018.
3. Dadi, T.H., Renard, B.Y., Wieler, L.H., Semmler, T. and Reinert, K. SLIMM: species level identification of microorganisms from metagenomes. PeerJ 2017, 5:e3138.
4. Demo of metagenomic classification using KRAKEN - FROM READS TO RESULTS | Coursera. (n.d.). Retrieved from https://www.coursera.org/lecture/metagenomics/demo-of-metagenomic-classification-using-kraken-7GIn6
5. Ertl, O. New Cardinality Estimation Methods for HyperLogLog Sketches. arXiv:1706.07290, 2017.
6. Flajolet, P., Fusy, É., Gandouet, O. and Meunier, F. HyperLogLog: the analysis of a near optimal cardinality estimation algorithm. In AofA: Analysis of Algorithms; 2007-06-17; Juan les Pins, France. Discrete Mathematics and Theoretical Computer Science; 2007: 137-156.
7. Heule, S., Nunkesser, M. and Hall, A. HyperLogLog in practice. 2013:683.
8. Kraken 2 Manual. (n.d.). Retrieved from http://ccb.jhu.edu/software/kraken/MANUAL.html#classification
9. Mukherjee, S., Huntemann, M., Ivanova, N., Kyrpides, N.C. and Pati, A. Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Stand Genomic Sci 2015, 10:18.
10. Quince, C., Walker, A.W., Simpson, J.T., Loman, N.J. and Segata, N. Shotgun metagenomics, from sampling to analysis. Nat Biotechnol 2017, 35:833-844.
11. Salter, S.J., et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol 2014, 12:87.
12. Wood, D.E. and Salzberg, S.L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15(3):R46. doi:10.1186/gb-2014-15-3-r46
