High Memory Usage (1TB) for Creating a Kraken2 Database with 400GB of Genomes #887

florian-labadie opened this issue Nov 7, 2024 · 1 comment

Hello,

I am currently facing an issue with extremely high memory usage while building a Kraken2 database with around 400GB of genomes. When using a 31-mer size, the database construction seems to require up to 1TB of memory, which exceeds the capacity of my system.

I would like to know if there are any solutions to reduce the memory demand while maintaining the classification accuracy. Here are some approaches I am considering:

  • Optimizing database structure: Are there any Kraken2 settings or configurations that could help lower the memory requirements?

  • Fragmenting the database: Would it be recommended or possible to divide the database into smaller parts (e.g., by taxon) and process these parts in parallel to reduce memory requirements at each step?

If anyone has encountered a similar issue, I would greatly appreciate any suggestions or recommendations. Any shared experiences would be incredibly helpful.

Thank you in advance for your time and for the great work on Kraken2!

DexinBo commented Nov 7, 2024

According to the manual, you can reduce memory usage by increasing --kmer-len relative to --minimizer-len, and/or by capping the hash table size with --max-db-size.

MiniKraken: At present, users with low-memory computing environments can replicate the "MiniKraken" functionality of Kraken 1 in two ways: first, by increasing the value of k with respect to l (using the --kmer-len and --minimizer-len options to kraken2-build); and secondly, through downsampling of minimizers (from both the database and query sequences) using a hash function. This second option is performed if the --max-db-size option to kraken2-build is used; however, the two options are not mutually exclusive. In a difference from Kraken 1, Kraken 2 does not require building a full database and then shrinking it to obtain a reduced database.

By default, the values of k and l are 35 and 31, respectively (or 15 and 12 for protein databases). These values can be explicitly set with the --kmer-len and --minimizer-len options, however. Note that the minimizer length must be no more than 31 for nucleotide databases, and 15 for protein databases. Additionally, the minimizer length l must be no more than the k-mer length. There is no upper bound on the value of k, but sequences less than k bp in length cannot be classified.
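
For concreteness, here is a rough sketch of what a reduced-memory build could look like. It assumes a custom database directory (called my_custom_db here) whose taxonomy and library have already been set up with kraken2-build --download-taxonomy and --add-to-library; the specific values for k, l, the byte cap, and the thread count are placeholders to adapt, not recommendations:

    # Approach 1: increase k relative to l (the MiniKraken-style reduction
    # described in the excerpt above). Per the manual, reads shorter than
    # k bp cannot be classified, so do not push k past your read lengths.
    kraken2-build --build --db my_custom_db \
        --kmer-len 45 --minimizer-len 31 \
        --threads 16

    # Approach 2: cap the hash table size directly. --max-db-size is given
    # in bytes (~250 GB here) and downsamples minimizers; it can be combined
    # with approach 1.
    kraken2-build --build --db my_custom_db \
        --max-db-size 250000000000 \
        --threads 16

Both approaches shrink the hash table at some likely cost to classification sensitivity, so it is worth validating the reduced database on reads of known composition before relying on it.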
