High Memory Usage (1TB) for Creating a Kraken2 Database with 400GB of Genomes #887

florian-labadie opened this issue Nov 7, 2024 · 1 comment

Hello,

I am currently facing an issue with extremely high memory usage while building a Kraken2 database with around 400GB of genomes. When using a 31-mer size, the database construction seems to require up to 1TB of memory, which exceeds the capacity of my system.

I would like to know if there are any solutions to reduce the memory demand while maintaining the classification accuracy. Here are some approaches I am considering:

  • Optimizing database structure: Are there any Kraken2 settings or configurations that could help lower the memory requirements?

  • Fragmenting the database: Would it be recommended or possible to divide the database into smaller parts (e.g., by taxon) and process these parts in parallel to reduce memory requirements at each step?

If anyone has encountered a similar issue, I would greatly appreciate any suggestions or recommendations. Any shared experiences would be incredibly helpful.

Thank you in advance for your time and for the great work on Kraken2!

DexinBo commented Nov 7, 2024

According to the manual, you can reduce memory usage by increasing --kmer-len relative to --minimizer-len, and/or by capping the hash table size with --max-db-size.

MiniKraken: At present, users with low-memory computing environments can replicate the "MiniKraken" functionality of Kraken 1 in two ways: first, by increasing the value of k with respect to l (using the --kmer-len and --minimizer-len options to kraken2-build); and secondly, through downsampling of minimizers (from both the database and query sequences) using a hash function. This second option is performed if the --max-db-size option to kraken2-build is used; however, the two options are not mutually exclusive. In a difference from Kraken 1, Kraken 2 does not require building a full database and then shrinking it to obtain a reduced database.

By default, the values of k and l are 35 and 31, respectively (or 15 and 12 for protein databases). These values can be explicitly set with the --kmer-len and --minimizer-len options, however. Note that the minimizer length must be no more than 31 for nucleotide databases, and 15 for protein databases. Additionally, the minimizer length l must be no more than the k-mer length. There is no upper bound on the value of k, but sequences less than k bp in length cannot be classified.
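
For concreteness, here is a rough sketch of what a reduced-memory build could look like. It assumes a custom database directory (called my_custom_db here) whose taxonomy and library have already been set up with kraken2-build --download-taxonomy and --add-to-library; the specific values for k, l, the byte cap, and the thread count are placeholders to adapt, not recommendations:

    # Approach 1: increase k relative to l (the MiniKraken-style reduction
    # described in the excerpt above). Per the manual, reads shorter than
    # k bp cannot be classified, so do not push k past your read lengths.
    kraken2-build --build --db my_custom_db \
        --kmer-len 45 --minimizer-len 31 \
        --threads 16

    # Approach 2: cap the hash table size directly. --max-db-size is given
    # in bytes (~250 GB here) and downsamples minimizers; it can be combined
    # with approach 1.
    kraken2-build --build --db my_custom_db \
        --max-db-size 250000000000 \
        --threads 16

Both approaches shrink the hash table at some likely cost to classification sensitivity, so it is worth validating the reduced database on reads of known composition before relying on it.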
