Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kraken 2 finds spurious minimizers after ambiguous regions #881

Open
jtnystrom opened this issue Sep 17, 2024 · 0 comments
Open

Kraken 2 finds spurious minimizers after ambiguous regions #881

jtnystrom opened this issue Sep 17, 2024 · 0 comments

Comments

@jtnystrom
Copy link

jtnystrom commented Sep 17, 2024

Hi all,

I wanted to report an issue that I'm not sure makes sense to fix, but that may be interesting for researchers and users to know about.

Following an ambiguous region, kraken2-build will find a minimizer in a k-1 length window. For example, for k=35, the following k-mer will generate a minimizer:

xAATATTTGATGTCCATTCAATAAAATATATATGT

Because of the leading x, there are only 34 valid nucleotides here but a minimizer will still be inserted. The issue does not apply if x is at the end.

In the standard library, there are numerous masked regions and to the best of my knowledge, this "bug" increases the number of minimizers by about 0.5% - one for each masked region - and also changes classifications (for the better). If I'm right about this finding, the true value of k for Kraken 2 with the default parameter k=35 should be somewhere between 35 and 34.

I believe that the fix for this would be to change mmscanner.h as follows, from:

  bool is_ambiguous() const {
    return (queue_pos_ < k_ - l_) || (!! last_ambig_);
  }

to

  bool is_ambiguous() const {
    return (queue_pos_ < k_ - l_ + 1) || (!! last_ambig_);
  }

This is because the return at the end of NextMinimizer is post incrementing queue_pos, so the former condition is no longer valid. (However, a special case might need to be added in is_ambiguous for _k == _l as that has its own logic to bypass queue management and returns prior to the increment.)

Since this is the behaviour people are used to and presumably what the default parameters have been optimised for, I'm not sure that you actually want to address this bug, but I wanted to let others know just in case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant