I wanted to report an issue that I'm not sure makes sense to fix, but that may be interesting for researchers and users to know about.
Following an ambiguous region, kraken2-build will find a minimizer in a window of only k-1 valid nucleotides. For example, with k=35, the following k-mer will generate a minimizer:
xAATATTTGATGTCCATTCAATAAAATATATATGT
Because of the leading x, there are only 34 valid nucleotides here, but a minimizer will still be inserted. The issue does not occur if the x is at the end.
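To make the counting concrete, here is a minimal Python sketch of what I would expect from a strict rule where every k-mer must consist of k unambiguous bases. This is not Kraken 2's code; the function name `strict_kmer_starts` and the run-length approach are my own, purely for illustration.

```python
VALID = set("ACGT")

def strict_kmer_starts(seq, k):
    """Return start positions of all length-k windows made only of A/C/G/T."""
    starts = []
    run = 0  # length of the current run of unambiguous bases
    for i, base in enumerate(seq.upper()):
        # Any character outside A/C/G/T (here the leading x) resets the run.
        run = run + 1 if base in VALID else 0
        if run >= k:
            starts.append(i - k + 1)
    return starts

example = "xAATATTTGATGTCCATTCAATAAAATATATATGT"
print(strict_kmer_starts(example, 35))  # no fully valid 35-mer
print(strict_kmer_starts(example, 34))  # a single valid 34-mer after the x
```

Under this strict rule the example contains no valid 35-mer, only one valid 34-mer starting right after the x, so no minimizer would be expected for k=35.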
In the standard library there are numerous masked regions and, to the best of my knowledge, this "bug" increases the number of minimizers by about 0.5% (one for each masked region) and also changes classifications (for the better). If I'm right about this finding, the true value of k for Kraken 2 with the default parameter k=35 is somewhere between 34 and 35.
I believe that the fix for this would be to change mmscanner.h as follows, from:
This is because the return at the end of NextMinimizer post-increments queue_pos, so the former condition is no longer valid. (However, a special case might need to be added in is_ambiguous for _k == _l, as that path has its own logic to bypass queue management and returns prior to the increment.)
Since this is the behaviour people are used to and presumably what the default parameters have been optimised for, I'm not sure that you actually want to address this bug, but I wanted to let others know just in case.