-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue with k2 add-to-library
#859
Comments
Thank you for reporting this issue. I am currently putting together some changes to help speed up k2 as well as iron out bugs such as these. This issue will be addressed in the next commit. |
Thanksm @ch4rr0. One more performance fix needs to be pushed to k2. Currently kraken2-build is using shell commands for sequence to tax id map. k2 is using python script which is slightly slower. Only this step can be delegated to a shell script which will improve k2 build performance. In all other steps, performance of k2 is at par with kraken2-build. I already have few other changes in my repo. master...AvilPage:kraken2:master |
For kraken-db-builder, I switched back to kraken2-build and using a custom md5 check for now. |
Upon further investigation I discovered that this was not a issue with the k2 script. I added a feature to k2mask which would enable it to continue the masking process where it left off. If the fasta file being masked is the same between runs then this should proceed with issue. If the file gets modified I.e. a sequence that was previously masked got added/removed, k2mask will throw the error you received. This feature should have been disabled by default, but was implicitly enabled due to an error on my part. I have since pushed a commit to disable the feature. I have also pushed some updates to the k2 script to improve download reliability and speed with some other performance improvements added. Please let me know if you find any issues with the updates. |
Thanks for the quick update. I will test it out and let you know. Wouldn't be faster to use the previous shell commands for assigning taxids? This is much faster with kraken2-build. |
I'm using multiprocessing so it should be significantly faster now. My ethos is to reduce reliance on shell commands for portability and stability reasons. |
|
Should be fixed in latest commit |
I realized that the performance improvement applied only to download commands. I have gone ahead and applied to add-to-library also. You should now see significant time savings when using this workflow. Please let me know if you encounter any issues. |
Thanks for the quick patch. This is working fine when running it on a single thread. But when I run it using parallel, it still throws error.
|
Another important factor to consider is "resuming" the process when the process stops in the middle. Since building a decent db takes ~2 days, there are high chances that the process gets interrupted. With kraken-db-builder, if the process stops when files are added to the library, it can almost resume instantly as it stores md5 hash at the original file location. If there are ~500,000 files, Even for build_db, we need to provide an option to incrementally update the db. |
The quality of life improvements that you mentioned are being worked on. Right now I am trying to make sure that downloading is as reliable as possible since that's where most users are having issues. |
I maintain additional scripts just for these. If you are okay with these changes, I can raise separate PRs for these. |
After adding ~3000 files or so,
k2 add-to-library
fails with this errorUsing
kraken2-build --add-to-library
works fine though.The text was updated successfully, but these errors were encountered: