Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with k2 add-to-library #859

Open
ChillarAnand opened this issue Jul 31, 2024 · 13 comments
Open

Issue with k2 add-to-library #859

ChillarAnand opened this issue Jul 31, 2024 · 13 comments

Comments

@ChillarAnand
Copy link

After adding ~3000 files or so, k2 add-to-library fails with this error

+ k2 add-to-library --db k2mdb --file /avilpage.com/anand/GCF_001347155.1_7054_6_6_genomic.fna
Calculating MD5 sum for GCF_001347155.1_7054_6_6_genomic.fna.
MD5 sum is 69bd16c04c88ce08b57bb515e9d1c4cc
Generating prelim_map.txt...done.
Masking low-complexity regions of added library.Could not file masked sequence in input
Please delete masked_sequences.txt before retrying.
Traceback (most recent call last):
  File "/usr/local/bin/k2", line 2337, in <module>
    k2_main()
  File "/usr/local/bin/k2", line 2266, in k2_main
    add_to_library(args)
  File "/usr/local/bin/k2", line 558, in add_to_library
    mask_files(
  File "/usr/local/bin/k2", line 518, in mask_files
    masker(fin, i + 1 == number_of_files)
  File "/usr/local/bin/k2", line 494, in masker
    shutil.copyfileobj(input_file, p.stdin)
  File "/usr/lib/python3.10/shutil.py", line 198, in copyfileobj
    fdst_write(buf)
BrokenPipeError: [Errno 32] Broken pipe

Using kraken2-build --add-to-library works fine though.

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Aug 2, 2024

Thank you for reporting this issue. I am currently putting together some changes to help speed up k2 as well as iron out bugs such as these. This issue will be addressed in the next commit.

@ChillarAnand
Copy link
Author

Thanksm @ch4rr0.

One more performance fix needs to be pushed to k2.

Currently kraken2-build is using shell commands for sequence to tax id map.

k2 is using python script which is slightly slower. Only this step can be delegated to a shell script which will improve k2 build performance.

In all other steps, performance of k2 is at par with kraken2-build.

I already have few other changes in my repo. master...AvilPage:kraken2:master

@ChillarAnand
Copy link
Author

For kraken-db-builder, I switched back to kraken2-build and using a custom md5 check for now.

https://github.com/AvilPage/kraken-db-builder

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Aug 7, 2024

Upon further investigation I discovered that this was not a issue with the k2 script.

I added a feature to k2mask which would enable it to continue the masking process where it left off. If the fasta file being masked is the same between runs then this should proceed with issue. If the file gets modified I.e. a sequence that was previously masked got added/removed, k2mask will throw the error you received. This feature should have been disabled by default, but was implicitly enabled due to an error on my part. I have since pushed a commit to disable the feature.

I have also pushed some updates to the k2 script to improve download reliability and speed with some other performance improvements added. Please let me know if you find any issues with the updates.

@ChillarAnand
Copy link
Author

Thanks for the quick update. I will test it out and let you know.

Wouldn't be faster to use the previous shell commands for assigning taxids? This is much faster with kraken2-build.

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Aug 7, 2024

I'm using multiprocessing so it should be significantly faster now. My ethos is to reduce reliance on shell commands for portability and stability reasons.

@ChillarAnand
Copy link
Author

❯ k2 add-to-library --db k2_protozoa --files /home/anand/.cache/kdb/refseq/protozoa/GCF_900002385.2/GCF_900002385.2_GCA_900002385_genomic.fna
Calculating MD5 sum for /home/anand/.cache/kdb/refseq/protozoa/GCF_900002385.2/GCF_900002385.2_GCA_900002385_genomic.fna.
MD5 sum of /home/anand/.cache/kdb/refseq/protozoa/GCF_900002385.2/GCF_900002385.2_GCA_900002385_genomic.fna is 1ca8e6f7ee3b5287443a578e40d0fe57
Generating prelim_map.txt.
Finished generating prelim_map.txt.
Traceback (most recent call last):
  File "/usr/local/bin/k2", line 2505, in <module>
    k2_main()
  File "/usr/local/bin/k2", line 2428, in k2_main
    add_to_library(args)
  File "/usr/local/bin/k2", line 559, in add_to_library
    [destination], destination + ".masked", protein=args.protein, threads=args.threads
AttributeError: 'Namespace' object has no attribute 'threads'

k2 is failing with error now.

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Aug 7, 2024

Should be fixed in latest commit

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Aug 7, 2024

I realized that the performance improvement applied only to download commands. I have gone ahead and applied to add-to-library also. You should now see significant time savings when using this workflow.

Please let me know if you encounter any issues.

@ChillarAnand
Copy link
Author

Thanks for the quick patch.

This is working fine when running it on a single thread.

But when I run it using parallel, it still throws error.

$ find /AvilPage/refseq -name '*.fna' | head -n 10000 | parallel -j96 'k2 add-to-library --db k2_test --files {}'
Masking low-complexity regions of added library.Could not file masked sequence in input
Please delete masked_sequences.txt before retrying.
Traceback (most recent call last):
  File "/usr/local/bin/k2", line 2346, in <module>
    k2_main()
  File "/usr/local/bin/k2", line 2275, in k2_main
    add_to_library(args)
  File "/usr/local/bin/k2", line 557, in add_to_library
    mask_files(
  File "/usr/local/bin/k2", line 518, in mask_files
    masker(fin, i + 1 == number_of_files)
  File "/usr/local/bin/k2", line 494, in masker
    shutil.copyfileobj(input_file, p.stdin)
  File "/usr/lib/python3.10/shutil.py", line 198, in copyfileobj
    fdst_write(buf)
BrokenPipeError: [Errno 32] Broken pipe

@ChillarAnand
Copy link
Author

Another important factor to consider is "resuming" the process when the process stops in the middle. Since building a decent db takes ~2 days, there are high chances that the process gets interrupted.

With kraken-db-builder, if the process stops when files are added to the library, it can almost resume instantly as it stores md5 hash at the original file location.

If there are ~500,000 files, k2 takes ~8 hours just to resume at the midpoint. This needs to be improved as well.

Even for build_db, we need to provide an option to incrementally update the db.

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Aug 9, 2024

The quality of life improvements that you mentioned are being worked on. Right now I am trying to make sure that downloading is as reliable as possible since that's where most users are having issues.

@ChillarAnand
Copy link
Author

I maintain additional scripts just for these. If you are okay with these changes, I can raise separate PRs for these.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants