Can I submit a subset of kmers from the reads? #30

Open
richarddurbin opened this issue Feb 23, 2018 · 2 comments

Comments

@richarddurbin

Hello Rayan et al,

I want the unitigs induced by a subset of the kmers from a read set. Is there any way to create a file of kmers that you can read into bcalm2, and then compact into unitigs?

In fact I want to pick out kmers that I think are single copy in a diploid, based on depth. So a simple way to do that would be for you to allow me to give a maximum as well as a minimum copy number for each kmer. I would then set those by looking at a histogram. But ultimately I would like the freedom to correct the kmer counts by GC content (of their parent read pairs), so it would be more general to allow me to pass you a set of kmers.
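To make the histogram step concrete, here is a minimal Python sketch of counting canonical kmers in a set of reads and tabulating how many distinct kmers occur at each depth. It is only an illustration of the idea, not how bcalm2 counts kmers internally; the function names (`canonical`, `count_kmers`, `abundance_histogram`) are hypothetical, and an in-memory `Counter` would only be practical for small inputs.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical kmers of length k, skipping kmers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def abundance_histogram(counts):
    """Map each abundance value to the number of distinct kmers having it."""
    return Counter(counts.values())
```

The histogram returned by `abundance_histogram` is what one would inspect to pick the lower and upper depth bounds around the single-copy peak.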

I guess I could make a fasta file out of them, with one kmer per sequence...

Thanks, Richard

@richarddurbin
Author

By the way, thanks for the algorithm and tool. It is very elegant.

@rchikhi
Member

rchikhi commented Feb 23, 2018

Hi Richard, and thanks for the kind words!

For now, you can set minimum and maximum kmer abundance thresholds with the options -abundance-min A -abundance-max B. Only kmers whose abundance x satisfies A<=x<=B are kept.
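The A<=x<=B rule can be illustrated with a short Python sketch that applies the same inclusive filter to a table of kmer counts. This mirrors the semantics described above and is not taken from bcalm's source; `select_by_abundance` and its parameter names are hypothetical.

```python
def select_by_abundance(counts, abundance_min, abundance_max):
    """Keep kmers whose count x satisfies abundance_min <= x <= abundance_max.

    Both bounds are inclusive, matching the A<=x<=B rule described above.
    """
    if abundance_min > abundance_max:
        raise ValueError("abundance_min must not exceed abundance_max")
    return {
        kmer: count
        for kmer, count in counts.items()
        if abundance_min <= count <= abundance_max
    }
```

For example, with bounds 10 and 30, a kmer seen exactly 10 or exactly 30 times is kept, while one seen 9 or 31 times is dropped.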

Otherwise, the input formats of Bcalm are fasta/fastq reads, or a custom HDF5 file that contains counted k-mers. That format isn't easy to create from scratch, but we could perhaps write a custom program for it. In that format, Bcalm expects kmers to be partitioned according to their minimizer.

Giving as input a fasta file with one kmer per sequence would be an acceptable solution, bearing in mind that abundance would be lost (unless the kmer is repeated as many times as its abundance, which isn't very elegant).
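A minimal Python sketch of that workaround: write each selected kmer as its own fasta record, optionally repeating it as many times as its abundance. The function name `write_kmers_fasta` and the `kmer_N` header scheme are assumptions made for illustration; nothing here is part of Bcalm.

```python
def write_kmers_fasta(kmers, out, abundances=None):
    """Write one fasta record per kmer to the open text handle `out`.

    If `abundances` (a mapping from kmer to count) is given, each kmer is
    written that many times so its abundance can be recovered by counting.
    Returns the number of records written.
    """
    written = 0
    for kmer in kmers:
        copies = 1 if abundances is None else abundances[kmer]
        for _ in range(copies):
            # Header names are arbitrary; only the sequence line matters here.
            out.write(f">kmer_{written}\n{kmer}\n")
            written += 1
    return written
```

If each kmer appears only once in such a file, the minimum abundance threshold would presumably need to be 1 so that the kmers are not filtered out, and the kmer size given to Bcalm would need to match the length of the kmers in the file. Both points are worth checking against the tool's defaults.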

What volume of kmers do you wish to give to Bcalm: tens of millions, billions?
Also, what format are the counted k-mers in right now?
