
Neverending clone filtering #9

Open
JGPunier opened this issue Oct 29, 2021 · 9 comments
@JGPunier

Hello,
The pipeline gets stuck at the "Stacks - clone_filter" step (that is, at the very first step): after a few hours it ends up using all of my RAM (60 GB!) and a large part of my swap (also 60 GB), which freezes my computer (no change in the pipeline status after 12 days).
This is quite surprising, since I routinely use Stacks to clone-filter large (~33 GB) double-digest fastq files such as the ones I am struggling with now, and I have never encountered such a memory problem.
Do you have any idea why it fails here? Is a library or something outdated?
Thank you in advance for your kind help,
Jérôme

@MaartenPostuma
Collaborator

MaartenPostuma commented Nov 3, 2021

Hi Jerome,
The version of Stacks that was used was relatively old (Stacks 2.0). I've included a newer version of Stacks in the master branch of the pipeline.
You can switch to this branch with git checkout master.

However, clone_filter is known to be relatively resource heavy, so I'm not completely sure that this will solve it for your system.
Kind regards,
Maarten

@MWSchmid

MWSchmid commented Nov 9, 2021

Hi Jérôme and Maarten

The clone filter in Stacks is memory hungry in general. Here's an alternative:

Paper

Jar

Run with:

java -Xmx32g -jar NgsReadsTreatment_v1.3.jar prefix_R1.fastq prefix_R2.fastq 32

And remove the wobble bases with fastp:

fastp --trim_front1 3 --trim_front2 3 --disable_adapter_trimming --disable_trim_poly_g --disable_quality_filtering --in1 prefix_R1_1_trated.fastq --in2 prefix_R2_2_trated.fastq --out1 prefix_R1.deRepNoWobble.fq.gz --out2 prefix_R2.deRepNoWobble.fq.gz

Then modify the pipeline to enter at the right step.

This worked with ~1.5 billion reads at once, in reasonable time and with a tiny amount of RAM (I think it was less than 10 GB during the whole process).

Best,

Marc

@IsoldevR

IsoldevR commented Nov 9, 2021

Would be great if this could be implemented in a new version of the pipeline! We want to eventually run the pipeline for > 500 samples!

@MaartenPostuma
Collaborator

I will have a look into this.
For now, if you want to feed the reads created by @MWSchmid's dereplication into the pipeline, you can set the Wobble_R1 and Wobble_R2 columns in the barcode file to 0. That way the pipeline will skip the clone_filter step entirely and move directly to the process_radtags step. You can use prefix_R1.deRepNoWobble.fq.gz and prefix_R2.deRepNoWobble.fq.gz as the input reads in the config.
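As a sketch of that edit: only the Wobble_R1/Wobble_R2 column names come from the pipeline's barcode file; the rest of the layout below is a made-up example, so adjust it to your actual file. The columns can be zeroed with awk:

```shell
# Hypothetical barcode file; only Wobble_R1/Wobble_R2 are real column names,
# the other columns are stand-ins for illustration.
printf 'Sample\tBarcode_R1\tBarcode_R2\tWobble_R1\tWobble_R2\n' >  barcodes.tsv
printf 'S1\tAACT\tGGTA\t3\t3\n'                                 >> barcodes.tsv

# Find the wobble columns by header name and set them to 0,
# so the pipeline skips clone_filter.
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { for (i = 1; i <= NF; i++) { if ($i == "Wobble_R1") w1 = i;
                                           if ($i == "Wobble_R2") w2 = i }
               print; next }
     { $w1 = 0; $w2 = 0; print }' barcodes.tsv > barcodes_noclone.tsv
```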

However, note that the pipeline is not well suited to running multiple lanes at the same time. Fortunately, Snakemake is quite flexible (or easy to fool).

I would recommend running the demultiplexing step separately for each lane, using the following command for each set of raw reads with its specific barcode file:
snakemake -j {output}/output_demultiplex/barcode_stacks.tsv
where {output} is the output directory specified in the config file.

If there are no overlapping individual names, all of the files in output_demultiplex can be copied into a new {outputAll}/output_demultiplex/.
If the config then specifies {outputAll}, you can run the denovo creation/mapping/calling from there.
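A rough sketch of that pooling step; the lane names, file names, and directory layout below are hypothetical stand-ins, and the touch lines substitute for the per-lane snakemake runs described above:

```shell
# Stand-ins for per-lane demultiplexing output; in reality these files come
# from one snakemake run per lane, each with its own barcode file and
# output directory.
mkdir -p output_lane1/output_demultiplex output_lane2/output_demultiplex
touch output_lane1/output_demultiplex/ind1.1.fq.gz
touch output_lane2/output_demultiplex/ind2.1.fq.gz

# Pool everything into one directory, assuming no individual names overlap.
mkdir -p outputAll/output_demultiplex
for lane in lane1 lane2; do
    cp output_${lane}/output_demultiplex/*.fq.gz outputAll/output_demultiplex/
done
```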

@MWSchmid

Thanks Maarten. Though, unless one re-uses barcodes on different lanes (for different samples), one can also just merge all the fastqs, right?

(And if the same sample+barcode combination appears on multiple lanes, one just needs to keep the lanes separate while dereplicating, right?)

@IsoldevR

Thanks Maarten! We do re-use barcodes in our project, though, and I suspect others would as well (e.g. we have barcodes for 144 samples at the lab, so a project with more samples will definitely need to reuse combinations).

@MaartenPostuma
Collaborator

> Though, unless one re-uses barcodes on different lanes (for different samples), you can also just merge all the fastqs, right?

Correct.

> (if the same sample+barcode appears on multiple lanes one just needs to keep it separate while dereplicating, or?)

I think this should work, but I haven't had this kind of data.
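For the no-reused-barcodes case, merging really is just concatenation, since concatenated gzip members form a valid gzip file. The tiny files below are hypothetical stand-ins for real lane fastqs:

```shell
# Stand-ins for two lanes' forward reads (one fastq record each).
printf '@r1\nACGT\n+\nIIII\n' | gzip > lane1_R1.fastq.gz
printf '@r2\nTGCA\n+\nIIII\n' | gzip > lane2_R1.fastq.gz

# A plain cat of the gzipped lane files yields one valid merged file.
cat lane1_R1.fastq.gz lane2_R1.fastq.gz > all_R1.fastq.gz
gunzip -c all_R1.fastq.gz | wc -l   # 8 lines = 2 fastq records
```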

@IsoldevR

Exactly.

@JGPunier
Author

Hello Maarten,
Unfortunately, the new branch you proposed did not resolve my issue.
I decided to process my raw files (33 GB each) independently with Stacks' clone_filter outside the pipeline (which worked well; I did not try @MWSchmid's solution) and then tried to feed the pipeline the filtered files (11 GB each). I expected the process to be fast, or at least faster, but it got stuck again, for a whole week. Maybe this was not a smart move.
Anyway, I finally followed your advice (setting the Wobble_R1 and Wobble_R2 columns in the barcode file to 0): process_radtags is currently running.
I hope it will work fine now.
