
Neverending clone filtering #9

Open
JGPunier opened this issue Oct 29, 2021 · 9 comments
@JGPunier

Hello,
The pipeline gets stuck at the "Stacks - clone_filter" step (that is, at the very first step): after a few hours it ends up using all of my RAM (60 GB!) and a large part of my swap (also 60 GB), which freezes my computer (no change in the pipeline status after 12 days).
This is quite surprising, since I routinely use Stacks to clone-filter large (~33 GB) double-digest fastq files such as the ones I am struggling with now, and I have never encountered such a memory problem.
Do you have any idea why it fails here? Is a library or something outdated?
Thank you in advance for your kind help,
Jérôme

@MaartenPostuma
Collaborator

MaartenPostuma commented Nov 3, 2021

Hi Jerome,
The version of Stacks that was used was relatively old (Stacks 2.0). I've included a newer version of Stacks in the master branch of the pipeline.
You can switch to this branch with git checkout master.

However, clone_filter is known to be relatively resource heavy, so I'm not completely sure that this will solve it for your system.
Kind regards,
Maarten

@MWSchmid

MWSchmid commented Nov 9, 2021

Hi Jérôme and Maarten

The clone filter in Stacks is memory hungry in general. Here's an alternative:

Paper

Jar

Run with:

java -Xmx32g -jar NgsReadsTreatment_v1.3.jar prefix_R1.fastq prefix_R2.fastq 32

And remove the wobble bases with fastp:

fastp --trim_front1 3 --trim_front2 3 --disable_adapter_trimming --disable_trim_poly_g --disable_quality_filtering --in1 prefix_R1_1_trated.fastq --in2 prefix_R2_2_trated.fastq --out1 prefix_R1.deRepNoWobble.fq.gz --out2 prefix_R2.deRepNoWobble.fq.gz

Then modify the pipeline to enter at the right step.

This worked with ~1.5 billion reads at once, in reasonable time and with a tiny amount of RAM (I think it was less than 10 GB during the whole process).

Best,

Marc

@IsoldevR

IsoldevR commented Nov 9, 2021

Would be great if this could be implemented in a new version of the pipeline! We want to eventually run the pipeline for > 500 samples!

@MaartenPostuma
Collaborator

I will have a look into this.
For now, if you want to feed the reads created by @MWSchmid's dereplication into the pipeline, you can set the Wobble_R1 and Wobble_R2 columns in the barcode file to 0. That way the pipeline will skip the clone_filter step entirely and move directly to the process_radtags step. You can use prefix_R1.deRepNoWobble.fq.gz and prefix_R2.deRepNoWobble.fq.gz as the input reads in the config.
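As a sketch of that edit: only the Wobble_R1/Wobble_R2 column names come from the pipeline's barcode file; the rest of the layout below is a made-up example, so adjust it to your actual file. The columns can be zeroed with awk:

```shell
# Hypothetical barcode file; only Wobble_R1/Wobble_R2 are real column names,
# the other columns are stand-ins for illustration.
printf 'Sample\tBarcode_R1\tBarcode_R2\tWobble_R1\tWobble_R2\n' >  barcodes.tsv
printf 'S1\tAACT\tGGTA\t3\t3\n'                                 >> barcodes.tsv

# Find the wobble columns by header name and set them to 0,
# so the pipeline skips clone_filter.
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { for (i = 1; i <= NF; i++) { if ($i == "Wobble_R1") w1 = i;
                                           if ($i == "Wobble_R2") w2 = i }
               print; next }
     { $w1 = 0; $w2 = 0; print }' barcodes.tsv > barcodes_noclone.tsv
```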

However, note that the pipeline is not well suited to running multiple lanes at the same time. Fortunately, Snakemake is quite flexible (or easy to fool).

I would recommend running the demultiplexing step separately for each lane, using the following command for each set of raw reads with its specific barcode file:
snakemake -j {output}/output_demultiplex/barcode_stacks.tsv
where {output} is the output directory specified in the config file.

If there are no overlapping individual names, all of the files in output_demultiplex can be copied into a new {outputAll}/output_demultiplex/.
If the config then specifies {outputAll}, you can run the denovo creation/mapping/calling from there.
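A rough sketch of that pooling step; the lane names, file names, and directory layout below are hypothetical stand-ins, and the touch lines substitute for the per-lane snakemake runs described above:

```shell
# Stand-ins for per-lane demultiplexing output; in reality these files come
# from one snakemake run per lane, each with its own barcode file and
# output directory.
mkdir -p output_lane1/output_demultiplex output_lane2/output_demultiplex
touch output_lane1/output_demultiplex/ind1.1.fq.gz
touch output_lane2/output_demultiplex/ind2.1.fq.gz

# Pool everything into one directory, assuming no individual names overlap.
mkdir -p outputAll/output_demultiplex
for lane in lane1 lane2; do
    cp output_${lane}/output_demultiplex/*.fq.gz outputAll/output_demultiplex/
done
```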

@MWSchmid

Thanks Maarten. Though, unless one re-uses barcodes on different lanes (for different samples), one can also just merge all the fastqs, right?

(And if the same sample+barcode combination appears on multiple lanes, one just needs to keep the lanes separate while dereplicating, right?)

@IsoldevR

Thanks Maarten! We do re-use barcodes in our project, though, and I suspect others would as well (e.g. we have barcodes for 144 samples at the lab, so a project with more samples will definitely need to reuse combinations).

@MaartenPostuma
Collaborator

> Though, unless one re-uses barcodes on different lanes (for different samples), you can also just merge all the fastqs, right?

Correct.

> (if the same sample+barcode appears on multiple lanes one just needs to keep it separate while dereplicating, or?)

I think this should work, but I haven't had this kind of data.
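For the no-reused-barcodes case, merging really is just concatenation, since concatenated gzip members form a valid gzip file. The tiny files below are hypothetical stand-ins for real lane fastqs:

```shell
# Stand-ins for two lanes' forward reads (one fastq record each).
printf '@r1\nACGT\n+\nIIII\n' | gzip > lane1_R1.fastq.gz
printf '@r2\nTGCA\n+\nIIII\n' | gzip > lane2_R1.fastq.gz

# A plain cat of the gzipped lane files yields one valid merged file.
cat lane1_R1.fastq.gz lane2_R1.fastq.gz > all_R1.fastq.gz
gunzip -c all_R1.fastq.gz | wc -l   # 8 lines = 2 fastq records
```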

@IsoldevR

Exactly.

@JGPunier
Author

Hello Maarten,
Unfortunately, the new branch you proposed did not resolve my issue.
I decided to process my raw files (33 GB each) independently with Stacks' clone_filter outside the pipeline (which worked well; I did not try @MWSchmid's solution) and then tried to feed the pipeline the filtered files (11 GB each). I expected the process to be fast, or at least faster, but it got stuck again, for a whole week. Maybe this was not a smart move.
Anyway, I finally followed your advice (setting the Wobble_R1 and Wobble_R2 columns in the barcode file to 0): process_radtags is currently running.
I hope it will work fine now.
