Refactor gatk germline (resolves #280) #335

jpfeil · 2016-06-30T22:37:15Z

jvivian · 2016-07-15T07:00:43Z

@jpfeil — Ping me when tests pass and I'll review

hannes-ucsc · 2016-07-18T17:26:05Z

@jpfeil do you want us to review this? If not, please set the "needs work" label.

Come review time, I will also ask for the commit titles to all be prefixed in such a way that they connote the pipeline they refer to.

fnothaft · 2016-08-03T04:20:49Z

WRT 14a268a, alignment with bwa is already implemented in the end-to-end germline pipeline --> https://github.com/BD2KGenomics/toil-scripts/blob/master/src/toil_scripts/adam_gatk_pipeline/align_and_call.py#L98, https://github.com/BD2KGenomics/toil-scripts/blob/master/src/toil_scripts/adam_gatk_pipeline/align_and_call.py#L183.

fnothaft · 2016-08-04T18:25:05Z

@jpfeil Just dropping a reminder to add a list with the required/complete features to this PR.

jpfeil · 2016-08-04T19:54:32Z

Support downloading files from synapse
Add BWA alignment
Support preprocessing only
Refactor pre-processing using common library functions
Add hard filtering for runs with fewer than 30 samples
Support cohort-level VQSR
Annotate clinical variants with Oncotator

fnothaft · 2016-08-04T19:55:41Z

@jpfeil please make the list more specific. What are the changes you are making? What features were requested to be added, removed, changed, etc. E.g., instead of: "Hard filtering", say "Hard filter for sites covered by fewer than 10 reads". Specifics!

fnothaft · 2016-08-05T15:38:06Z

So this PR greatly expanded in scope from what we discussed at the start: eliminate dupe code between germline and somatic pipelines, focus on validation. We need this to merge for the ADAM recompute, so I'd like to triage the Oncotator/annotation and Synapse bits out to a follow-on PR, and set a deadline for validating this code and merging the PR. Specifically, I'd like to propose freezing the features, wrapping up development by Tuesday of next week, and shooting to validate the pipeline on a single sample by the Tuesday after.

I think right now, if we triage the annotation/Synapse work out to another PR, that leaves joint calling as the last feature. At a cursory glance, that code is implemented already in this PR. I am concerned about the implementation—the GVCFs we are using for SGDP are O(25GB), so joint calling on a single node is out of the question—@jpfeil, what is the plan for addressing that issue?

I'm proposing the following action items:

@fnothaft: make review pass by end-of-day today
@jpfeil: open ticket to triage out annotation/Synapse work
@jpfeil: write up validation strategy and timeline on this PR (can you do this in time for Monday scrum?)
@jpfeil: explain joint calling architecture and how the joint calling process will get around excessive data volumes
@fnothaft @jvivian @jpfeil: go through together and audit resource requirements passed to GATK stages
@jpfeil: factor out variant filtration calls to toil_scripts.tools library

Does this proposal seem reasonable? CC @briandoconnor for review/work planning.

jpfeil · 2016-08-05T16:46:46Z

Okay that sounds like a good plan. I will add a modified job function that handles running a single sample through the pipeline with VQSR.

fnothaft · 2016-08-05T16:57:11Z

Shouldn't the single sample case fall under the hard filtering? Also, I thought we had an override that allows running <30 samples through VQSR.
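For reference, the decision being discussed could be sketched roughly as below. This is a hypothetical helper, not code from this PR: the function name, the `force_vqsr` flag, and the returned strings are assumptions for illustration. Only the 30-sample threshold and the idea of an override come from this thread.

```python
# Threshold taken from the feature list above ("fewer than 30 samples").
MIN_SAMPLES_FOR_VQSR = 30


def choose_filter_strategy(n_samples, force_vqsr=False):
    """
    Pick a variant filtering strategy for a run (hypothetical helper).

    :param int n_samples: number of samples in the run
    :param bool force_vqsr: assumed override that runs VQSR on small cohorts
    :return: 'vqsr' or 'hard_filter'
    """
    if n_samples < 1:
        raise ValueError('Need at least one sample, got %d' % n_samples)
    # The override wins; otherwise small cohorts fall back to hard filtering.
    if force_vqsr or n_samples >= MIN_SAMPLES_FOR_VQSR:
        return 'vqsr'
    return 'hard_filter'
```

Under this sketch a single sample is hard filtered by default and only reaches VQSR when the override is set explicitly.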

fnothaft · 2016-08-05T17:51:37Z

src/toil_scripts/exome_variant_pipeline/exome_variant_pipeline.py

@@ -108,10 +108,10 @@ def preprocessing_declaration(job, config):
        job.fileStore.logToMaster('Ran preprocessing: ' + config.uuid)
        disk = '1G' if config.ci_test else '20G'


Not a change here, but I've seen out-of-disk (OOD) failures with the preprocessing pipeline in the wild. I'll create a note for you, me, and @jvivian to go through this and audit the disk requirements before merging.

jpfeil · 2016-08-05T18:05:54Z

I can add a check to see if there are enough variants in the sample before running VQSR. Can you send me a description of your samples?

fnothaft · 2016-08-05T18:25:48Z

src/toil_scripts/gatk_germline/germline.py


+    elif {ext1, ext2} & {'.bam'}:


Add SAM/CRAM too?
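One way the check could be widened is sketched below. The helper name and the way extensions are extracted are assumptions; only the `'.bam'` test is in the diff, and `'.sam'` and `'.cram'` are the additions suggested here. Whether the downstream steps accept SAM or CRAM input would still need to be verified.

```python
import os

# '.bam' is what the diff checks today; '.sam' and '.cram' are the
# additions suggested in this review comment.
ALIGNED_EXTENSIONS = {'.bam', '.sam', '.cram'}


def has_aligned_input(url1, url2=None):
    """
    Return True if either URL ends in an aligned-read extension.

    Hypothetical helper: extensions are compared case-insensitively and
    a missing second URL is ignored.
    """
    exts = {os.path.splitext(url)[1].lower() for url in (url1, url2) if url}
    # Same set-intersection idiom as the original `{ext1, ext2} & {'.bam'}`.
    return bool(exts & ALIGNED_EXTENSIONS)
```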

jpfeil · 2016-09-23T00:43:22Z

The test passed @jvivian

jvivian · 2016-09-22T21:37:50Z

src/toil_scripts/gatk_germline/common.py

@@ -0,0 +1,32 @@
+#!/usr/bin/env python2.7


Why is this in its own .py?

To prevent circular imports

jvivian · 2016-09-22T21:44:05Z

src/toil_scripts/gatk_germline/germline.py

-    :param work_dir: working directory
-    :param ids: shared file promises, dict
-    :param filenames: remaining arguments are filenames
+class GermlineSample(namedtuple('GermlineSample', 'uuid url paired_url rg_line')):


I don't quite understand this — Why not just use a Namespace object for every sample?

I think this is clearer because you can link to this class in docstrings and it documents the requirements for a sample in one location.
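A minimal sketch of that pattern follows. The class name and field names are taken from the diff; the docstring wording and the meaning given to each field are assumptions, as is the use of `__slots__`.

```python
from collections import namedtuple


class GermlineSample(namedtuple('GermlineSample', 'uuid url paired_url rg_line')):
    """
    Requirements for one input sample, documented in a single place.

    Field meanings below are assumed for illustration:

    uuid: unique sample identifier
    url: URL of the first input file
    paired_url: URL of the paired FASTQ file, or None
    rg_line: read group line, or None
    """
    # Keep instances as lightweight as a plain tuple (no per-instance dict).
    __slots__ = ()


sample = GermlineSample(uuid='sample-1',
                        url='file:///data/sample-1.bam',
                        paired_url=None,
                        rg_line=None)
```

Unlike a Namespace, the fields are fixed and immutable, so a typo in a field name fails immediately instead of silently creating a new attribute.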

jvivian · 2016-09-23T21:27:44Z

src/toil_scripts/gatk_germline/germline.py

+                                    disk=PromisedRequirement(lambda x: x.size, annotated_vcf.rv()))
+
+
+#####################################


Hannes does not like special exceptions for comments. Either use a single # or, if you want it to stand out, triple quotes.

jvivian · 2016-09-23T21:32:51Z

src/toil_scripts/gatk_germline/germline.py

+    else:
+        bwa_config.r1 = input1.rv()
+
+    # Use first URL to deduce paired FASTQ URL, if url looks like a paired file.


I do not think this is a good heuristic. I see a lot of unpaired samples that have R1.fq or 1.fq in them.

Okay that's a good point
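To make the concern concrete, here is a rough sketch of a filename-based heuristic of this kind. It is not the code under review: the regular expression and the function name are assumptions chosen to show the failure mode.

```python
import re

# Hypothetical pattern: a '1', optionally preceded by 'R', directly before
# a FASTQ suffix such as .fq, .fastq, .fq.gz or .fastq.gz.
_R1_PATTERN = re.compile(r'(R?)1(\.f(?:ast)?q(?:\.gz)?)$')


def guess_paired_url(url):
    """Guess the mate URL by swapping the trailing 1 for a 2, else None."""
    if _R1_PATTERN.search(url) is None:
        return None
    return _R1_PATTERN.sub(r'\g<1>2\g<2>', url)


# A genuinely paired file is handled as intended ...
paired_guess = guess_paired_url('s3://bucket/sample_R1.fq.gz')
# ... but an unpaired file whose name happens to end in 1 also matches,
# so the pipeline would go looking for a mate that does not exist.
unpaired_guess = guess_paired_url('s3://bucket/tumor1.fq')
```

Taking the paired URL as an explicit input avoids the guess entirely.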

jvivian · 2016-09-23T21:34:57Z

src/toil_scripts/gatk_germline/germline_config.py

@@ -0,0 +1,146 @@
+import textwrap


Maybe rename this to germline_config_manifest.py?

jvivian · 2016-09-23T21:35:43Z

src/toil_scripts/gatk_germline/hard_filter.py

@@ -0,0 +1,137 @@
+#!/usr/bin/env python2.7


Why is this in its own .py file? It seems like this and the other file either belong in toil-lib or should be in the germline source.

I think it's easier to read if the independent components of the pipeline are in separate modules.

jpfeil added the in progress label Jun 30, 2016

jpfeil force-pushed the issues/280-refactor-germline-pipeline branch 6 times, most recently from e86be03 to 16bdfaf Compare July 1, 2016 22:57

jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from b8bcde7 to b8fa460 Compare July 15, 2016 04:52

jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from dc0bb84 to 173d82d Compare July 16, 2016 22:17

jpfeil added the needs work label Jul 18, 2016

jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from 2b2cca5 to 7cc21a9 Compare July 19, 2016 02:12

jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from 14a268a to d8c80ab Compare August 2, 2016 23:59

jpfeil force-pushed the issues/280-refactor-germline-pipeline branch 3 times, most recently from af3b95e to da423fa Compare August 3, 2016 21:48

fnothaft mentioned this pull request Aug 5, 2016

Add clinical variant annotation to germline #401

Closed

fnothaft reviewed Aug 5, 2016

jpfeil added 18 commits September 22, 2016 11:39

Fix bug when realigning sorted BAM

a3f0f11

Run hard filter independently

3074706

Refactored joint genotyping and filtering

3b5c7ec

Removed synapse and unnecessary common lib functions

1dc9457

Removed functions in common lib

33409be

Removed split joint vcf by name

dc1958d

Add VQSR CLI and joint batching

946442e

Improved batching method

d858860

Add CombineGVCFs

7344019

Add VQSR and joint genotype test

6936371

Modified logging statements

14b8ba1

Add pipeline with VQSR test

6db248a

Check number of samples earlier in the pipeline

bcc27b8

Remove default resource values

1496221

Update PromisedRequirements

12daa94

Sync germline lib modules with toil-lib

84048e0

Add parameters for SNP and INDEL 1000G data

3f0877d

Added config requirements for each function

4c319e6

jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from d59dcd2 to 94f20c6 Compare September 22, 2016 19:12

Use common lib generate file function

6dd338e

jpfeil force-pushed the issues/280-refactor-germline-pipeline branch 2 times, most recently from bf725da to 3457393 Compare September 22, 2016 19:52

SQUASH: Add more documentation to the bwakit configuration function

be3f011

jpfeil force-pushed the issues/280-refactor-germline-pipeline branch from 3457393 to be3f011 Compare September 23, 2016 00:07

jvivian requested changes Sep 23, 2016


SQUASH: Addressed PR comments

a51c70a

jvivian approved these changes Sep 24, 2016


jvivian merged commit 2d7e5bc into master Sep 24, 2016

jvivian removed the in progress label Sep 24, 2016
