-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor gatk germline (resolves #280) #335
Conversation
e86be03
to
16bdfaf
Compare
b8bcde7
to
b8fa460
Compare
@jpfeil — Ping me when tests pass and I'll review |
dc0bb84
to
173d82d
Compare
@jpfeil do you want us to review this? If not, please set the "needs work" label. Come review time, I will also ask for the commit titles to all be prefixed in such a way that they connote the pipeline they refer to. |
2b2cca5
to
7cc21a9
Compare
14a268a
to
d8c80ab
Compare
WRT 14a268a, alignment with bwa is already implemented in the end-to-end germline pipeline --> https://github.com/BD2KGenomics/toil-scripts/blob/master/src/toil_scripts/adam_gatk_pipeline/align_and_call.py#L98, https://github.com/BD2KGenomics/toil-scripts/blob/master/src/toil_scripts/adam_gatk_pipeline/align_and_call.py#L183. |
af3b95e
to
da423fa
Compare
@jpfeil Just dropping a reminder to add a list with the required/complete features to this PR. |
|
@jpfeil please make the list more specific. What are the changes you are making? What features were requested to be added, removed, changed, etc. E.g., instead of: "Hard filtering", say "Hard filter for sites covered by fewer than 10 reads". Specifics! |
So this PR greatly expanded in scope from what we discussed at the start: eliminate dupe code between germline and somatic pipelines, focus on validation. We need this to merge for the ADAM recompute, so I'd like to triage the Oncotator/annotation and Synapse bits out to a follow-on PR, and set a deadline for validating this code and merging the PR. Specifically, I'd like to propose freezing the features, wrapping up development by Tuesday of next week, and shooting to validate the pipeline on a single sample by the Tuesday after. I think right now, if we triage the annotation/Synapse work out to another PR, that leaves joint calling as the last feature. At a cursory glance, that code is implemented already in this PR. I am concerned about the implementation—the GVCFs we are using for SGDP are O(25GB), so joint calling on a single node is out of the question—@jpfeil, what is the plan for addressing that issue? I'm proposing the following action items:
Does this proposal seem reasonable? CC @briandoconnor for review/work planning. |
Okay that sounds like a good plan. I will add a modified job function that handles running a single sample through the pipeline with VQSR. |
Shouldn't the single sample case fall under the hard filtering? Also, I thought we had an override that allows running <30 samples through VQSR. |
@@ -108,10 +108,10 @@ def preprocessing_declaration(job, config): | |||
job.fileStore.logToMaster('Ran preprocessing: ' + config.uuid) | |||
disk = '1G' if config.ci_test else '20G' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a change here, but I've seen OOD failures with the preprocessing pipeline in the wild. I'll create a note for you me and @jvivian to go through this and audit the disk requirements before merging.
I can add a check to see if there are enough variants in the sample before running VQSR. Can you send me a description of your samples? |
|
||
elif {ext1, ext2} & {'.bam'}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add SAM/CRAM too?
d59dcd2
to
94f20c6
Compare
bf725da
to
3457393
Compare
3457393
to
be3f011
Compare
The test passed @jvivian |
@@ -0,0 +1,32 @@ | |||
#!/usr/bin/env python2.7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this in its own .py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To prevent circular imports
:param work_dir: working directory | ||
:param ids: shared file promises, dict | ||
:param filenames: remaining arguments are filenames | ||
class GermlineSample(namedtuple('GermlineSample', 'uuid url paired_url rg_line')): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand this — Why not just use a Namespace object for every sample?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is clearer because you can link to this class in docstrings and it documents the requirements for a sample in one location.
disk=PromisedRequirement(lambda x: x.size, annotated_vcf.rv())) | ||
|
||
|
||
##################################### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hannes does not like special exceptions for comments, either use one # or triple quotes if you want to make it stand out.
else: | ||
bwa_config.r1 = input1.rv() | ||
|
||
# Use first URL to deduce paired FASTQ URL, if url looks like a paired file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think this is a good heuristic. I see a lot of unpaired samples that have R1.fq or 1.fq in them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay that's a good point
@@ -0,0 +1,146 @@ | |||
import textwrap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe change to germline_config_manifest.py
?
@@ -0,0 +1,137 @@ | |||
#!/usr/bin/env python2.7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this in its own .py file? It seems like this and the other file either belong in toil-lib or should be in the germline source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's easier to read if the independent components of the pipeline are in separate modules.
resolves #280 #426 #425 #401
@jvivian @fnothaft @hannes-ucsc