Skip to content
This repository has been archived by the owner on Oct 2, 2020. It is now read-only.

[WIP] CNVkit tool definitions #93

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

anton-khodak
Copy link

@anton-khodak anton-khodak commented May 31, 2016

Standing PR to add tool descriptions (created by argparse2cwl) and tests for CNVkit tools .

Issues I encountered on first steps:

  • .gitignore prohibits to add bioinformatics stuff, i.e. test data, to the repository. How is it supposed to be tested then?
  • since I have no experience with bioinformatics yet, I don't know which data to use for running tools. I used random files with proper extensions which I downloaded from the Internet, but that approach doesn't work, for example, I got an error while running:
$ cnvkit.py batch --processes 1 
--normal test-files/s5DE199B-D6AF-C6EC-678A-DEC1179D1B97.fastq 
--fasta test-files/cnvkit-batch/ERCC92.fa 
--targets test-files/InfiniumPsychArray-24v1-1_A1.bed 
-annotate test-files/cnvkit-batch/refFlat.txt 
--split --access test-files/InfiniumPsychArray-24v1-1_A1.bed 
--output-dir . --scatter --diagram
Detected file format: BED
Applying annotations as target names
Splitting large targets
Traceback (most recent call last):
  File "/usr/local/bin/cnvkit.py", line 11, in <module>
    args.func(args)
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/commands.py", line 96, in _cmd_batch
    args.processes, args.count_reads)
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/commands.py", line 138, in batch_make_reference
    else {}))
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/commands.py", line 327, in do_targets
    ['chromosome', 'start', 'end', 'name'])
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/gary.py", line 66, in from_rows
    table = pd.DataFrame.from_records(rows, columns=columns)
  File "/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py", line 939, in from_records
    first_row = next(data)
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 287, in split_targets
    for chrom, start, end, name in region_rows:
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 21, in assign_names
    ref_genes = read_refflat_genes(refflat_fname)
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 80, in read_refflat_genes
    name, _rx, chrom, strand, start, end, _ex = parse_refflat_line(line)
  File "/usr/local/lib/python3.4/dist-packages/cnvlib/target.py", line 133, in parse_refflat_line
    assert len(exons) == int(exon_count), (
TypeError: object of type 'zip' has no len()

I think this error might be caused by irrelevant data.

Also, I couldn't find copy number reference profile sample files (.cnn) at all. If somebody who uses CNVkit frequently could give me a hint where to take proper data, my work in testing would have been much facilitated.

  • I didn't find where tool-name-test.yaml file format is specified. It was intuitively understandable what to write there, but I wish somebody pointed the standard for those files.
  • I didn't work with Docker images before, I need to spend some time learning how to write Dockerfiles.

log-CNR of chrX; otherwise male samples would have -1 chrX).
inputBinding:
position: 2
prefix: --male-reference
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CWL tip: for Argparse only the position dependent arguments need their position specified. Arguments that have a prefix like --male-reference can occur in any order, so it would be nice if cwlargparse didn't specify the unneeded positions in these cases.

@mr-c
Copy link
Member

mr-c commented Jun 8, 2016

@mr-c
Copy link
Member

mr-c commented Jun 8, 2016

I think it is fine to just check in the generated descriptions; don't worry about writing a specific test. As long as the generated output parses, that is good enough for now.

@brainstorm
Copy link
Collaborator

brainstorm commented Jun 10, 2016

I'm totally with @mr-c, we should focus first on CWL, not on specific tools since the amount of work can be quite substantial. If you want to see whether one of the CNVkit subtools works it's fine to dedicate some focused effort, but by no means aiming to cover the whole suite of tools.

Hope that makes sense ;)

@brainstorm
Copy link
Collaborator

OTOH, for a good example on how to test different tools (in my case SV callers), MetaSV has it quite well wrapped up:

https://github.com/bioinform/metasv

But this is just an example, don't spend too much time looking through it.

@anton-khodak
Copy link
Author

anton-khodak commented Jun 10, 2016

@brainstorm , that's great! I misinterpreted the goal of the PR, it was not to pass Travis checks but to merely validate those tools. In that case, I'll fix the job file (@mr-c pointed indirectly on that issue) and push all other tools.

UPD. I should have looked more closely at test/cwltest.py... Travis CI checks the mere validity of tools, not how they are executed (with or without errors).

#################################################################

FROM python:2.7
MAINTAINER Anton Khodak <[email protected]>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thanks for wrapping this on a docker container 👍

@etal
Copy link

etal commented Jun 21, 2016

Hi guys, I'm happy to help with testing CNVkit and/or tweaking the test suite to play better with argparse2cwl. You can skip wrapping anything marked "deprecated" (e.g. loh, genome2access), those parts will be removed in the next release. Just let me know anything else you need.

@brainstorm
Copy link
Collaborator

@etal, very happy to have you help Anton with that. I was looking at the outputs generated by argparse2cwl yesterday but since I never used CNVkit before, I'm missing a few bits of domain expertise there, so help is super welcome, thanks!

prefix: --diagram

outputs:
[]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several outputs from this command and they vary based on the input BAM filenames and the options given.

  • For each tumor/test-sample BAM named e.g. Sample.bam, the outputs are: "Sample.targetcoverage.cnn", "Sample.antitargetcoverage.cnn", "Sample.cnr", "Sample.cns"
  • If the --scatter option is given, then for each tumor/test sample, "Sample-scatter.pdf" is created
  • Similarly, the --diagram option creates "Sample-diagram.pdf"
  • For all of the above, if -d/--output-dir is specified, the created file names are relative to (i.e. in) that specified directory
  • If the -r/--reference option is not given, then a .cnn file is created either with the filename given by --output-reference (regardless of the -d/--output-dir path) or by default "cnv_reference.cnn"

@etal
Copy link

etal commented Sep 19, 2016

I've released a new minor version of CNVkit that drops the deprecated parts and introduces a few new options. I think the current CWL wrappers in Anton's repo should still work, but batch has a new --method option that's worth exposing. Let me know if there's anything else I can do to help complete and maintain these wrappers.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants