Skip to content

Commit

Permalink
Add disambiguate tool to QC portion of the weave pipeline (#33)
Browse files Browse the repository at this point in the history
* feat: add disambiguate to docker image

* feat: add short read aligners to docker

* feat: add disambiguate functionality to pipeline

* fix: applying testing fixes for disambiguate feat

* fix: remove type

* fix: natural name sort bams

* fix: help genome alias table and remote cache assets

* fix: add actual help messages to argparse arguments

* fix: correct readme

---------

Co-authored-by: Ryan Routsong <[email protected]>
  • Loading branch information
rroutsong and Ryan Routsong authored Jan 16, 2024
1 parent ea086fc commit b13c507
Show file tree
Hide file tree
Showing 11 changed files with 216 additions and 36 deletions.
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,9 @@

<h1>weave 🔬</h1>

**_An awesome metagenomic and metatranscriptomics pipeline_**
**_An awesome demultiplexing and quality control pipeline_**

[![tests](https://github.com/OpenOmics/weave/workflows/tests/badge.svg)](https://github.com/OpenOmics/weave/actions/workflows/main.yaml) [![docs](https://github.com/OpenOmics/weave/workflows/docs/badge.svg)](https://github.com/OpenOmics/weave/actions/workflows/docs.yml) [![GitHub issues](https://img.shields.io/github/issues/OpenOmics/weave?color=brightgreen)](https://github.com/OpenOmics/weave/issues) [![GitHub license](https://img.shields.io/github/license/OpenOmics/weave)](https://github.com/OpenOmics/weave/blob/main/LICENSE)
[![tests](https://github.com/OpenOmics/weave/workflows/tests/badge.svg)](https://github.com/OpenOmics/weave/actions/workflows/main.yaml) [![docs](https://github.com/OpenOmics/weave/actions/workflows/doc.yml/badge.svg)](https://github.com/OpenOmics/weave/actions/workflows/docs.yml) [![GitHub issues](https://img.shields.io/github/issues/OpenOmics/weave?color=brightgreen)](https://github.com/OpenOmics/weave/issues) [![GitHub license](https://img.shields.io/github/license/OpenOmics/weave)](https://github.com/OpenOmics/weave/blob/main/LICENSE)

<i>
This is the home of the pipeline, weave. Its long-term goals: to provide accurate quantification, taxonomic classification, and functional profiling of assembled (bacteria and archaea) metagenomes!
Expand Down
6 changes: 3 additions & 3 deletions config/bigsky.json
Original file line number Diff line number Diff line change
Expand Up @@ -3,17 +3,17 @@
"mounts": {
"kaiju": {
"to": "/opt/kaiju",
"from": "/gs1/RTS/OpenOmics/references/Dmux/kaiju/kaiju_db_nr_euk_2023-05-10",
"from": "/gs1/RTS/OpenOmics/references/weave/kaiju/kaiju_db_nr_euk_2023-05-10",
"mode": "ro"
},
"kraken2" : {
"to": "/opt/kraken2",
"from": "/gs1/RTS/OpenOmics/references/Dmux/kraken2/k2_pluspfp_20230605",
"from": "/gs1/RTS/OpenOmics/references/weave/kraken2/k2_pluspfp_20230605",
"mode": "ro"
},
"fastq_screen" : {
"to": "/fdb/fastq_screen/FastQ_Screen_Genomes",
"from": "/gs1/RTS/OpenOmics/references/Dmux/FastQ_Screen_Genomes",
"from": "/gs1/RTS/OpenOmics/references/weave/FastQ_Screen_Genomes",
"mode": "ro"
}
}
Expand Down
4 changes: 2 additions & 2 deletions config/biowulf.json
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,12 @@
"mounts": {
"kaiju": {
"to": "/opt/kaiju",
"from": "/data/OpenOmics/references/Dmux/kaiju/kaiju_db_nr_euk_2023-05-10",
"from": "/data/OpenOmics/references/weave/kaiju/kaiju_db_nr_euk_2023-05-10",
"mode": "ro"
},
"kraken2" : {
"to": "/opt/kraken2",
"from": "/data/OpenOmics/references/Dmux/kraken2/k2_pluspfp_20230605",
"from": "/data/OpenOmics/references/weave/kraken2/k2_pluspfp_20230605",
"mode": "ro"
},
"fastq_screen" : {
Expand Down
2 changes: 2 additions & 0 deletions config/remote.json
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
{
"bcl2fastq": "docker://umccr/bcl2fastq:latest",
"weave": "docker://rroutsong/weave_ngsqc:0.0.1",
"bclconvert": "docker://rroutsong/weave_bclconvert:0.0.3",
"disambiguate": "docker://quay.io/biocontainers/ngs-disambiguate:2018.05.03--hf393df8_7",
"kraken": "https://genome-idx.s3.amazonaws.com/kraken/k2_pluspfp_16gb_20231009.tar.gz",
"kaiju": "https://kaiju-idx.s3.eu-central-1.amazonaws.com/2023/kaiju_db_nr_euk_2023-05-10.tgz",
"fastq_screen": "filelist://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/genome_locations.txt"
Expand Down
5 changes: 4 additions & 1 deletion docker/NGS_QC/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,13 @@ ENV TZ=America/New_York
ENV DEBIAN_FRONTEND=noninteractive
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update; apt-get upgrade -y bash && \
apt-get install -y fastqc multiqc fastp kraken2 bowtie2 libgd-graph-perl git curl libz-dev build-essential
apt-get install -y fastqc python-is-python3 rna-star bwa hisat2 python3-pip samtools multiqc fastp kraken2 bowtie2 libgd-graph-perl git curl libz-dev build-essential
RUN mkdir /opt2; curl -S -s -L https://github.com/StevenWingett/FastQ-Screen/archive/refs/tags/v0.15.3.tar.gz --output /opt2/v0.15.3.tar.gz
RUN cd /opt2 && tar xvf v0.15.3.tar.gz
RUN cd /opt2 && git clone https://github.com/bioinformatics-centre/kaiju.git
RUN cd /opt2/kaiju/src; make
RUN cd /opt2 && git clone https://github.com/AstraZeneca-NGS/disambiguate.git
RUN pip install pysam
RUN chmod +x /opt2/disambiguate/disambiguate.py; ln -sf /opt2/disambiguate/disambiguate.py /usr/bin/disambiguate.py
ENV PATH="${PATH}:/opt2/FastQ-Screen-0.15.3:/opt2/kaiju/bin"
ADD docker/NGS_QC/fastq_screen.conf /etc/fastq_screen.conf
38 changes: 38 additions & 0 deletions scripts/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -158,4 +158,42 @@ def get_bigsky_seq_dirs():
"seq": get_biowulf_seq_dirs(),
"profile": Path(Path(__file__).parent.parent, "utils", "profiles", "biowulf").resolve(),
}
}


GENOME_CONFIGS = {
"biowulf": {
"hg19": "/vf/users/OpenOmics/references/genomes/human/hg19/GRCh37.primary_assembly.genome.fa.gz",
"grch37": "/vf/users/OpenOmics/references/genomes/human/GRCh37/GRCh37.primary_assembly.genome.fa.gz",
"hg38": "/vf/users/OpenOmics/references/genomes/human/hg38/GRCh38.p14.genome.fa.gz",
"grch38": "/vf/users/OpenOmics/references/genomes/human/GRCh38/GRCh38.p14.genome.fa.gz",
"mm9": "/vf/users/OpenOmics/references/genomes/mouse/mm9/mm9.fa.gz",
"grcm37": "/vf/users/OpenOmics/references/genomes/mouse/GRCm37/mm9.fa.gz",
"mm10": "/vf/users/OpenOmics/references/genomes/mouse/mm10/GRCm38.p4.genome.fa.gz",
"grcm38": "/vf/users/OpenOmics/references/genomes/mouse/GRCm38/GRCm38.p4.genome.fa.gz",
"mm39": "/vf/users/OpenOmics/references/genomes/mouse/mm39/GRCm39.genome.fa.gz",
"grcm39": "/vf/users/OpenOmics/references/genomes/mouse/GRCm39/GRCm3s9.genome.fa.gz",
"rhemac10": "/vf/users/OpenOmics/references/genomes/rhesus/rhe10/rheMac10.fa.gz",
"mmul10": "/vf/users/OpenOmics/references/genomes/rhesus/mmul10/rheMac10.fa.gz",
"agm": "/vf/users/OpenOmics/references/genomes/agm/1.1/GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.fna.gz",
"mesaur": "/vf/users/OpenOmics/references/genomes/mesaur/2.0/GCF_017639785.1_BCM_Maur_2.0_genomic.fna.gz",
"cynomac": "/vf/users/OpenOmics/references/genomes/cynomac/v2/GCF_012559485.2_MFA1912RKSv2_genomic.fna.gz",
},
"bigsky": {
"hg19": "/gs1/RTS/OpenOmics/references/genomes/human/hg19/GRCh37.primary_assembly.genome.fa.gz",
"grch37": "/gs1/RTS/OpenOmics/references/genomes/human/GRCh37/GRCh37.primary_assembly.genome.fa.gz",
"hg38": "/gs1/RTS/OpenOmics/references/genomes/human/hg38/GRCh38.p14.genome.fa.gz",
"grch38": "/gs1/RTS/OpenOmics/references/genomes/human/GRCh38/GRCh38.p14.genome.fa.gz",
"mm9": "/gs1/RTS/OpenOmics/references/genomes/mouse/mm9/mm9.fa.gz",
"grcm37": "/gs1/RTS/OpenOmics/references/genomes/mouse/GRCm37/mm9.fa.gz",
"mm10": "/gs1/RTS/OpenOmics/references/genomes/mouse/mm10/GRCm38.p4.genome.fa.gz",
"grcm38": "/gs1/RTS/OpenOmics/references/genomes/mouse/GRCm38/GRCm38.p4.genome.fa.gz",
"mm39": "/gs1/RTS/OpenOmics/references/genomes/mouse/mm39/GRCm39.genome.fa.gz",
"grcm39": "/gs1/RTS/OpenOmics/references/genomes/mouse/GRCm39/GRCm39.genome.fa.gz",
"rhemac10": "/gs1/RTS/OpenOmics/references/genomes/rhesus/rhe10/rheMac10.fa.gz",
"mmul10": "/gs1/RTS/OpenOmics/references/genomes/rhesus/mmul10/rheMac10.fa.gz",
"agm": "/gs1/RTS/OpenOmics/references/genomes/agm/1.1/GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic.fna.gz",
"mesaur": "/gs1/RTS/OpenOmics/references/genomes/mesaur/2.0/GCF_017639785.1_BCM_Maur_2.0_genomic.fna.gz",
"cynomac": "/gs1/RTS/OpenOmics/references/genomes/cynomac/v2/GCF_012559485.2_MFA1912RKSv2_genomic.fna.gz",
},
}
30 changes: 28 additions & 2 deletions scripts/files.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,12 @@
# ~~~~~~~~~~~~~~~
# file system helper functions for the Dmux software package
# ~~~~~~~~~~~~~~~
from pathlib import Path
import xml.etree.ElementTree as ET
from pathlib import Path
from os import access as check_access, R_OK, W_OK
from functools import partial
from .samplesheet import IllumniaSampleSheet
from .config import get_current_server, LABKEY_CONFIGS, DIRECTORY_CONFIGS
from .config import get_current_server, GENOME_CONFIGS, DIRECTORY_CONFIGS


def get_all_seq_dirs(top_dir, server):
Expand Down Expand Up @@ -48,6 +48,32 @@ def valid_run_output(output_directory, dry_run=False):
return output_directory


def valid_fasta(suspect):
server_genomes = GENOME_CONFIGS[get_current_server()]
is_valid = False
if suspect.lower() in server_genomes:
is_valid = True
suspect = server_genomes[suspect.lower()]
else:
the_suspect = Path(suspect)
exts = the_suspect.suffixes
if any([
'.fna' in exts,
'.fa' in exts,
'.fasta' in exts,
'.fna' in exts and '.gz' in exts,
'.fa' in exts and '.gz' in exts,
'.fasta' in exts and '.gz' in exts,
]):
is_valid = True
suspect = str(Path(suspect).absolute())

if not is_valid:
raise ValueError

return suspect


def get_all_staged_dirs(top_dir, server):
return list(filter(partial(is_dir_staged, server), get_all_seq_dirs(top_dir, server)))

Expand Down
42 changes: 40 additions & 2 deletions scripts/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,14 +8,15 @@
import os
import yaml
import sys
import textwrap
from argparse import ArgumentTypeError
from dateutil.parser import parse as date_parser
from subprocess import Popen, PIPE, STDOUT
from pathlib import Path, PurePath

# ~~~ internals ~~~
from .files import parse_samplesheet, mk_or_pass_dirs
from .config import SNAKEFILE, DIRECTORY_CONFIGS, get_current_server, get_resource_config
from .config import SNAKEFILE, DIRECTORY_CONFIGS, GENOME_CONFIGS, get_current_server, get_resource_config


host = get_current_server()
Expand All @@ -40,6 +41,19 @@ def default(self, obj):
return super(self).default(obj)


def get_alias_table():
return textwrap.dedent("""Genome short name alias table:
+----------------+-------------------------------------------+
| Organism | Genomes supported (aka) |
+----------------+-------------------------------------------+
| HUMAN | hg19(grch37) / hg38(grch38) |
+----------------+-------------------------------------------+
| MOUSE | mm9(grcm37) / mm10(grcm38) / mm39(grcm39) |
+----------------+-------------------------------------------+
| RHESUS MACAQUE | RHEMAC10(mmul10) |
+----------------+-------------------------------------------+""")


def valid_runid(id_to_check):
'''
Given an input ID get it's validity against the run id format:
Expand Down Expand Up @@ -233,6 +247,9 @@ def exec_pipeline(configs, dry_run=False, local=False):
if not bclcon_log_dir.exists():
bclcon_log_dir.mkdir(mode=0o755, parents=True)
extra_to_mount.append(str(bclcon_log_dir) + ":" + "/var/log/bcl-convert:rw")
if this_config.get('disambiguate', False):
extra_to_mount.append(Path(this_config['host_genome']).parent)
extra_to_mount.append(Path(this_config['pathogen_genome']).parent)
singularity_binds = get_mounts(*extra_to_mount)
config_file = Path(this_config['out_to'], '.config', f'config_job_{str(i)}.json').absolute()
json.dump(this_config, open(config_file, 'w'), cls=PathJSONEncoder, indent=4)
Expand Down Expand Up @@ -275,4 +292,25 @@ def is_bclconvert(samplesheet):
check = False
if samplesheet.instrument in BCLCONVERT_INSTRUMENTS or samplesheet.platform in BCLCONVERT_PLATFORMS:
check = True
return check
return check


def valid_host_pathogen_genomes(host, pathogen):
g1, g2 = False, False
genomes = GENOME_CONFIGS[get_current_server()]

if Path(host).absolute().exists():
g1 = True
host = str(Path(host).absolute())

if Path(pathogen).absolute().exists():
g2 = True
pathogen = str(Path(pathogen).absolute())

if not all([g1, g2]):
if not g1:
raise ValueError('Host genome does not exist on the file system.')
if not g2:
raise ValueError('Pathogen genome does not exist on the file system.')

return host, pathogen
49 changes: 33 additions & 16 deletions weave
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,22 @@ def run(args):
exec_config['bcl_files'].append(bcls)
exec_config['demux_data'].append(files.check_if_demuxed(rundir))

# ~~~ disambiguate genome configuration ~~~
if all([args.host, args.pathogen]):
if 'disambiguate' not in exec_config:
exec_config['disambiguate'] = []
if 'host_genome' not in exec_config:
exec_config['host_genome'] = []
if 'pathogen_genome' not in exec_config:
exec_config['pathogen_genome'] = []

utils.valid_host_pathogen_genomes(args.host, args.pathogen)
exec_config['disambiguate'].append(True)
exec_config['host_genome'].append(args.host)
exec_config['pathogen_genome'].append(args.pathogen)
else:
assert(not any([args.host, args.pathogen])), 'Must specify both host and pathogen genometype!'

# ~~~ QC/QA configuration ~~~
exec_config['bclconvert'].append(utils.is_bclconvert(sample_sheet))
exec_config['run_ids'].append(rundir.name)
Expand Down Expand Up @@ -67,41 +83,42 @@ def unlock_dir(sub_args):

# ~~~~ argument parsing commands ~~~~
if __name__ == '__main__':
main_parser = argparse.ArgumentParser(prog='weave')
main_parser = argparse.ArgumentParser(prog='weave', formatter_class=argparse.RawTextHelpFormatter, epilog=utils.get_alias_table())
sub_parsers = main_parser.add_subparsers(help='run subcommand help')

parser_run = sub_parsers.add_parser('run')
parser_run = sub_parsers.add_parser('run', formatter_class=argparse.RawTextHelpFormatter, epilog=utils.get_alias_table())
parser_run.add_argument('rundir', metavar='<run directory>', nargs="+", type=utils.valid_run_input,
help='Full & complete run id (format YYMMDD_INSTRUMENTID_TIME_FLOWCELLID) or absolute paths')
help='Full & complete run id (format YYMMDD_INSTRUMENTID_TIME_FLOWCELLID) or absolute paths.')
parser_run.add_argument('-s', '--seq_dir', metavar='<sequencing directory>', default=None, type=str,
help='Root directory for sequencing data (defaults for biowulf/bigsky/locus), must contain directories ' + \
'matching run ids, if not using full paths.')
parser_run.add_argument('-o', '--output', metavar='<output directory>', default=None, type=str,
help='Top-level output directory for demultiplexing data (defaults to input directory + runid + "_demux")')
help='Top-level output directory for demultiplexing data (defaults to input directory + runid + "_demux").')
parser_run.add_argument('-d', '--dry-run', action='store_true',
help='Dry run the demultiplexing workflow')
help='Dry run the demultiplexing workflow.')
parser_run.add_argument('-n', '--noqc', action='store_false',
help='Do not run the QC/QA portion of the workflow (Default is on)')
help='Do not run the QC/QA portion of the workflow (Default is on).')
parser_run.add_argument('--sheetname', metavar='Sample Sheet Filename',
help='Name of the sample sheet file to look for (default is SampleSheet.csv)')
help='Name of the sample sheet file to look for (default is SampleSheet.csv).')
parser_run.add_argument('-l', '--local', action='store_true',
help='Execute pipeline locally without a dispatching executor')
# force endedness flags
endedness = parser_run.add_mutually_exclusive_group()
endedness.add_argument('--single_end', default=None, action='store_true',
help='Declare endedness of run (in cases in which it is not detectable from sample sheet), mutally exclusive to paired_end')
endedness.add_argument('--paired_end', default=None, action='store_true',
help='Declare endedness of run (in cases in which it is not detectable from sample sheet), mutally exclusive to single_end')
help='Execute pipeline locally without a dispatching executor.')

# disambiguate arguments
parser_run.add_argument('-t', '--host', type=files.valid_fasta, default=None,
help='Full path to host genome for disambiguate to use or short name genome alias.')

parser_run.add_argument('-p', '--pathogen', type=files.valid_fasta, default=None,
help='Full path to pathogen/graft/parasite genome for disambiguate to use or short name genome alias.')

parser_cache = sub_parsers.add_parser('cache')
parser_cache.add_argument('cachedir', metavar='<cache directory>', type=cache.valid_dir,
help='Relative or absolute path to directory for cache storage')
help='Relative or absolute path to directory for cache storage.')
parser_cache.add_argument('-l', '--local', action='store_true',
help='Execute pipeline locally without a dispatching executor')

parser_unlock = sub_parsers.add_parser('unlock')
parser_unlock.add_argument('unlockdir', metavar='<directory to unlock>', type=cache.valid_dir,
help='Relative or absolute path to directory for cache storage')
help='Full path to directory to unlock.')

parser_run.set_defaults(func = run)
parser_cache.set_defaults(func = get_cache)
Expand Down
9 changes: 9 additions & 0 deletions workflow/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -128,6 +128,15 @@ else:
if config["runqc"]:
all_outputs.extend(qa_qc_outputs)

if config.get('disambiguate', False):
all_outputs.extend(flatten([
expand("{out_to}/{project}/{sids}/disambiguate/{sids}.ambiguousSpeciesA.bam", **demux_expand_args),
expand("{out_to}/{project}/{sids}/disambiguate/{sids}.ambiguousSpeciesB.bam", **demux_expand_args),
expand("{out_to}/{project}/{sids}/disambiguate/{sids}.disambiguatedSpeciesA.bam", **demux_expand_args),
expand("{out_to}/{project}/{sids}/disambiguate/{sids}.disambiguatedSpeciesB.bam", **demux_expand_args),
expand("{out_to}/{project}/{sids}/disambiguate/{sids}_summary.txt", **demux_expand_args),
]))


rule all:
input:
Expand Down
Loading

0 comments on commit b13c507

Please sign in to comment.