This repository contains a Snakemake workflow for the vg toolkit. With it, you can:
- index a pangenome with vg
- map short reads to the pangenome with vg Giraffe
- call small variants with DeepVariant
- genotype structural variants with vg call
- call structural variants with manta
If running the all rule, all the analyses listed above will be performed. Specific analyses can also be run using the dedicated rules described below. Of note, the pipeline automatically runs the necessary upstream steps for each analysis, which means, for example, that read mapping will be performed (if necessary) when using a variant calling rule.
Note: for the older version of the Snakemake pipeline see the vg_genotyping_2020 branch.
This step is not necessary when using an existing pangenome (e.g. the HPRC pangenome, see below).
Starting from a GFA, the index_pangenome rule creates the GBZ, distance, and minimizer indexes.
It starts from an unzipped GFA file, specified as the gfa field of the config. A graph name (graph) is also provided to label the output files.
For example, to run the test data:
gunzip -c testdata/mhc.gfa.gz > testdata/mhc.gfa
snakemake --config gfa=testdata/mhc.gfa graph=mhc-test -p index_pangenome --cores 2
The indexes will be placed in the results/pg directory.
There are several ways to analyze short-read sequencing data. The inputs might be gzipped FASTQ files, but they might also be CRAM files (e.g. to reanalyze data from existing projects). Some steps are optional and can be enabled/disabled, for example trimming of the raw reads, realignment of aligned reads, or the use of a GPU.
We'll start by describing how to specify the input files and the main default analyses. At the end, we'll explain how to switch specific options on/off and how to run the workflow on HPC systems.
The paths to the input files should be listed in a TSV file with a sample column containing the sample name, and other columns giving the paths to the input files for each sample. This TSV file is specified by sample_tsv in the config.
Input FASTQ files are taken from columns fq1 and fq2 of the TSV specified by sample_tsv in the config. See config/sample_info.tsv and config/config.cram.yaml for an example.
It is also possible to start from a CRAM file. In that case, a cram column should have the path to the CRAM file in the TSV file. In addition, the reference FASTA used to make the CRAMs must be specified in the config with cram_ref. See config/sample_info.cram.tsv and config/config.cram.yaml for an example. Note: fq1/fq2 and cram columns can all be present in the TSV but, if both are available for a sample, the FASTQs will be used.
If sample is specified in the config, only this sample, or these samples (if it's an array or a white-space-separated list of samples), will be run. Otherwise, all samples in the sample_tsv TSV file will be analyzed.
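For illustration, here is a hypothetical sample sheet and a command restricting the run to two samples. The file paths and sample names below are made up; see config/sample_info.tsv and config/sample_info.cram.tsv for real examples.

```sh
# Hypothetical tab-separated sample sheet: samp1 uses FASTQs, samp2 a CRAM
# (a CRAM input also needs cram_ref in the config, see above).
printf 'sample\tfq1\tfq2\tcram\n' > config/my_samples.tsv
printf 'samp1\treads/samp1_1.fastq.gz\treads/samp1_2.fastq.gz\t\n' >> config/my_samples.tsv
printf 'samp2\t\t\tcrams/samp2.cram\n' >> config/my_samples.tsv

# Analyze only samp1 and samp2 from this sheet.
snakemake --configfile config/config.yaml \
  --config sample_tsv=config/my_samples.tsv sample="samp1 samp2" \
  -p all --cores 2 -n
```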
This will map short reads to the pangenome using vg Giraffe and produce a gzipped GAF file available at results/{sample}.{graph}.gaf.gz
snakemake --configfile config/config.yaml --config sample=samp1 -p map_short_reads --cores 2 -n
(remove -n, which means "dry-run", to actually run the workflow)
Here, the reads will be projected to a linear reference genome represented by paths in the pangenome. The user can (and usually should) specify a text file listing those reference paths (one path per line). This paths list file is specified in the config with ref_paths_list.
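As a sketch, the paths list is just a text file with one reference path name per line. The names below are placeholders and must match paths that actually exist in your pangenome (see the HPRC example below for how to extract them with vg paths):

```sh
# Hypothetical ref_paths_list file: one pangenome path name per line.
printf 'chr1\nchr2\nchr3\n' > config/my.ref_paths_list.txt
# Then set ref_paths_list: config/my.ref_paths_list.txt in the config file,
# or add --config ref_paths_list=config/my.ref_paths_list.txt to the commands below.
```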
snakemake --configfile config/config.yaml -p call_small_variants_from_short_reads --cores 2 --use-singularity -n
If needed, the reads will be mapped to the pangenome first.
Same as above, the genotyped variants will ultimately be represented relative to a linear reference genome. The user can specify the list of reference paths in the config with ref_paths_list, a text file listing the names of those paths.
snakemake --configfile config/config.yaml -p genotype_variants_from_short_reads --cores 2 -n
If needed, the pangenome will be indexed and the reads mapped with the rules described above.
Sometimes, we might want to genotype and report all variant sites, even those homozygous for the reference allele, for example to merge VCFs for multiple samples.
This mode can be enabled by setting gt_ref to True in the config (either in the config.yaml file, or with --config gt_ref=True on the command line).
If gt_ref is true, vg call will use the -a parameter to force reporting on every snarl.
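For example, to genotype SVs with every snarl reported (same rule as above):

```sh
snakemake --configfile config/config.yaml --config gt_ref=True \
  -p genotype_variants_from_short_reads --cores 2 -n
```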
snakemake --configfile config/config.yaml -p all --cores 2 -n
In general, make sure the files are dated consistently with their dependencies.
For example, if the GBZ file is more recent than the other indexes, the workflow will re-create them.
The order, from oldest to most recent file, is: .gbz, .dist, .hapl, .ref.fa, .ref.fa.fai.
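If the timestamps ended up out of order (e.g. after copying files), one way to restore a consistent order is to touch the files from oldest to newest. The file names below assume a graph labeled mhc-test with indexes in results/pg; adjust them to your setup.

```sh
# Re-date the files so that each one is newer than its dependencies.
cd results/pg
for f in mhc-test.gbz mhc-test.dist mhc-test.hapl mhc-test.ref.fa mhc-test.ref.fa.fai; do
  touch "$f"
  sleep 1
done
cd -
```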
Download the Minigraph-Cactus pangenome files from the human-pangenomics/hpp_pangenome_resources repo:
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.gbz
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.hapl
We'll project the reads/variants on GRCh38 so we've prepared a list of paths corresponding to the main chromosomes (see config/hprc-v1.1-mc-grch38.paths_list.txt). This list was made using:
vg paths -RL -x hprc-v1.1-mc-grch38.gbz | grep GRCh38 | grep -v "_" | grep -v EBV > config/hprc-v1.1-mc-grch38.paths_list.txt
The path names were also reordered (manually) because this order defines the order of the sequences in the BAMs.
By pointing to those files in the config (see config/config.hprc.yaml), the workflow can be run with:
snakemake --configfile config/config.hprc.yaml --cores 8 --use-singularity -p all -n
The config/config.hprc.yaml config file will of course need to be tweaked to specify your sample information (sample_tsv field).
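Alternatively, the sample sheet can be overridden on the command line without editing the config file (the TSV path here is hypothetical):

```sh
snakemake --configfile config/config.hprc.yaml \
  --config sample_tsv=config/my_samples.tsv \
  --cores 8 --use-singularity -p all -n
```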
Main information to provide through the config (file or command line):
- graph: a name/label for the pangenome graph
- gfa: a pangenome file in GFA format. Must be unzipped.
- gbz: a pangenome file in GBZ format. At least gfa or gbz must be specified.
- hapl: a kmer index for each haplotype of the pangenome. Used by Giraffe to map reads on a haplotype-sampled pangenome. Will be recomputed if missing.
- ref_paths_list: text file listing the paths in the pangenome to use as reference when projecting the reads or calling variants.
- seqn_prefix: the prefix to remove from these reference paths in the final BAM/VCF files. See Chromosome re-naming below.
- sample_tsv: the TSV file defining the samples and their input files (FASTQ, CRAM). See Input FASTQ/CRAM section above.
- sample: which samples to analyze? Either a sample name, a string with sample names separated by white space, or an array of sample names. If missing or empty, all samples will be processed.
Other options:
- ref_fa: the linear reference genome sequence corresponding to the reference paths specified by ref_paths_list. Will be extracted from the pangenome if missing. See Linear reference sequence and indexes below.
- gt_min_var_len: the minimum length for variants to be genotyped with vg. Default is 30 (bp). Decrease, down to 0, to genotype small variants too.
- indel_realign_reads: should reads be realigned to improve indel detection? See Read re-alignment section below.
- cram_ref: a reference for input CRAMs. See Input FASTQ/CRAM section above.
- refsynt_fa/adapters_fa/adapters_tsv: synthetic contaminant and adapter information used to trim the raw FASTQs. See Read trimming section below.
- use_gpu: should a GPU be used for DeepVariant? See Using GPUs section below to set up GPUs.
- rm_all_on_success: should the result files be removed at the end? Useful to control disk space usage when paired with an archiving script. See Minimizing local disk usage section below.
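To tie the main fields together, here is a sketch of a minimal config. All paths and values below are placeholders; see config/config.yaml and config/config.hprc.yaml for real examples.

```sh
# Sketch of a minimal config file (placeholder paths and values).
cat > my.config.yaml <<'EOF'
graph: mygraph
gbz: /path/to/mygraph.gbz              # or gfa: /path/to/mygraph.gfa (unzipped)
ref_paths_list: config/my.ref_paths_list.txt
seqn_prefix: "GRCh38#0#"               # prefix to strip from path names, if any
sample_tsv: config/my_samples.tsv
sample: samp1                          # optional: restrict the run to one sample
EOF

snakemake --configfile my.config.yaml -p all --cores 2 -n
```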
The final steps of the pipeline, like variant calling, use information from a linear reference genome.
Specifically, they use a FASTA file, an FAI index for that file, and a sequence dictionary.
If not provided by the user, the sequence will be extracted from the pangenome and indexed.
Although this is fast, the user can still provide a reference FASTA file and indexes.
The FASTA file is specified in ref_fa of the config. The indexes should be located in the same directory as the FASTA file and named appropriately, for example ref.fa.fai and ref.dict for a ref.fa FASTA file.
Providing these files is only recommended if they are already available. If not, there is no need to prepare them: the pipeline will do it.
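If you do already have the reference FASTA, here is a sketch of how the matching indexes could be prepared, assuming samtools is installed (the workflow will create equivalent files itself if you skip this):

```sh
# Index an existing reference FASTA so the workflow can use it directly.
samtools faidx ref.fa                # creates ref.fa.fai
samtools dict ref.fa -o ref.dict     # creates the sequence dictionary
# Then set ref_fa: ref.fa in the config.
```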
If adapter and synthetic contaminant sequences are provided, the raw reads in the FASTQs will first be trimmed. QC of the FASTQ files before and after trimming is also performed.
The synthetic contaminant sequence(s) are provided as a FASTA file using refsynt_fa in the config.
The adapter sequence(s) are provided in both FASTA format and as a TSV file, with adapters_fa and adapters_tsv.
See examples of these files in config/config.trimming.yaml, testdata/refsynt.fa, testdata/adapters-test.fa, and testdata/adapters-test.tsv.
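For instance, trimming could be enabled on the command line with the test files (a sketch; config/config.trimming.yaml shows the config-file equivalent):

```sh
snakemake --configfile config/config.yaml \
  --config refsynt_fa=testdata/refsynt.fa \
           adapters_fa=testdata/adapters-test.fa \
           adapters_tsv=testdata/adapters-test.tsv \
  -p map_short_reads --cores 2 -n
```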
The reads can be realigned after projection to the linear reference genome. We use ABRA2 for this step. It improves the indels called by DeepVariant but increases the runtime significantly. To switch the realignment off, set indel_realign_reads to False in the config.
Realignment is only recommended if you are interested in the best performance for indel discovery. If the main goal of the analysis is to look for SNVs, or to genotype SVs, indel_realign_reads should be set to False. Worst case, the BAMs can always be realigned later if really needed.
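For example, to skip realignment when only SNVs or SV genotypes are of interest:

```sh
snakemake --configfile config/config.yaml --config indel_realign_reads=False \
  -p call_small_variants_from_short_reads --cores 2 --use-singularity -n
```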
The seqn_prefix value, in the config, specifies the prefix that should be removed from the sequence names.
For example, the HPRC pangenomes use path names that look like GRCh38#0#chr1 to encode the genome of origin, haplotype, and contig name. Hence, in the config file (e.g. config/config.hprc.yaml), we use GRCh38#0# as seqn_prefix.
Caution: if provided as input files, the reference FASTA, index, and dict should already have this prefix stripped. The workflow won't strip this sequence prefix from user-provided files.
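For example, with the HPRC graph the prefix can also be set on the command line; with it, a path like GRCh38#0#chr1 is written as chr1 in the output BAM/VCF files:

```sh
snakemake --configfile config/config.hprc.yaml --config seqn_prefix='GRCh38#0#' \
  -p all --cores 8 --use-singularity -n
```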
In practice, we can run one job (with very low resources) where we call the snakemake command, and that job will launch more jobs for each task.
The minimal dependencies would be Python (with snakemake and pandas installed) and Singularity.
Then, the snakemake command could look like:
snakemake --configfile my.config.yaml --slurm --profile profile/default --use-singularity -p all
In some cases, we might want to use tools already installed on the HPC, for example by loading "modules", instead of the singularity/docker containers. This is not implemented yet, but could be added if there is interest.
If DeepVariant is run through containers with Singularity (default currently), a parameter needs to be added to enable GPU usage.
In addition to --use-singularity, we should add --singularity-args '--nv -B .:/dum' to the snakemake command.
This will make sure that the singularity instances are run with --nv.
(The -B .:/dum part is just a hack to avoid a bug in snakemake.)
In addition, the value of use_gpu must be set to True in the config, either by editing the YAML config file passed with --configfile, or by adding --config use_gpu=True to the snakemake command.
In summary, the snakemake command, when using GPUs, looks like this:
snakemake --configfile my.config.yaml -p all --use-singularity --singularity-args '--nv -B .:/dum' --config use_gpu=True
Using Slurm, it would look like this:
snakemake --configfile my.config.yaml -p all --slurm --profile profile/default --use-singularity --singularity-args '--nv -B .:/dum' --config use_gpu=True
Activating GPUs on Slurm depends on the Slurm environment and varies from platform to platform.
To adapt to your Slurm environment, modify dv_call_variants_gpu:slurm_extra in the profile configuration file (passed with --profile, see the default config).
This parameter specifies which options should be passed to Slurm to run a job with GPUs.
For example, it's currently dv_call_variants_gpu:slurm_extra=--partition=gpuq --gres=gpu:A100_1g.10gb:1 because we need to include --partition=gpuq --gres=gpu:A100_1g.10gb:1 to use the special gpuq queue and request GPUs in our Slurm environment (Genotoul).
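As a sketch only, the corresponding entry in the profile's config.yaml might look like the comment below; the exact syntax depends on your Snakemake version and Slurm setup, so check the default profile shipped with this repo.

```sh
# In profile/default/config.yaml (sketch; adjust to your Snakemake version):
#
#   set-resources:
#     - dv_call_variants_gpu:slurm_extra="--partition=gpuq --gres=gpu:A100_1g.10gb:1"
```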
The first way to minimize disk usage is to make sure temporary files don't accumulate. For this reason, temporary files have been marked so that they will be removed as soon as they are not needed.
Secondly, if we have a large number of samples, we would prefer to run them in batches rather than starting all the samples at once.
Higher priority has been given to tasks later in the pipeline so jobs that continue/finish ongoing samples will be preferred over jobs starting new ones.
For the same reason, we can also force the pipeline not to analyze more than N samples at a time by creating fake dependencies.
See an example in workflow/Snakefile_disk_opt_slurm and the max_samps config variable it uses.
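A hedged example of running that disk-optimized variant (the batch size here is arbitrary; max_samps limits how many samples are processed at once):

```sh
snakemake -s workflow/Snakefile_disk_opt_slurm --configfile my.config.yaml \
  --config max_samps=5 \
  --slurm --profile profile/default --use-singularity -p all
```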
Another approach might be to migrate/archive the results once they are ready (and delete the local copy from the server).
To do this, one could tweak the main Snakefile, adding a rule to perform this file transfer. Example coming soon.
For this mode, we can use rm_all_on_success=True in the Snakemake configuration to make sure even the "final" output files are removed as soon as they are not needed (in this scenario, once they've been migrated/archived).
snakemake --use-singularity --configfile config/config.yaml -p all --cores 2 -n