Software repository for our article "Integration of mutational signature analysis with 3D chromatin data unveils differential AID-related mutagenesis in indolent lymphomas", provided for reproducibility purposes.
You can also use your own data: everything is automated, so the pipeline is easy to run if you want a general landscape of the mutational signatures in your samples.
- Creation of mutation list from VCF files (optional)
- Collection of variants extra info (context, AID motifs, SNV in Ig loci)
- SBS signature extraction using SigProfiler
- Sample fitting against COSMIC Signatures using deconstructSigs
- Signature reconstruction using NNLS approach
- A report including plots to graphically visualize the obtained results
First, you need to have Nextflow (>=20.07) and Singularity installed.
You have two options: starting from the VCFs or starting from a list of variants.
If you want to start from VCFs, prepare a CSV file with 3 columns:
- name: will be used as a sample name for the corresponding file
- group: will be used to separate your samples in the general representations (for example, it could be pathology, sample origin, etc.)
- file: path to the VCF file; absolute paths are recommended to avoid path resolution issues
It should look like this:
name,group,file
CLL_01,CLL/MBL,/home/catg/vcf/CLL_01.snp.filter.som.recode.vcf.hg38_multianno.vcf
CLL_02,CLL/MBL,/home/catg/vcf/CLL_02.snp.filter.som.recode.vcf.hg38_multianno.vcf
FL_01_1,FL,/home/catg/vcf/FL_01_1.filter.som.recode.vcf.hg38_multianno.vcf
FL_01_2,FL,/home/catg/vcf/FL_01_2.filter.som.recode.vcf.hg38_multianno.vcf
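If you have many VCFs, the list can be generated rather than typed. The following is only a sketch: `write_vcf_list` and the `group_of` rule are hypothetical helpers, and taking the sample name as the file name up to the first dot is an assumption that happens to match the example above.

```python
import csv
from pathlib import Path


def write_vcf_list(vcf_dir, group_of, out_csv):
    """Write a name,group,file CSV for every *.vcf file found in vcf_dir.

    group_of: callable mapping a sample name to its group (hypothetical rule).
    """
    rows = []
    for vcf in sorted(Path(vcf_dir).glob("*.vcf")):
        # Assumption: the sample name is the file name up to the first dot.
        name = vcf.name.split(".")[0]
        # resolve() yields an absolute path, as recommended for the file column.
        rows.append({"name": name, "group": group_of(name), "file": str(vcf.resolve())})
    with open(out_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["name", "group", "file"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

A call such as `write_vcf_list("/home/catg/vcf", lambda n: "FL" if n.startswith("FL") else "CLL/MBL", "vcf_list.csv")` would then produce a file in the format shown above.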
If you already have a list with your variants, create a CSV file with this format:
sample,group,chrom,pos,ref,alt
CLL_01,CLL/MBL,4,89250352,T,C
CLL_01,CLL/MBL,5,49600750,T,C
CLL_01,CLL/MBL,5,49600906,A,C
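Before launching a run, it can help to sanity-check a hand-made variant list. This is a minimal sketch, not part of the pipeline: `check_snv_list` is a hypothetical helper, and it assumes that the header matches the example exactly and that every row is a single-base substitution.

```python
import csv
import io

EXPECTED_HEADER = ["sample", "group", "chrom", "pos", "ref", "alt"]
BASES = set("ACGT")


def check_snv_list(handle):
    """Return a list of problems found in an SNV list; an empty list means none."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        pos = row["pos"] or ""
        if not pos.isdecimal() or int(pos) < 1:
            problems.append(f"line {line_no}: pos must be a positive integer")
        ref, alt = (row["ref"] or "").upper(), (row["alt"] or "").upper()
        # Assumption: only single-base substitutions belong in this list.
        if ref not in BASES or alt not in BASES or ref == alt:
            problems.append(f"line {line_no}: ref/alt must be two different single bases")
    return problems


example = io.StringIO(
    "sample,group,chrom,pos,ref,alt\n"
    "CLL_01,CLL/MBL,4,89250352,T,C\n"
)
print(check_snv_list(example))
```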
To run the pipeline, execute:
nextflow run CATG-UMAG/bcell-lymphomas-mutational-signatures -r main <params>
In <params>, you need to provide the inputs and any other options. These are:
| Parameter | Required | Default | Description |
|---|---|---|---|
| --vcf_list | yes* | | Input CSV if you want to start from the VCFs (see the previous section). Ignored if --snv_list is not empty. |
| --snv_list | yes* | | Input CSV if you want to start from the list of variants (see the previous section). |
| --reference | yes | | Reference genome in 2bit format. Must be the same one used for variant calling (for example, hg19 or hg38). |
| --ig_list | yes | | BED file containing the ranges for the Ig loci. Check data/iglist_hg38.bed for an example. |
| --nsignatures_min | no | 2 | Minimum number of signatures to test with SigProfiler. |
| --nsignatures_max | no | 5 | Maximum number of signatures to test with SigProfiler. |
| --nsignatures_force | no | | Ignore the recommendation from SigProfiler regarding the optimal number of signatures, and use a fixed number of signatures as the final output. Must be a number between the nsignatures_min and nsignatures_max values (both inclusive). |
| --cosmic_version | no | 3.2 | Version of COSMIC signatures to use. Check data/cosmic_signatures_urls.csv for possible options. |
| --cosmic_genome | no | GRCh38 | COSMIC signatures genome. Check data/cosmic_signatures_urls.csv for possible options. |
| --fitting_selected_signatures | no | | Select only a set of reference signatures for the fitting. The value should be a string containing valid signature names from the selected COSMIC version, separated by commas. Example: "SBS1,SBS3,SBS5,SBS6,SBS9,SBS84" |
| --fitting_extra_signatures | no | | Provide additional (local) signatures for the fitting. Must be a CSV file; check data/extra_signatures.csv for the format. |
| --results_dir | no | results | Output directory to store the results. |
| --sigprofiler_cpus | no | 8 | Number of CPUs to use with SigProfiler. |
| --sigprofiler_gpu | no | False | Use a GPU in SigProfiler. It must be a supported CUDA device. |

\* Provide either --vcf_list or --snv_list.
So, for example, a full execution command should look like this:
nextflow run CATG-UMAG/bcell-lymphomas-mutational-signatures -r main \
--snv_list data/snv_list.csv --reference data/hg38.2bit --ig_list data/iglist_hg38.bed \
--nsignatures_min 2 --nsignatures_max 10 --fitting_selected_signatures 'SBS1,SBS3,SBS5,SBS6,SBS9,SBS84'
Alternatively, you can provide a YAML file containing all the parameters you want to set (that way you don't have to write everything on the command line). Just download params.example.yml and edit it to your needs (you can delete parameters from the file if you don't want to use them). Then execute the pipeline like this:
nextflow run CATG-UMAG/bcell-lymphomas-mutational-signatures -r main -params-file params.yml
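A parameters file can also be generated from a script. The sketch below is an illustration under stated assumptions: `build_params` and `write_params` are hypothetical helpers, the keys are taken to be the parameter names from the table without the leading dashes, and the file is written as JSON, which Nextflow accepts as a parameters file in addition to YAML. It also checks the documented rule that nsignatures_force must lie between nsignatures_min and nsignatures_max.

```python
import json


def build_params(snv_list, reference, ig_list, nsignatures_min=2,
                 nsignatures_max=5, nsignatures_force=None, **extra):
    """Assemble pipeline parameters and check the signature-count bounds."""
    if nsignatures_min > nsignatures_max:
        raise ValueError("nsignatures_min must not exceed nsignatures_max")
    params = {
        "snv_list": snv_list,
        "reference": reference,
        "ig_list": ig_list,
        "nsignatures_min": nsignatures_min,
        "nsignatures_max": nsignatures_max,
    }
    if nsignatures_force is not None:
        # The forced value must lie within [min, max], both inclusive.
        if not nsignatures_min <= nsignatures_force <= nsignatures_max:
            raise ValueError("nsignatures_force is outside the min/max range")
        params["nsignatures_force"] = nsignatures_force
    # Any other option from the table can be passed as a keyword argument.
    params.update(extra)
    return params


def write_params(params, path):
    # Written as JSON, so no YAML library is needed.
    with open(path, "w") as handle:
        json.dump(params, handle, indent=2)
```

For instance, `write_params(build_params("data/snv_list.csv", "data/hg38.2bit", "data/iglist_hg38.bed", nsignatures_max=10), "params.json")` writes a file that can be passed to the pipeline in place of params.yml.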
You can also use any option available in Nextflow.
It's also very easy to run on a computing cluster (as long as Singularity is available). I included a profile for SLURM (-profile slurm); if your cluster uses a different scheduler, check the Nextflow documentation to find the corresponding configuration.
Once the pipeline has finished running, you will find a set of files in the output directory. These are:
- snv_list.csv: a CSV file with all the variants (if you used a variant list as input, it will be the same file with extra columns)
- extraction/
  - signatures.csv: the signatures extracted from your samples
  - contributions.csv: a list containing the number of mutations contributed by each signature to every one of your samples
  - statistics.csv: metrics collected from the extraction of the different numbers of signatures
  - sigprofiler_out: the raw output from SigProfiler
- fitting.csv: the results of the sample fitting process using reference signatures
- reconstruction/: reconstruction of each one of the extracted de novo signatures using reference signatures
- report/: a summary of all the obtained information with plots, in .html for easy visualization and .ipynb (Jupyter Notebook) for editing
If this repository was useful to you, please cite it as follows:
Sepulveda-Yanez JH, Alvarez-Saravia D, Fernandez-Goycoolea J, Aldridge J, van Bergen CAM, Posthuma W, Uribe-Paredes R, Veelken H, Navarrete MA. Integration of Mutational Signature Analysis with 3D Chromatin Data Unveils Differential AID-Related Mutagenesis in Indolent Lymphomas. International Journal of Molecular Sciences. 2021; 22(23):13015. https://doi.org/10.3390/ijms222313015
- Python libraries: cyvcf2, twobitreader, SigProfilerExtractor and all of its dependencies
- R libraries: cluster, cowplot, deconstructSigs, factoextra, IRkernel, NNLS, R.utils, tidyverse
- Others: Jupyter, Nextflow, Singularity
In containers/ you can find the recipes used to build the containers for the pipeline (hosted in GitHub Container Registry). These are the ones configured in nextflow.config, alongside others from BioContainers.