From b7c4bbd24cbf1a3ccd2b3d6627f8b4f7db740814 Mon Sep 17 00:00:00 2001
From: Shuangxi Ji
Date: Wed, 8 May 2024 10:35:17 -0500
Subject: [PATCH] Update README

Includes descriptions about the acceptance of MuSE 2 by Genome Research.
---
 MuSE.Snakemake/README.md | 24 +++++++++++++++++-------
 README.md                | 23 +++++++++++++++--------
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/MuSE.Snakemake/README.md b/MuSE.Snakemake/README.md
index 70bc164..9c88236 100644
--- a/MuSE.Snakemake/README.md
+++ b/MuSE.Snakemake/README.md
@@ -33,7 +33,6 @@ We suggest to download them from the Broad Institute Resource Bundle, and save t
  - Homo_sapiens_assembly38.dict
  - Homo_sapiens_assembly38.fasta.fai
  - Homo_sapiens_assembly38.dbsnp138.vcf
- - Homo_sapiens_assembly38.dbsnp138.vcf.idx
  - Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
  - Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
@@ -42,10 +41,21 @@ We suggest to download them from the Broad Institute Resource Bundle, and save t
  - Homo_sapiens_assembly19.dict
  - Homo_sapiens_assembly19.fasta.fai
  - Homo_sapiens_assembly19.dbsnp.vcf
- - Homo_sapiens_assembly19.dbsnp.vcf.idx
  - Mills_and_1000G_gold_standard.indels.b37.vcf.gz
  - Mills_and_1000G_gold_standard.indels.b37.vcf.gz.tbi
+**Note:** For Homo_sapiens_assembly38.dbsnp138.vcf and Homo_sapiens_assembly19.dbsnp.vcf, please use the following commands to compress and index:
+
+```
+bgzip -c Homo_sapiens_assembly38.dbsnp138.vcf > Homo_sapiens_assembly38.dbsnp138.vcf.gz
+tabix -p vcf Homo_sapiens_assembly38.dbsnp138.vcf.gz
+```
+or
+```
+bgzip -c Homo_sapiens_assembly19.dbsnp.vcf > Homo_sapiens_assembly19.dbsnp.vcf.gz
+tabix -p vcf Homo_sapiens_assembly19.dbsnp.vcf.gz
+```
+
 Additionaly, Strelka2 requires a bed file to specific the contigs to call mutations.
 One can download it from here: hg38 (download both [hg38.bed.gz](https://drive.google.com/file/d/1vrZuTrkRfP6e1agexpHJdST-JZpRmpjc/view?usp=sharing) and [hg38.bed.gz.tbi](https://drive.google.com/file/d/1PXq-AnqUmZHNfPpxfMwFed0D3TkU6pOS/view?usp=sharing)), hg19 (download both [hg19.bed.gz](https://drive.google.com/file/d/1kgpFMnw2h8duU7ts2DHFj3Ksewovv5cb/view?usp=sharing) and [hg19.bed.gz.tbi](https://drive.google.com/file/d/1yzb4K9J7ignDBCWzNBDJJmJpSpn886c5/view?usp=sharing)). Keep them in the same folder as the reference files.
@@ -140,10 +150,10 @@ output_file_collection.append("SNVCalling/FinalMAF/final.maf")
 ## Reference
-```
-1. Ji, S., Zhu, T., Sethia, A., Wang, W. (2023) 'Accelerated somatic mutation calling for whole-genome and whole-exome sequencing data from heterogenous tumor samples', bioRxiv.2023.07.04.547569. doi: https://doi.org/10.1101/2023.07.04.547569.
-2. Kim, S. et al. (2018) 'Strelka2: fast and accurate calling of germline and somatic variants', Nature Methods. Nature Publishing Group, 15(8), pp. 591-594. doi: 10.1038/s41592-018-0051-x.
+1. Ji S, Zhu T, Sethia A, **Wang W**. Accelerated somatic mutation calling for whole-genome and whole-exome sequencing data from heterogenous tumor samples. Genome Res. 2024 May 3;. doi: 10.1101/gr.278456.123.
+
+2. Kim S, et al. Strelka2: fast and accurate calling of germline and somatic variants. Nature Methods. 2018 Aug 15. 591-594. doi: 10.1038/s41592-018-0051-x.
+
+3. McLaren W, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016 Jun 6. 1-14. doi: 10.1186/S13059-016-0974-4/TABLES/8.
 
-3. McLaren, W. et al. (2016) 'The Ensembl Variant Effect Predictor', Genome biology, 17(1), pp. 1-14. doi: 10.1186/S13059-016-0974-4/TABLES/8.
-```
diff --git a/README.md b/README.md
index 72c9d67..e29147d 100644
--- a/README.md
+++ b/README.md
@@ -3,13 +3,17 @@
 An accurate and ultra-fast somatic mutation calling tool for whole-genome sequencing (WGS) and whole-exome sequencing (WES) data from heterogeneous tumor samples. This tool is unique in accounting for tumor heterogeneity using a sample-specific error model that improves sensitivity and specificity in mutation calling from sequencing data. The latest version of this software is **v2.1**.
 ## News
-**We are excited to announce the launch of an automated pipeline designed for rapid consensus mutation calling, MuSE.Snakemake. This pipeline also includes both pre- and post-preprocessing stages, eliminating the needs for the manual curation of each task. [Check it out!](https://github.com/wwylab/MuSE/blob/master/MuSE.Snakemake/README.md)**
+
+- **We are thrilled to share that MuSE 2 is published on Genome Research.** Find the paper at [https://genome.cshlp.org/content/early/2024/05/03/gr.278456.123.long](https://genome.cshlp.org/content/early/2024/05/03/gr.278456.123.long).
+
+- **We are excited to announce the launch of an automated pipeline designed for rapid consensus mutation calling, MuSE.Snakemake.** This pipeline starts with BAM or FASTQ files from tumor-normal pairs of a cancer patient cohort, followed by preprocessing stages for the sequencing reads and the intersection of calls from MuSE 2 and [Strelka2](https://github.com/Illumina/strelka). It also includes postprocessing stages for the read depth adjustment and functional annotation for the calls. This pipeline is optimized for High-Performance Computing environments, reducing manual task curation and complexity, thereby making genetic variant analysis accessible to users like clinicians without deep bioinformatics expertise. Please visit the [README of MuSE.Snakemake](https://github.com/wwylab/MuSE/blob/master/MuSE.Snakemake/README.md) for tutorial.
+
 ## Introduction
 Detection of somatic point mutations is a key component of cancer genomics research, which has been rapidly developing since next-generation sequencing (NGS) technology revealed its potential for describing genetic alterations in cancer. We previously launched MuSE 11, a statistical approach for mutation calling based on a Markov substitution model for molecular evolution. It has been used as a major contributing caller in a consensus calling strategy by the TCGA PanCanAtlas project2 and the ICGC Pan-Cancer Analysis of Whole Genomes (PCAWG) initiative3.
-We have now released MuSE 24, which is powered by a multi-threaded producer-consumer model and an efficient way of memory allocation. MuSE 2 speeds up 50 times than MuSE 1 and 8-80 times than the other callers adopted in the Genomic Data Commons DNA-seq analysis pipeline, i.e., [MuTect2](https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2), [SomaticSniper](https://gmt.genome.wustl.edu/packages/somatic-sniper/) and [VarScan2](https://varscan.sourceforge.net/). MuSE 2 can reduce the computing time cost of a somatic mutation calling project from ∼40 hours to < 1 hour for WGS data, and from 2-4 hours to ~5 minutes for WES data, from each pair of tumor-normal samples. We also performed a benchmarking study, which suggests combining MuSE 2 and the recently accelerated [Strelka2](https://github.com/Illumina/strelka) can almost fully recover PCAWG consensus mutation calls (based on 4 popular callers), as well as recover a majority of the TCGA consensus mutation calls (based on 5 popular callers). Please find our preprint for more information at [https://www.biorxiv.org/content/10.1101/2023.07.04.547569v1](https://www.biorxiv.org/content/10.1101/2023.07.04.547569v1).
+We have now released MuSE 24, which is powered by a multi-threaded producer-consumer model and an efficient way of memory allocation. MuSE 2 speeds up 50 times than MuSE 1 and 8-80 times than the other callers adopted in the Genomic Data Commons DNA-seq analysis pipeline, i.e., [MuTect2](https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2), [SomaticSniper](https://gmt.genome.wustl.edu/packages/somatic-sniper/) and [VarScan2](https://varscan.sourceforge.net/). MuSE 2 can reduce the computing time cost of a somatic mutation calling project from ∼40 hours to < 1 hour for WGS data, and from 2-4 hours to ~5 minutes for WES data, from each pair of tumor-normal samples. We also performed a benchmarking study, which suggests combining MuSE 2 and the recently accelerated [Strelka2](https://github.com/Illumina/strelka) can almost fully recover PCAWG consensus mutation calls (based on 4 popular callers), as well as recover a majority of the TCGA consensus mutation calls (based on 5 popular callers).
 ## Platform
 1. MuSE 1 supports both Linux system and MacOS.
@@ -22,6 +26,9 @@ cd MuSE
 ./install_muse.sh
 ```
 The executable file `MuSE` will be generated in the same directory.
+
+A Docker file is also provided in the repository for building and running MuSE 2 in a Docker container.
+
 ## Pre-processing
 Before running MuSE, raw WES/WGS data need to be processed with the following software, as outlined in the following flowchart. Please refer to GDC best practice guidelines (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/) for a detailed description of the pre-processing pipeline.
@@ -98,12 +105,12 @@ Please follow the [issue report template](https://github.com/wwylab/MuSE/blob/ma
 We thank Mehrzad Samadi and his team from Nvidia Corporation, including Tong Zhu, Timothy Harkins and Ankit Sethia, for their contributions towards implementing accelerating techniques in the ` MuSE call` step in MuSE2.
 ## Reference
-```
-1. Fan, Y. et al. (2016) ‘MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data’, Genome biology, 17(1), p. 178. doi: 10.1186/s13059-016-1029-6.
-2. Ellrott, K. et al. (2018) ‘Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines’, Cell Systems. Cell Press, 6(3), pp. 271-281.e7. doi: 10.1016/j.cels.2018.03.002.
+1. Fan Y, Xi L, Hughes DST, Zhang J, Zhang J, Futreal PA, Wheeler DA and **Wang W**. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biology. 2016. 17:178. doi: 10.1186/s13059-016-1029-6.
 
-3. Campbell, P. J. et al. (2020) ‘Pan-cancer analysis of whole genomes’, Nature. Nature Publishing Group, 578(7793), pp. 82–93. doi: 10.1038/s41586-020-1969-6.
+2. Ellrott K, Bailey MH, Saksena G, Covington KR, Kandoth C, Stewart C, Hess J, Ma S, Chiotti KE, McLellan MD, et al. Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Systems. 2018. 271-281. doi: 10.1016/j.cels.2018.03.002.
+
+3. The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. 2020. Pan-cancer analysis of whole genomes. Nature 578: 82–93. doi: 10.1038/s41586-020-1969-6.
+
+4. Ji S, Zhu T, Sethia A, **Wang W**. Accelerated somatic mutation calling for whole-genome and whole-exome sequencing data from heterogenous tumor samples. Genome Res. 2024 May 3;. doi: 10.1101/gr.278456.123.
 
-4. Ji, S., Zhu, T., Sethia, A., Wang, W. (2023) 'Accelerated somatic mutation calling for whole-genome and whole-exome sequencing data from heterogenous tumor samples', bioRxiv.2023.07.04.547569. doi: https://doi.org/10.1101/2023.07.04.547569.
-```