Update README
Includes descriptions about the acceptance of MuSE 2 by Genome Research.
jiyunmaths committed May 8, 2024
1 parent 7022d7b commit b7c4bbd
Showing 2 changed files with 32 additions and 15 deletions.
24 changes: 17 additions & 7 deletions MuSE.Snakemake/README.md
@@ -33,7 +33,6 @@ We suggest to download them from the Broad Institute Resource Bundle, and save t
- Homo_sapiens_assembly38.dict
- Homo_sapiens_assembly38.fasta.fai
- Homo_sapiens_assembly38.dbsnp138.vcf
- Homo_sapiens_assembly38.dbsnp138.vcf.idx
- Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
- Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi

@@ -42,10 +41,21 @@ We suggest to download them from the Broad Institute Resource Bundle, and save t
- Homo_sapiens_assembly19.dict
- Homo_sapiens_assembly19.fasta.fai
- Homo_sapiens_assembly19.dbsnp.vcf
- Homo_sapiens_assembly19.dbsnp.vcf.idx
- Mills_and_1000G_gold_standard.indels.b37.vcf.gz
- Mills_and_1000G_gold_standard.indels.b37.vcf.gz.tbi

**Note:** For Homo_sapiens_assembly38.dbsnp138.vcf and Homo_sapiens_assembly19.dbsnp.vcf, please use the following commands to compress and index:

```
bgzip -c Homo_sapiens_assembly38.dbsnp138.vcf > Homo_sapiens_assembly38.dbsnp138.vcf.gz
tabix -p vcf Homo_sapiens_assembly38.dbsnp138.vcf.gz
```
or
```
bgzip -c Homo_sapiens_assembly19.dbsnp.vcf > Homo_sapiens_assembly19.dbsnp.vcf.gz
tabix -p vcf Homo_sapiens_assembly19.dbsnp.vcf.gz
```
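The two command pairs above differ only in the file name, so they can be folded into one loop. The following is a minimal sketch and not part of the MuSE.Snakemake repository: the function name `compress_and_index` is hypothetical, and it assumes `bgzip` and `tabix` (from HTSlib) are on the `PATH` and that the VCF files sit in the current directory.

```shell
# Hypothetical helper: bgzip-compress and tabix-index each VCF that exists.
compress_and_index() {
    for vcf in "$@"; do
        if [ ! -f "$vcf" ]; then
            # Only one of the two builds is usually downloaded, so skip the other.
            echo "skip: $vcf not found"
            continue
        fi
        # Same commands as in the note above, applied to the current file.
        bgzip -c "$vcf" > "$vcf.gz" && tabix -p vcf "$vcf.gz"
    done
}

compress_and_index Homo_sapiens_assembly38.dbsnp138.vcf Homo_sapiens_assembly19.dbsnp.vcf
```

Files that are absent are reported and skipped, so the same call works whether the hg38 or the hg19 bundle was downloaded.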

Additionally, Strelka2 requires a BED file to specify the contigs on which to call mutations. It can be downloaded here: hg38 (download both [hg38.bed.gz](https://drive.google.com/file/d/1vrZuTrkRfP6e1agexpHJdST-JZpRmpjc/view?usp=sharing) and [hg38.bed.gz.tbi](https://drive.google.com/file/d/1PXq-AnqUmZHNfPpxfMwFed0D3TkU6pOS/view?usp=sharing)), hg19 (download both [hg19.bed.gz](https://drive.google.com/file/d/1kgpFMnw2h8duU7ts2DHFj3Ksewovv5cb/view?usp=sharing) and [hg19.bed.gz.tbi](https://drive.google.com/file/d/1yzb4K9J7ignDBCWzNBDJJmJpSpn886c5/view?usp=sharing)). Keep them in the same folder as the reference files.


@@ -140,10 +150,10 @@ output_file_collection.append("SNVCalling/FinalMAF/final.maf")


## Reference
```
1. Ji, S., Zhu, T., Sethia, A., Wang, W. (2023) 'Accelerated somatic mutation calling for whole-genome and whole-exome sequencing data from heterogenous tumor samples', bioRxiv.2023.07.04.547569. doi: https://doi.org/10.1101/2023.07.04.547569.

2. Kim, S. et al. (2018) 'Strelka2: fast and accurate calling of germline and somatic variants', Nature Methods. Nature Publishing Group, 15(8), pp. 591-594. doi: 10.1038/s41592-018-0051-x.
1. <ins>Ji S</ins>, Zhu T, Sethia A, **Wang W**. Accelerated somatic mutation calling for whole-genome and whole-exome sequencing data from heterogenous tumor samples. Genome Res. 2024 May 3. doi: 10.1101/gr.278456.123.

2. Kim S, et al. Strelka2: fast and accurate calling of germline and somatic variants. Nature Methods. 2018 Aug 15. 591-594. doi: 10.1038/s41592-018-0051-x.

3. McLaren W, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016 Jun 6. 1-14. doi: 10.1186/S13059-016-0974-4/TABLES/8.

3. McLaren, W. et al. (2016) 'The Ensembl Variant Effect Predictor', Genome biology, 17(1), pp. 1-14. doi: 10.1186/S13059-016-0974-4/TABLES/8.
```
23 changes: 15 additions & 8 deletions README.md
@@ -3,13 +3,17 @@
An accurate and ultra-fast somatic mutation calling tool for whole-genome sequencing (WGS) and whole-exome sequencing (WES) data from heterogeneous tumor samples. This tool is unique in accounting for tumor heterogeneity using a sample-specific error model that improves sensitivity and specificity in mutation calling from sequencing data. The latest version of this software is **v2.1**.

## News
**We are excited to announce the launch of an automated pipeline designed for rapid consensus mutation calling, MuSE.Snakemake. This pipeline also includes both pre- and post-preprocessing stages, eliminating the needs for the manual curation of each task. [Check it out!](https://github.com/wwylab/MuSE/blob/master/MuSE.Snakemake/README.md)**

- **We are thrilled to share that MuSE 2 has been published in Genome Research.** Find the paper at [https://genome.cshlp.org/content/early/2024/05/03/gr.278456.123.long](https://genome.cshlp.org/content/early/2024/05/03/gr.278456.123.long).

- **We are excited to announce the launch of an automated pipeline designed for rapid consensus mutation calling, MuSE.Snakemake.** This pipeline starts with BAM or FASTQ files from tumor-normal pairs of a cancer patient cohort, followed by preprocessing stages for the sequencing reads and the intersection of calls from MuSE 2 and [Strelka2](https://github.com/Illumina/strelka). It also includes postprocessing stages for read depth adjustment and functional annotation of the calls. This pipeline is optimized for High-Performance Computing environments, reducing manual task curation and complexity, thereby making genetic variant analysis accessible to users, such as clinicians, without deep bioinformatics expertise. Please visit the [README of MuSE.Snakemake](https://github.com/wwylab/MuSE/blob/master/MuSE.Snakemake/README.md) for a tutorial.
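
The intersection step described above can be illustrated with `bcftools isec`. This is only a sketch under stated assumptions: MuSE.Snakemake may implement the intersection differently, the function name `intersect_calls` and all file names are hypothetical, and both inputs are assumed to be bgzip-compressed and tabix-indexed VCFs.

```shell
# Hypothetical helper: keep the records of the first VCF that also appear in the second.
intersect_calls() {
    muse_vcf=$1
    strelka_vcf=$2
    out_vcf=$3
    for f in "$muse_vcf" "$strelka_vcf"; do
        if [ ! -f "$f" ]; then
            echo "missing input: $f" >&2
            return 1
        fi
    done
    # -n=2: sites present in both files; -w1: write records from the first file only.
    bcftools isec -n=2 -w1 -O z -o "$out_vcf" "$muse_vcf" "$strelka_vcf"
}
```

For example, `intersect_calls muse.vcf.gz strelka.snvs.vcf.gz consensus.vcf.gz` would write the shared calls to `consensus.vcf.gz`. By default `bcftools isec` treats records as shared only when position, REF and ALT all match; its `-c` option relaxes this.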


## Introduction

Detection of somatic point mutations is a key component of cancer genomics research, which has been rapidly developing since next-generation sequencing (NGS) technology revealed its potential for describing genetic alterations in cancer. We previously launched MuSE 1<sup>1</sup>, a statistical approach for mutation calling based on a Markov substitution model for molecular evolution. It has been used as a major contributing caller in a consensus calling strategy by the TCGA PanCanAtlas project<sup>2</sup> and the ICGC Pan-Cancer Analysis of Whole Genomes (PCAWG) initiative<sup>3</sup>.

We have now released MuSE 2<sup>4</sup>, which is powered by a multi-threaded producer-consumer model and efficient memory allocation. MuSE 2 is 50 times faster than MuSE 1 and 8-80 times faster than the other callers adopted in the Genomic Data Commons DNA-seq analysis pipeline, i.e., [MuTect2](https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2), [SomaticSniper](https://gmt.genome.wustl.edu/packages/somatic-sniper/) and [VarScan2](https://varscan.sourceforge.net/). MuSE 2 can reduce the computing time of a somatic mutation calling project from ∼40 hours to < 1 hour for WGS data, and from 2-4 hours to ~5 minutes for WES data, for each pair of tumor-normal samples. We also performed a benchmarking study, which suggests that combining MuSE 2 and the recently accelerated [Strelka2](https://github.com/Illumina/strelka) can almost fully recover PCAWG consensus mutation calls (based on 4 popular callers), as well as recover a majority of the TCGA consensus mutation calls (based on 5 popular callers). Please find our preprint for more information at [https://www.biorxiv.org/content/10.1101/2023.07.04.547569v1](https://www.biorxiv.org/content/10.1101/2023.07.04.547569v1).
We have now released MuSE 2<sup>4</sup>, which is powered by a multi-threaded producer-consumer model and efficient memory allocation. MuSE 2 is 50 times faster than MuSE 1 and 8-80 times faster than the other callers adopted in the Genomic Data Commons DNA-seq analysis pipeline, i.e., [MuTect2](https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2), [SomaticSniper](https://gmt.genome.wustl.edu/packages/somatic-sniper/) and [VarScan2](https://varscan.sourceforge.net/). MuSE 2 can reduce the computing time of a somatic mutation calling project from ∼40 hours to < 1 hour for WGS data, and from 2-4 hours to ~5 minutes for WES data, for each pair of tumor-normal samples. We also performed a benchmarking study, which suggests that combining MuSE 2 and the recently accelerated [Strelka2](https://github.com/Illumina/strelka) can almost fully recover PCAWG consensus mutation calls (based on 4 popular callers), as well as recover a majority of the TCGA consensus mutation calls (based on 5 popular callers).

## Platform
1. MuSE 1 supports both Linux and macOS.
@@ -22,6 +26,9 @@ cd MuSE
./install_muse.sh
```
The executable file `MuSE` will be generated in the same directory.

A Dockerfile is also provided in the repository for building and running MuSE 2 in a Docker container.

## Pre-processing
Before running MuSE, raw WES/WGS data need to be processed with the following software, as outlined in the following flowchart. Please refer to GDC best practice guidelines (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/) for a detailed description of the pre-processing pipeline.

@@ -98,12 +105,12 @@ Please follow the [issue report template](https://github.com/wwylab/MuSE/blob/ma
We thank Mehrzad Samadi and his team from Nvidia Corporation, including Tong Zhu, Timothy Harkins and Ankit Sethia, for their contributions towards implementing acceleration techniques in the `MuSE call` step in MuSE 2.

## Reference
```
1. Fan, Y. et al. (2016) ‘MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data’, Genome biology, 17(1), p. 178. doi: 10.1186/s13059-016-1029-6.

2. Ellrott, K. et al. (2018) ‘Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines’, Cell Systems. Cell Press, 6(3), pp. 271-281.e7. doi: 10.1016/j.cels.2018.03.002.
1. <ins>Fan Y</ins>, Xi L, Hughes DST, Zhang J, Zhang J, Futreal PA, Wheeler DA and **Wang W**. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biology. 2016. 17:178. doi: 10.1186/s13059-016-1029-6.

3. Campbell, P. J. et al. (2020) ‘Pan-cancer analysis of whole genomes’, Nature. Nature Publishing Group, 578(7793), pp. 82–93. doi: 10.1038/s41586-020-1969-6.
2. Ellrott K, Bailey MH, Saksena G, Covington KR, Kandoth C, Stewart C, Hess J, Ma S, Chiotti KE, McLellan MD, et al. Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Systems. 2018. 271-281. doi: 10.1016/j.cels.2018.03.002.

3. The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. 2020. Pan-cancer analysis of whole genomes. Nature 578: 82–93. doi: 10.1038/s41586-020-1969-6.

4. <ins>Ji S</ins>, Zhu T, Sethia A, **Wang W**. Accelerated somatic mutation calling for whole-genome and whole-exome sequencing data from heterogenous tumor samples. Genome Res. 2024 May 3. doi: 10.1101/gr.278456.123.

4. Ji, S., Zhu, T., Sethia, A., Wang, W. (2023) 'Accelerated somatic mutation calling for whole-genome and whole-exome sequencing data from heterogenous tumor samples', bioRxiv.2023.07.04.547569. doi: https://doi.org/10.1101/2023.07.04.547569.
```
