Young edited this page May 2, 2023 · 6 revisions

A custom mash reference

The mash reference file /db/RefSeqSketchesDefaults.msh found in the staphb/mash:2.3 Docker image is from RefSeq version 77. There is nothing particularly wrong with this file, but RefSeq version 216 was released on January 13, 2023. Over time, organism names and species boundaries may change. RefSeq also continues to grow with each release, so it is not feasible to keep a current mash reference file in this repository or in a container.

Downloading from Zenodo

A more current mash reference file prepared for Grandeur has been uploaded to Zenodo. It can be downloaded via a browser or from the command line with

wget https://zenodo.org/record/7887021/files/rep-genomes.msh

Then set the params.mash_db parameter to your new file on the command line or in a config file.

params.mash_db = "/path/to/rep-genomes.msh"
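As a minimal sketch of the config-file route, the function below writes a one-line Nextflow config pointing params.mash_db at a downloaded sketch file. The function name write_mash_config and the default output name mash_db.config are hypothetical and not part of Grandeur; only the params.mash_db setting comes from this page.

```shell
# Hypothetical helper (not part of Grandeur): write a one-line Nextflow config
# that points params.mash_db at an existing sketch file.
write_mash_config() {
  msh="$1"                       # path to the downloaded .msh file
  config="${2:-mash_db.config}"  # output config file (assumed default name)

  # Refuse to write a config for a missing or empty file.
  if [ ! -s "$msh" ]; then
    echo "error: '$msh' is missing or empty" >&2
    return 1
  fi

  # Resolve to an absolute path so the config works from any directory.
  abs="$(cd "$(dirname "$msh")" && pwd)/$(basename "$msh")"
  printf 'params.mash_db = "%s"\n' "$abs" > "$config"
}
```

Calling write_mash_config rep-genomes.msh would produce mash_db.config, which could then be passed to Nextflow with its -c option.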

This file was created with mash and the NCBI datasets command-line tool, using the script Grandeur/bin/new_mash_ref.sh

# getting the ids for representative genomes
datasets summary genome taxon bacteria --reference --as-json-lines | \
  dataformat tsv genome --fields accession,assminfo-refseq-category,organism-name --elide-header | \
  grep representative | \
  tee representative_genomes.txt | \
  cut -f 1 > genome_ids.txt

# downloading genomes
datasets download genome accession --inputfile genome_ids.txt --filename rep-genomes.zip

# extracting genomes
unzip rep-genomes.zip

# combining genomes
cat ncbi_dataset/data/*/*.fna | sed 's/ /_/g' | sed 's/,//g' > rep-genomes.fasta

# sketching genomes
mash sketch -i -p 20 rep-genomes.fasta -o rep-genomes
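
To illustrate what the two sed substitutions in the combining step do to a FASTA header, here is a small sketch run on a made-up record (the header below is hypothetical, not a real NCBI entry). Because mash sketch -i sketches each sequence individually, the rewrite presumably ensures the whole header, rather than only the text before the first space, ends up in the sequence ID; that reading is an assumption.

```shell
# Same substitutions as the combining step, wrapped in a function for reuse.
normalize_headers() {
  # Replace every space with an underscore, then drop every comma.
  sed 's/ /_/g' | sed 's/,//g'
}

# Hypothetical two-line FASTA record used only for illustration.
printf '>seq1 Genus species strain X, complete genome\nACGT\n' | normalize_headers
# prints:
# >seq1_Genus_species_strain_X_complete_genome
# ACGT
```

Sequence lines contain no spaces or commas, so they pass through unchanged.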

Grandeur parses the output file from mash dist and is particular about genus and species being at the beginning of the id line (i.e. > ${genus}_${species}_...), so there may be compatibility issues with other mash references.
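
One rough way to check another mash reference against this naming expectation is to list any IDs that do not start with a genus_species_ prefix. The function name find_nonconforming_ids and the regular expression are assumptions made for illustration; Grandeur's actual parsing may be stricter or looser (for example, species names containing hyphens or periods would be flagged here).

```shell
# Hypothetical check (not part of Grandeur): read sequence IDs on stdin and
# print those that do NOT begin with a "Genus_species_" style prefix.
find_nonconforming_ids() {
  # Assumed pattern: capitalized genus, lowercase species, each followed by "_".
  # LC_ALL=C keeps the character ranges strictly ASCII.
  # "|| true" keeps the exit status at 0 when grep selects no lines.
  LC_ALL=C grep -Ev '^[A-Z][a-z]+_[a-z]+_' || true
}

# Example with two made-up IDs; only the second is reported.
printf 'Escherichia_coli_example\nNZ_000000.1_Escherichia_coli\n' | find_nonconforming_ids
```

With a real sketch file, the IDs could be taken from the tabular output of mash info, for example mash info -t rep-genomes.msh | grep -v '^#' | cut -f 3 | find_nonconforming_ids, assuming the ID is the third column of that output.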

More information about this and other RefSeq releases can be found at https://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/.
