Read recruitment pipeline for comparison of sag abundance across metagenomes
python 2.7
Flash
bwa
samtools
bedtools
To install with conda:
# create conda virtual env
conda create --name smr python=2.7
# activate environment
source activate smr
# install dependencies using conda
conda install -c bioconda flash bwa bedtools samtools pysam biopython
conda install matplotlib pandas numpy
# optional if you plan on using the checkM functionality:
conda install -c bioconda checkm
# download this repo
wget https://github.com/BigelowLab/sag-mg-recruit/archive/master.zip
# unarchive repo
unzip master.zip
# navigate into repo and run install script
cd sag-mg-recruit-master
python setup.py install
# to leave sag-mg-recruit environment
source deactivate smr
Whenever you want to run sag-mg-recruit, activate your environment:
source activate smr
For instructions on how to run type:
sag-mg-recruit --help
which will return:
$ sag-mg-recruit --help
Usage: sag-mg-recruit [OPTIONS] INPUT_MG_TABLE INPUT_SAG_TABLE
Options:
--outdir TEXT directory location to place output files
--cores INTEGER number of cores to run on [default: 8]
--mmd FLOAT for join step: mismatch density [default: 0.05]
--mino INTEGER for join step: minimum overlap [default: 35]
--maxo INTEGER for join step: maximum overlap [default: 150]
--minlen INTEGER for alignment and mg read count: minimum alignment
length to include; minimum read size to include
[default: 150]
--pctid INTEGER for alignment: minimum percent identity to keep
within overlapping region [default: 95]
--overlap INTEGER for alignment: percent read that must overlap with
reference sequence to keep [default: 0]
--log TEXT name of log file, else, log sent to standard out
--concatenate BOOLEAN include concatenated SAG in analysis [default: True]
--checkm BOOLEAN should checkm be run on the SAGs? [default: True]
--keep_coverage if you want to keep the genome coverage table (large)
-h, --help Show this message and exit.
input_mg_table
should be a csv file created using the mg_template.xlsx within the program directory. The columns are:
- name: desired name used to refer to each metagenome; any string without spaces, special characters or "."
- mg_f: path to forward reads for metagenome
- mg_r: path to reverse metagenomic reads (if there are none, put "None" in this column)
- wgs_technology: specify whether the library was sequenced with either illumina (enter "illumina") or 454-pyrosequencing (enter "pyro")
- join: designate whether you want the metagenome to be joined or not; either True or False
See mg_template.xlsx for example table. Input your own information and save as a .csv file.
intput_sag_table
is a csv formatted table with the following columns:
- sag_name: any string with no spaces or '.'
- fasta_file: None if sag should be masked, otherwise path to input fasta to be processed
- gbk_file: path to SAG's annotated gbk file if mask = True
- mask: boolean indicating whether the SAG should have the 16/23S sequences masked (TRUE) or not (FALSE)
See sag_template.xlsx for example table. Input your own information in excel and save as a .csv file.
for help:
sag-mg-recruit -h
to run with 95% identity alignment, 40 cores, minimum read length of 100:
sag-mg-recruit --outdir <path to output dir> --cores 40 --minlen 100 --pctid 95 --log recruitment.log <input mg table> <input sag table>
The program will create a new output directory with three sub-directories:
coverage
, mgs
and sags
which will contain various output files created during the recruitment process.
The final output table will be located in the main output directory, called "summary_table_pctidXX_minlenXX_overlapXX.txt".
That table has one row per mg-sag pair with the following columns:
- sag: SAG name
- metagenome: metagenome name
- Percent_scaffolds_with_any_coverage: percent of SAG scaffolds with any coverage
- Percent_of_reference_bases_covered: percent of SAG bases covered by at least one read
- Average_coverage: mean coverage across the SAG
- total_reads_recruited: total number of metagenomic reads recruited to the SAG
- mg_wgs_technology: metagenome sequencing technology used, designated in the input_mg_table
- mg_read_count: total number of reads used in the processed metagenome
- sag_completeness: % sag completeness as determined by checkm
- sag_total_bp: number of bp in input SAG fasta file
- sag_size_mbp: SAG size in megabasepairs (mbp)
- reads_per_mbp: number of reads recruited per SAG mbp
- prop_mgreads_per_mbp: proprtion metagenomic reads recruited per SAG mbp
After you've run sag-mg-recruit, you can re-run the analysis with changed --pctid
, --overlap
, or --minlen
parameters much more quickly than the original run by using the same input files, the same output directory but changing the --pctid
, --overlap
and/or --minlen
parameters.
For example, say your first run was:
sag-mg-recruit --outdir /home/smroutput --cores 40 --minlen 150 --pctid 95 --overlap 95 --log recruitment.log mgtbl.txt sagtbl.txt
Re-run with different minlen and pctid parameters:
sag-mg-recruit --outdir /home/smroutput --cores 40 --minlen 100 --pctid 90 --overlap 100 --log recruitment.log mgtbl.txt sagtbl.txt
Note a quirk of one of the dependencies (FLASH) which is used for the join step: if there are metagenomic reads that need to be joined, make sure the output directory is specificied as a relative rather than absolute path.
Other scripts can be found in smr/scripts/
extract_smr_reads.py: Extracts unaligned and aligned reads from smr run, and writes to the smr run's coverage directory in fastq.gz format. Like the main sag-mg-recruit software, this can be run multiple times with different parameters. Parameters are included in the output file name.
Usage: extract_smr_reads.py [OPTIONS] SMR_DIR
Options:
--minlen INTEGER for alignment and mg read count: minimum alignment length
to include; minimum read size to include [default: 150]
--pctid INTEGER for alignment: minimum percent identity to keep within
overlapping region [default: 95]
--overlap INTEGER for alignment: percent read that must overlap with
reference sequence to keep [default: 0]
--cores INTEGER how many cores [default: 10]
-h, --help Show this message and exit.
Example command: extract_smr_reads.py --minlen 150 --pctid 95 /smr/output/directory/