Case 2
This example runs through SLURM on a Linux-based compute cluster.
If you want to follow this example, make sure you have the required software installed: doepipeline, SSPACE-LongRead, seqstats, and Python (all used in the steps below).
The sequence data used for this case is available under BioProject accession PRJNA510899 (SRA run accession SRR9290761).
Download the ONT MinION sequence data used for scaffolding:
wget -P ~/case_2/data https://sra-download.ncbi.nlm.nih.gov/traces/sra3/SRZ/009290/SRR9290761/CEPA-francisella-FSC200_20171107.fastq
The assembly being scaffolded in this example case can be downloaded from the doepipeline GitHub repository: case_2_FSC200_assembly.fasta.gz.
wget -P ~/case_2/data https://raw.githubusercontent.com/clicumu/doepipeline/master/supporting/case_2_FSC200_assembly.fasta.gz
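Note: depending on your SSPACE-LongRead version, the gzipped assembly may need to be decompressed before scaffolding; if so, unpack it and point the -c option in case_2.yaml (below) at the resulting .fasta file:
gunzip ~/case_2/data/case_2_FSC200_assembly.fasta.gz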
Please make sure that the file paths and SLURM details in the YAML file are in line with your own setup.
case_2.yaml
# Specify the experimental design
design:
  type: ccf
  # Each factor to vary is specified here.
  # min/max: the smallest/largest values allowed
  # low_init/high_init: the initial design space min and max; only used when the GSD screening step is not run at the beginning.
  # type: ordinal (integer), qualitative (categorical), quantitative (float)
  factors:
    # Factor 'ALEN' is varied between 0 and 5000
    ALEN:
      min: 0
      max: 5000
      low_init: 0
      high_init: 1500
      type: ordinal
    # Factor 'GLEN' is varied between -3000 and 3000
    GLEN:
      min: -3000
      max: 3000
      low_init: -500
      high_init: 500
      type: ordinal
    # Factor 'RRAT' is varied between 0.1 and 0.7
    RRAT:
      min: 0.1
      max: 0.7
      low_init: 0.3
      high_init: 0.5
      type: quantitative
    # Factor 'IDEN' is varied between 30 and 90
    IDEN:
      min: 30
      max: 90
      low_init: 60
      high_init: 80
      type: ordinal
  # Define the responses
  responses:
    # Response 'N50' represents the assembly quality in terms of contiguity
    # and should be maximized
    N50:
      criterion: maximize

# Set the working directory (all output goes here)
working_directory: "~/case_2/doepipeline_output"

# The name of the results file (containing the response values). This file should be structured like:
# RESPONSE1,VALUE
# RESPONSE2,VALUE
results_file: "results.txt"

# Specify the names of the different pipeline steps to be run
pipeline:
  - SSPACE
  - CalculateResponse

# The first pipeline step.
# Here, SSPACE-LongRead is executed with varying values for the parameters -a, -g, -r, and -i.
SSPACE:
  # Define the script that should be run for this step.
  script: >
    perl ~/case_2/bin/SSPACE-LongRead_v1-1/SSPACE-LongRead.pl
    -c ~/case_2/data/case_2_FSC200_assembly.fasta.gz
    -p ~/case_2/data/CEPA-francisella-FSC200_20171107.fastq
    -t 2
    -a {% ALEN %}
    -g {% GLEN %}
    -r {% RRAT %}
    -i {% IDEN %}
  # Specify the factors that are used in this script.
  # We substitute these factors in the script with the values given by the experimental design.
  factors:
    ALEN:
      substitute: true
    GLEN:
      substitute: true
    RRAT:
      substitute: true
    IDEN:
      substitute: true
  # If running through SLURM, specify the SLURM details you need.
  SLURM:
    A: PROJECT
    c: 2
    t: 00:40:00
    o: case_2_slurm.out
    e: case_2_slurm.error

# The second pipeline step
CalculateResponse:
  # Here, two commands are strung together. A custom script is used to format the response.
  script: >
    ~/case_1/bin/seqstats/seqstats PacBio_scaffolder_results/scaffolds.fasta
    > SSPACE_output.scaffolds.fasta.seqstats &&
    python ~/case_2/bin/case_2_scoring.py
    -f SSPACE_output.scaffolds.fasta.seqstats -o {% results_file %}
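For each experiment in the design, doepipeline substitutes the {% FACTOR %} placeholders in the step's script with the values chosen for that experiment before the script is run. With illustrative values (ALEN=1500, GLEN=500, RRAT=0.3, IDEN=60), the rendered SSPACE command would be:
perl ~/case_2/bin/SSPACE-LongRead_v1-1/SSPACE-LongRead.pl \
  -c ~/case_2/data/case_2_FSC200_assembly.fasta.gz \
  -p ~/case_2/data/CEPA-francisella-FSC200_20171107.fastq \
  -t 2 -a 1500 -g 500 -r 0.3 -i 60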
The following script (case_2_scoring.py) is used in the pipeline step CalculateResponse (see case_2.yaml above) to create the result file.
case_2_scoring.py
#!/usr/bin/env python
import argparse

parser = argparse.ArgumentParser(description='summarize responses from scaffolding')
parser.add_argument('-f', '--infile', help='result file from seqstats', required=True)
parser.add_argument('-o', '--outfile', help='doepipeline compatible result file', required=True)
args = parser.parse_args()


def read_result(result_file):
    """Put each row of the result file as a key:value pair of a dictionary."""
    result_dict = {}
    with open(result_file, 'r') as f:
        for row in f:
            # seqstats rows look like 'N 50: <number> bp'
            row = row.strip()
            if not row:
                continue
            row = row.rstrip(" bp")
            key, value = row.split(":")
            value = value.strip()
            result_dict[key] = value
    return result_dict


def write_result(result_dict, outfile):
    """Write the N50 value in the RESPONSE,VALUE format expected by doepipeline."""
    with open(outfile, 'w') as f:
        f.write('N50,{}\n'.format(result_dict['N 50']))


result = read_result(args.infile)
write_result(result, args.outfile)
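To illustrate what CalculateResponse produces (the N50 value below is made up, not a result from this dataset), the seqstats output contains an 'N 50: <value> bp' row, which the scoring script reformats into the single RESPONSE,VALUE line that doepipeline reads:
$ grep "N 50" SSPACE_output.scaffolds.fasta.seqstats
N 50:    250000 bp
$ python ~/case_2/bin/case_2_scoring.py -f SSPACE_output.scaffolds.fasta.seqstats -o results.txt
$ cat results.txt
N50,250000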
The following script (case_2.sbatch) is used to run doepipeline in a SLURM environment. Please make sure that the file paths and SLURM details are in line with your own setup.
case_2.sbatch
#!/bin/bash -l
#SBATCH -A PROJECT
#SBATCH -n 1
#SBATCH -t 06:00:00
#SBATCH -J case_2_doepipeline
#SBATCH -o case_2_doepipeline.out
#SBATCH -e case_2_doepipeline.error
set -e
set -x
# Adjust these paths and settings to match your own setup
# (use ${HOME} rather than "~", since tilde is not expanded inside double quotes)
_WORKDIR="${HOME}/case_2"
_OUTDIR="${_WORKDIR}/doepipeline_output"
_YAML="${_WORKDIR}/yaml/case_2.yaml"
_LOG="${_OUTDIR}/case_2.log"
_REDUCTION_FACTOR=8
_SHRINKAGE="0.9"
# Create output directory if needed
if [ ! -d "${_OUTDIR}" ]; then
    echo "Creating working directory: ${_OUTDIR}"
    mkdir -p "${_OUTDIR}"
fi
# Run doepipeline
doepipeline -e slurm -s ${_SHRINKAGE} -m greedy -l ${_LOG} ${_YAML}
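Once the paths above match your setup, submit the script to SLURM to start the full optimization (assuming it is saved as ~/case_2/case_2.sbatch):
sbatch ~/case_2/case_2.sbatch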