
Scaffolding of a bacterial genome assembly using long ONT MinION reads

Before you start

This example runs through SLURM on a Linux-based compute cluster.

If you want to follow this example, make sure you have the following software installed (all of it is used in the steps below): doepipeline, SSPACE-LongRead (v1-1), seqstats, and Python.
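Before running anything, it can help to verify that the tools are where the commands on this page expect them. A minimal sanity check, assuming the directory layout used in the rest of this page (adjust the paths to your own setup):

# Verify that the tools used on this page are in place
command -v doepipeline || echo "doepipeline is not on PATH"
ls ~/case_2/bin/SSPACE-LongRead_v1-1/SSPACE-LongRead.pl
ls ~/case_1/bin/seqstats/seqstats
python --version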

Data preparation

The sequence data used for this case is available under BioProject accession PRJNA510899 (SRA run accession SRR9290761).

Download the ONT MinION sequence data used for scaffolding:

wget -P ~/case_2/data https://sra-download.ncbi.nlm.nih.gov/traces/sra3/SRZ/009290/SRR9290761/CEPA-francisella-FSC200_20171107.fastq
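Direct links under sra-download.ncbi.nlm.nih.gov are not always stable. As an alternative, the same run can be fetched with the NCBI sra-tools (a sketch, assuming sra-tools is installed and on your PATH):

# Fetch run SRR9290761 with sra-tools instead of the direct URL
prefetch SRR9290761
fasterq-dump SRR9290761 -O ~/case_2/data

Note that fasterq-dump names its output SRR9290761.fastq, so if you go this route, point the -p argument in the YAML below at that file instead.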

The assembly being scaffolded in this example can be downloaded from the doepipeline GitHub repository: case_2_FSC200_assembly.fasta.gz. Note that wget needs the raw file URL rather than the repository page:

wget -P ~/case_2/data https://github.com/clicumu/doepipeline/raw/master/supporting/case_2_FSC200_assembly.fasta.gz
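SSPACE-LongRead may not read gzip-compressed FASTA input directly, so it is safest to unpack the assembly and keep the compressed original around:

# Unpack the assembly; -k keeps the compressed original
gunzip -k ~/case_2/data/case_2_FSC200_assembly.fasta.gz

If you unpack it, point the -c argument in the YAML below at the resulting case_2_FSC200_assembly.fasta rather than the .gz file.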

YAML configuration file

Please make sure that the file paths and SLURM details in the YAML file are in line with your own setup.

case_2.yaml

# Specify the experimental design
design:
    type: ccf
    
    # Each factor to vary is specified here.
    # min/max: the smallest/largest values allowed
    # low_init/high_init: the min and max of the initial design space. These are
    #   only used when the GSD screening step is not run at the beginning.
    # type: ordinal (integer), qualitative (categorical), or quantitative (float)
    factors:

        # Factor 'ALEN' is varied between 0 and 5000
        ALEN:
            min: 0
            max: 5000
            low_init: 0
            high_init: 1500
            type: ordinal

        # Factor 'GLEN' is varied between -3000 and 3000
        GLEN:
            min: -3000
            max: 3000
            low_init: -500
            high_init: 500
            type: ordinal

        # Factor 'RRAT' is varied between 0.1 and 0.7
        RRAT:
            min: 0.1
            max: 0.7
            low_init: 0.3
            high_init: 0.5
            type: quantitative

        # Factor 'IDEN' is varied between 30 and 90
        IDEN:
            min: 30
            max: 90
            low_init: 60
            high_init: 80
            type: ordinal

    # Define the responses
    responses:

        # Response 'N50' represents the assembly quality in terms of contiguity
        # Should be maximized
        N50:
            criterion: maximize

# Set the working directory (all output goes here)
working_directory: "~/case_2/doepipeline_output"

# The name of the results file (containing the response values). This file should be structured like:
# RESPONSE1,VALUE
# RESPONSE2,VALUE
results_file: "results.txt"

# Specify the names of the different pipeline steps to be run
pipeline:
    - SSPACE
    - CalculateResponse

# The first pipeline step
# Here, SSPACE-LongRead is executed with varying values for the parameters -a, -g, -r, and -i.
SSPACE:
    
    # Define the script that should be run for this step. 
    script: >
        perl ~/case_2/bin/SSPACE-LongRead_v1-1/SSPACE-LongRead.pl \
            -c ~/case_2/data/case_2_FSC200_assembly.fasta.gz \
            -p ~/case_2/data/CEPA-francisella-FSC200_20171107.fastq \
            -t 2 \
            -a {% ALEN %} \
            -g {% GLEN %} \
            -r {% RRAT %} \
            -i {% IDEN %}

    # Specify the factors that are used in this script. Each {% FACTOR %}
    # placeholder is substituted with the value given by the experimental
    # design (see the rendered example after this file).
    factors:
        ALEN:
            substitute: true
        GLEN:
            substitute: true
        RRAT:
            substitute: true
        IDEN:
            substitute: true

    # If running through SLURM, specify the SLURM details you need.
    SLURM:
        A: PROJECT
        c: 2
        t: 00:40:00
        o: case_2_slurm.out
        e: case_2_slurm.error

# The second pipeline step
CalculateResponse:

    # Here, two commands are strung together: seqstats summarizes the scaffolds
    # produced by SSPACE-LongRead, and a custom script formats the response.
    # The seqstats path below assumes the install from example case 1; adjust
    # it to your own setup.
    script: >
        ~/case_1/bin/seqstats/seqstats PacBio_scaffolder_results/scaffolds.fasta \
            > SSPACE_output.scaffolds.fasta.seqstats && \
        python ~/case_2/bin/case_2_scoring.py \
            -f SSPACE_output.scaffolds.fasta.seqstats -o {% results_file %}
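To make the templating concrete: for a design point with ALEN=0, GLEN=-500, RRAT=0.3, and IDEN=60 (picked here from the initial ranges above purely for illustration; the actual design points are generated by doepipeline), the SSPACE step is rendered into an ordinary shell command:

perl ~/case_2/bin/SSPACE-LongRead_v1-1/SSPACE-LongRead.pl \
    -c ~/case_2/data/case_2_FSC200_assembly.fasta.gz \
    -p ~/case_2/data/CEPA-francisella-FSC200_20171107.fastq \
    -t 2 \
    -a 0 \
    -g -500 \
    -r 0.3 \
    -i 60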

Scripts

The following script (case_2_scoring.py) is used in the pipeline step CalculateResponse (see case_2.yaml above) to create the results file.

case_2_scoring.py

#!/usr/bin/env python

import argparse

parser = argparse.ArgumentParser(description='summarize responses from scaffolding')
parser.add_argument('-f', '--infile', help='result file from seqstats', required=True)
parser.add_argument('-o', '--outfile', help='doepipeline compatible result file', required=True)
args = parser.parse_args()

def read_result(result_file):
    """Store each row of the seqstats output as a key:value pair in a dictionary."""
    result_dict = {}
    with open(result_file, 'r') as f:
        for row in f:
            row = row.strip()
            if not row:
                continue
            # Drop the ' bp' unit suffix explicitly; rstrip(" bp") would strip
            # any trailing 'b', 'p', or space characters, not the literal suffix.
            if row.endswith(' bp'):
                row = row[:-3]
            key, value = row.split(':', 1)
            result_dict[key.strip()] = value.strip()
    return result_dict

def write_result(result_dict, outfile):
    """Write the response in the doepipeline results-file format (NAME,VALUE)."""
    with open(outfile, 'w') as f:
        f.write('N50,{}\n'.format(result_dict['N 50']))

result = read_result(args.infile)
write_result(result, args.outfile)
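The scoring script can be tested in isolation with a small mock report. The layout below follows the "name: value bp" rows that seqstats prints, but the numbers are made up:

# Create a mock seqstats report and run the scoring script on it
cat > test.seqstats << 'EOF'
Total n:        35
Total seq:      1893700 bp
N 50:           1799571 bp
EOF

python ~/case_2/bin/case_2_scoring.py -f test.seqstats -o results.txt
cat results.txt    # prints: N50,1799571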

Executing doepipeline

The following script (case_2.sbatch) is used to run doepipeline in a SLURM environment. Please make sure that the file paths and SLURM details are in line with your own setup.

case_2.sbatch

#!/bin/bash -l

#SBATCH -A PROJECT
#SBATCH -n 1
#SBATCH -t 06:00:00
#SBATCH -J case_2_doepipeline
#SBATCH -o case_2_doepipeline.out
#SBATCH -e case_2_doepipeline.error

set -e
set -x

# Changes here
_WORKDIR="${HOME}/case_2"  # note: a quoted "~" would not be expanded by bash
_OUTDIR="${_WORKDIR}/doepipeline_output"
_YAML="${_WORKDIR}/yaml/case_2.yaml"
_LOG="${_OUTDIR}/case_2.log"
_REDUCTION_FACTOR=8
_SHRINKAGE="0.9"

# Create output directory if needed
if [ ! -d "${_OUTDIR}" ]; then
  echo "Creating working directory: ${_OUTDIR}"
  mkdir -p "${_OUTDIR}"
fi

# Run doepipeline
doepipeline -e slurm -s ${_SHRINKAGE} -m greedy -l ${_LOG} ${_YAML}
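Assuming the batch script is saved as ~/case_2/case_2.sbatch, submit it and follow progress through the SLURM queue and the log files:

sbatch ~/case_2/case_2.sbatch
squeue -u $USER                  # check the job state
tail -f case_2_doepipeline.out   # follow doepipeline's stdout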