De novo assembly of a bacterial genome

Before you start

This example is run through SLURM on a Linux-based compute cluster.

If you want to follow this example, make sure that the following software is installed: doepipeline, the SRA Toolkit (for fasterq-dump), seqtk, ABySS, Fastaq, seqstats, and Python (for the scoring script).

Data preparation

The sequence data used for this case is available under BioProject accession PRJNA16087 (SRA run accession SRR518502).

Download sequence data from the SRA using the SRA-toolkit:

fasterq-dump --outdir ~/case_1/data SRR518502
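
With default settings, fasterq-dump splits a paired-end run into two files, so ~/case_1/data should now contain SRR518502_1.fastq and SRR518502_2.fastq; these are the files that are subsampled in the next step. A quick check:

ls ~/case_1/data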

Subsample the read set to approximately 100X coverage using seqtk:

cd ~/case_1/data
seqtk sample -s100 SRR518502_1.fastq 950000 > SRR518502_sample_1.fastq
seqtk sample -s100 SRR518502_2.fastq 950000 > SRR518502_sample_2.fastq
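
As a rough sanity check of the subsample size (assuming reads of about 100 bp, which is an assumption and not stated on this page): 2 × 950,000 reads at ~100 bp each is about 190 Mbp, and 190 Mbp divided by the 1,894,157 bp reference genome gives approximately 100X coverage. For example:

awk 'BEGIN { print (2 * 950000 * 100) / 1894157 }'   # prints roughly 100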

YAML configuration file

Please make sure that the file paths and SLURM details in the YAML file are in line with your own setup.

case_1.yaml

# Specify the experimental design
design:
    type: ccf

    # Each factor to vary is specified here.
    # Min/max: the smallest/largest values allowed
    # Low_init/high_init: the initial design space minimum and maximum.
    # These are only used when the GSD screening step is not run at the beginning.
    # Type: ordinal (integer), qualitative (categorical), quantitative (float)
    factors:

        # Factor 'KMER' is varied between 20 and 90
        KMER:
            min: 20
            max: 90
            low_init: 10
            high_init: 50
            type: ordinal

        # Factor 'MIKC' is varied between 2 and 15
        MIKC:
            min: 2
            max: 15
            low_init: 4
            high_init: 8
            type: quantitative

        # Factor 'MIAL' is varied between 20 and 60
        MIAL:
            min: 20
            max: 60
            low_init: 30
            high_init: 50
            type: ordinal

        # Factor 'MIPA' is varied between 5 and 15
        MIPA:
            min: 5
            max: 15
            low_init: 8
            high_init: 12
            type: ordinal

    # Define the responses
    responses:

        # Response 'tSeq' represents the total sequence length of the assembly (in bp)
        # Should be maximized
        # Min accepted value is 1,850,000 bp
        # The goal is the reference genome size (1,894,157 bp)
        tSeq:
            criterion: maximize
            low_limit: 1850000
            target: 1894157

        # Response 'nSeq' represents how many contigs are in the assembly
        # Should be minimized
        # Max accepted value is 100
        # Goal is to have 90 contigs
        nSeq:
            criterion: minimize
            high_limit: 100
            target: 90

        # Response 'N50' represents the assembly quality in terms of contiguity
        # Should be maximized
        # Should at least be 30,000 bp
        # Goal is 35,000 bp
        N50:
            criterion: maximize
            low_limit: 30000
            target: 35000

# Set the working directory (all output goes here)
working_directory: "~/case_1/doepipeline_output"

# The name of the results file (containing the response values). This file should be structured like:
# RESPONSE1,VALUE
# RESPONSE2,VALUE
results_file: "results.txt"

# Specify the names of the different pipeline steps to be run
pipeline:
    - ABYSS
    - CalculateResponse

# The first pipeline step
# Here, abyss-pe is executed whilst varying the four parameters k (KMER), c (MIKC), l (MIAL), and n (MIPA).
ABYSS:

    # Define the script that should be run for this step
    script: >
        abyss-pe \
            k={% KMER %} \
            c={% MIKC %} \
            l={% MIAL %} \
            n={% MIPA %} \
            j=14 \
            name=case_1 \
            in="~/case_1/data/SRR518502_sample_1.fastq ~/case_1/data/SRR518502_sample_1.fastq"

    # Specify the factors that are used in this script.
    # These factors are substituted in the script with the values given by the experimental design
    # (a rendered example command is shown after this file).
    factors:
        KMER:
            substitute: true
        MIKC:
            substitute: true
        MIAL:
            substitute: true
        MIPA:
            substitute: true

    # If running through SLURM, specify the SLURM details you need.
    SLURM:
        A: PROJECT
        n: 14
        t: 00:15:00
        o: case_1_slurm.out
        e: case_1_slurm.error

# The second pipeline step
CalculateResponse:

    # Define the script that should be run for this step
    script: >
        ~/case_1/bin/Fastaq/scripts/fastaq filter --min_length 1000 case_1-scaffolds.fa case_1_l1000-scaffolds.fasta;
        ~/case_1/bin/seqstats/seqstats case_1_l1000-scaffolds.fasta > case_1_l1000-scaffolds_seqstats.fasta;
        python ~/case_1/bin/case_1_scoring.py \
          -f case_1_l1000-scaffolds_seqstats.fasta \
          -o {% results_file %};
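
To make the factor substitution concrete: with the hypothetical design-point values KMER=55, MIKC=6, MIAL=30, and MIPA=8 (illustrative numbers only, not taken from an actual run), doepipeline would render the ABYSS step roughly as:

abyss-pe \
    k=55 \
    c=6 \
    l=30 \
    n=8 \
    j=14 \
    name=case_1 \
    in="~/case_1/data/SRR518502_sample_1.fastq ~/case_1/data/SRR518502_sample_2.fastq"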

Scripts

The following script (case_1_scoring.py) is used in the pipeline step CalculateResponse (see case_1.yaml) to create the result file.

case_1_scoring.py

#!/usr/bin/env python

import argparse

parser = argparse.ArgumentParser(description='summarize responses from assembly')
parser.add_argument('-f', '--infile', help='result file from seqstats', required=True)
parser.add_argument('-o', '--outfile', help='doepipeline compatible result file', required=True)
args = parser.parse_args()

def read_result(result_file):
    """put each row of the result file as a key:value pair of a dictionary"""
    result_dict = {}
    with open(result_file, 'r') as f:
        for row in f:
            row = row.strip()
            # remove the trailing " bp" unit (rstrip strips any trailing ' ', 'b' or 'p' characters)
            row = row.rstrip(" bp")
            key, value = row.split(":")
            value = value.strip()
            result_dict[key] = value
    return result_dict

def write_result(result_dict, outfile):
    with open(outfile, 'w') as f:
        f.write('tSeq,{}\n'.format(result_dict['Total seq']))
        f.write('nSeq,{}\n'.format(result_dict['Total n']))
        f.write('N50,{}\n'.format(result_dict['N 50']))

result = read_result(args.infile)
write_result(result, args.outfile)
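
As an illustration of the data flow (all numbers are made up): if the seqstats file passed via -f contained only the three lines used by the script, e.g.

Total seq: 1889000 bp
Total n: 95
N 50: 34000 bp

then running

python ~/case_1/bin/case_1_scoring.py -f case_1_l1000-scaffolds_seqstats.fasta -o results.txt

would write a results.txt in the format doepipeline expects:

tSeq,1889000
nSeq,95
N50,34000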

Executing doepipeline

The following script (case_1.sbatch) is used to run doepipeline in a SLURM environment. Please make sure that the file paths and SLURM details are in line with your own setup.

case_1.sbatch

#!/bin/bash -l

#SBATCH -A PROJECT
#SBATCH -n 1
#SBATCH -t 05:00:00
#SBATCH -J case_1_doepipeline
#SBATCH -o case_1_doepipeline.out
#SBATCH -e case_1_doepipeline.error

set -e
set -x

# Adjust the paths and settings below to match your own setup.
# Note: "~" is not expanded inside quotes, so ${HOME} is used instead.
_WORKDIR="${HOME}/case_1"
_OUTDIR="${_WORKDIR}/doepipeline_output"
_YAML="${_WORKDIR}/yaml/case_1.yaml"
_LOG="${_OUTDIR}/case_1.log"
_SCREENING_REDUCTION="8"
_SHRINKAGE="0.9"
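# (In broad terms: the shrinkage factor controls how much the design space is
# reduced between optimization iterations, and the screening reduction factor
# controls how strongly the initial GSD screening design is reduced; see
# `doepipeline --help` for the exact meaning of -s and --screening_reduction.)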


# Create output directory if needed
if [ ! -d "${_OUTDIR}" ]; then
  echo "Creating working directory: ${_OUTDIR}"
  mkdir "${_OUTDIR}"
fi

# Run doepipeline
doepipeline -e slurm -s ${_SHRINKAGE} --screening_reduction ${_SCREENING_REDUCTION} -m greedy -l ${_LOG} ${_YAML}
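
The batch script is then submitted to the SLURM queue in the usual way (the path depends on where you saved it):

sbatch case_1.sbatch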