high-throughput BLAST

Here, I explain how to parallelize sequence similarity searches using BLAST (or any similar tool) if you have access to an HPC cluster or a large workstation.


Note: This is an updated version that works with Snakemake v8 and higher.

Idea

Instead of performing thousands of BLAST searches sequentially, if you have access to an HPC cluster or a computer with plenty of RAM and CPU cores, you can parallelize the process and reduce the time needed. For example, it is possible to BLAST 44,092 protein sequences in around 48 hours and 5 minutes (this depends heavily on your resources, sequence lengths, database size, etc., so treat it only as an estimate). If each BLAST query took 5 seconds and you ran them one after another, the same number of sequences would take about 61 hours and 13 minutes (44,092 × 5 s ≈ 220,460 s), roughly 27% longer. Note that this test was run with a split size of -s 1 (one sequence per file; see step 3), which is not the best choice; I suggest -s 100 or similar values. The idea is to split the input sequence file into multiple pieces and run the BLAST search simultaneously on each of the split files.

Requirements

In this guide, I assume you are using an HPC cluster with Slurm as the workload manager.

Installation

You can set it up however you like. I assume you have conda installed and explain everything from that point on.

Install Snakemake, BLAST, and SeqKit using mamba (which is very similar to conda). For Snakemake v8 and higher, also install the cluster-generic executor plugin so Snakemake can submit jobs to the cluster.

conda create -n high_throughput_blast -c conda-forge mamba
conda activate high_throughput_blast
mamba install -c conda-forge -c bioconda -n high_throughput_blast snakemake blast seqkit
pip install snakemake-executor-plugin-cluster-generic
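
To confirm the environment works, you can check that each tool is available (a quick sanity check; version numbers will differ on your system).

conda activate high_throughput_blast
snakemake --version
blastp -version
seqkit version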

Protocol

  1. I assume you have already moved the protein sequence file you want to use as the query to the server. You also have a set of sequences that you want to use as the local database. Here, I use Arabidopsis thaliana [1] protein sequences to build the database and Mesotaenium endlicherianum [2] protein sequences as my query. If the sequence IDs contain extra information besides the IDs (separated by spaces or other delimiters), please remove it, since it sometimes causes errors. You can do that via a simple regex (see the sketch below).
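
For example, the following sed one-liner keeps only the ID on each header line by dropping everything after the first space (a minimal sketch; input.fasta and input_clean.fasta are placeholder names).

# Remove everything after the first space on FASTA header lines.
sed '/^>/ s/ .*//' input.fasta > input_clean.fasta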

Feel free to change the sequence type, database, query, BLAST settings, etc.

  2. Create a local BLAST database. You can do it via a batch script like this.
#!/bin/bash
#SBATCH -p medium # this is a partition I want to use. If you do not need it remove this.
#SBATCH -C scratch # this is a filesystem I want to access. 
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=16G
#SBATCH -t 02:00:00
#SBATCH -o job-%J.out

# You should give the absolute path to your fasta file to make a local database,
# or execute this script in the same folder as the fasta file if you prefer relative paths.
makeblastdb -in Athaliana_447_Araport11.protein_primaryTranscriptOnly.fa -dbtype prot
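
To verify that the database was built correctly, blastdbcmd can print its summary (a quick check; the -db value must match the fasta file you passed to makeblastdb).

blastdbcmd -db Athaliana_447_Araport11.protein_primaryTranscriptOnly.fa -info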
  3. Split the query fasta file into many pieces using SeqKit.
#!/bin/bash
#SBATCH -p medium # this is a partition I want to use. If you do not need it remove this.
#SBATCH -C scratch # this is a filesystem I want to access. 
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=16G
#SBATCH -t 02:00:00
#SBATCH -o job-%J.out

# With -s you can define how many sequences are allowed per file after splitting. Here, each split file holds up to 100 sequences.
seqkit split2 Me1_v2.release.gff3.pep.fasta -s 100

Let's assume I renamed the output folder that contains all the split files to splited.
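
As a sanity check, you can count the split files and inspect their sequence numbers (a small sketch, assuming the folder is named splited as above).

# Number of split files.
ls splited/*.fasta | wc -l
# Per-file sequence statistics.
seqkit stats splited/*.fasta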

  4. Create a file named Snakefile and add these lines to it.
import os
# Please replace `PATH_TO_THE_FOLDER` with the proper path in the line below.
project_dir = "PATH_TO_THE_FOLDER"
PEPS, = glob_wildcards(os.path.join(project_dir, "splited/{pep}.fasta"))
# If you used a different fasta file for your local database, update the line below.
DB = os.path.join(project_dir, "Athaliana_447_Araport11.protein_primaryTranscriptOnly.fa")
######################################################################################################
######################################################################################################

rule all:
    input:
        expand(os.path.join(project_dir, "splited/{pep}.outfmt6"),pep=PEPS)

rule run_blast:
    input:
        pep = os.path.join(project_dir, "splited/{pep}.fasta"),
        database = DB
    output:
        os.path.join(project_dir, "splited/{pep}.outfmt6")
    threads: 15
    log:
        os.path.join(project_dir, "splited/logs/{pep}_blast.log")
    shell:
        """
        blastp -db {input.database} -query {input.pep} -outfmt 6 -out {output} -num_threads {threads} -evalue 1e-7 -max_target_seqs 50 2> {log}
        """
  5. Create a file named config.yaml and add these lines. Put the file into a folder named profile.
__default__:
  time: "04:00:00"
  partition: "medium"
  nodes: 1
  ntasks: 1
  ncpus: 1
  memory: "8G"
  node_properties: scratch
  job-name: "{rule}"
  output: "slurm-%j-%x.out"
  error: "slurm-%j-%x.err"
run_blast:
  time: "02:00:00"
  memory: "32G"
  ncpus: 17
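
For example, assuming you wrote config.yaml in the current folder, you can create the profile folder like this (the folder path is what you will pass to --profile in the next step).

mkdir -p profile
mv config.yaml profile/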
  6. Create a bash file with a name you like and add these lines.
#!/bin/bash
#SBATCH -t 48:00:00
#SBATCH -c 2
#SBATCH --mem-per-cpu=3G
#SBATCH -o outfile-%J
#SBATCH -C scratch
#SBATCH --mail-type=BEGIN,END
#SBATCH [email protected] # if you like to get an email when the process starts and ends.

# You can change the parameter after -j to increase or decrease the maximum number of jobs that are active simultaneously. Change the path after --profile to the folder that contains config.yaml.
snakemake -j 100 -p --software-deployment-method conda --executor cluster-generic --cluster-generic-submit-cmd "sbatch" --profile /absolute/path/to/profile/
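
Then submit the wrapper script itself with sbatch (assuming you saved it as run_blast_pipeline.sh; the file name is a placeholder).

sbatch run_blast_pipeline.sh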
  7. Move to the output folder and concatenate all outputs into one single file (the results are tabular outfmt 6, not fasta):
# If you rerun this, remove any previous blast_results.outfmt6 first, since it also matches the glob.
cat *.outfmt6 > blast_results.outfmt6
  8. You can get the best BLAST hit per query with awk. BLAST tabular output lists the hits for each query from best to worst, so keeping only the first line per query ID yields the best hit.
awk '!x[$1]++' blast_results.outfmt6 > blast_results_best_blast_hit.outfmt6
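
If you prefer not to rely on the output order, you can sort explicitly by bit score (column 12 of outfmt 6) before keeping the first line per query. This is an alternative sketch, not part of the original protocol.

# Sort by query ID, then bit score descending, then keep the first line per query.
sort -k1,1 -k12,12gr blast_results.outfmt6 | awk '!x[$1]++' > blast_results_best_blast_hit.outfmt6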

References

  1. Arabidopsis thaliana protein sequences (Araport11): https://phytozome-next.jgi.doe.gov/info/Athaliana_Araport11
  2. Mesotaenium endlicherianum protein sequences: https://mesotaenium.uni-goettingen.de/index.html
