OrganelleRef_PBA is a script that performs de novo PacBio assembly of any organelle genome (chloroplast or mitochondrial) using several programs.
The different steps are:
- Search for the PacBio organelle reads by sequence homology, using BlasR against a related organelle genome. An organelle sequence from the same genus or family is preferred, but if the organelle sequence coverage is high (>100X), a reference from the same taxonomic order can be used.
- Assembly of the PacBio reads using Sprai. Sprai is a pipeline that uses WGS-Assembler, but it compares the reads against each other before performing the assembly, in order to take the best 20X.
- If the fraction of the Sprai assembly is below a given ratio, the script performs a rescaffolding using the whole PacBio read set. Otherwise, it skips this step.
- Taking the longest sequence, the script looks for the origin of the organelle by comparing the sequence with the reference. It also checks whether there is an overlapping region produced by the circularity of the organelle.
- The script checks whether the repeats have been misassembled: it breaks the assembly into LSC, SSC and IR, and tries to put them back together, removing possible misassembled fragments.
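The >100X coverage criterion from the first step can be estimated with simple arithmetic: total read bases divided by the reference length. The sketch below is only an illustration and is not part of OrganelleRef_PBA. The function name estimate_coverage is hypothetical, and it assumes a FASTQ file with exactly four lines per record that already contains only organelle reads (for example, reads selected by BlasR). Running it on a whole-genome read set would overestimate the organelle coverage.

```shell
# Hypothetical helper (not part of OrganelleRef_PBA).
# Usage: estimate_coverage organelle_reads.fastq MyReferenceCHL.fasta
# Prints the approximate coverage as total read bases / reference length.
estimate_coverage() {
    reads=$1
    ref=$2
    # Reference length: count residues on all non-header FASTA lines.
    ref_len=$(awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$ref")
    # Assumes 4-line FASTQ records: the sequence is line 2 of each record.
    awk -v len="$ref_len" '
        NR % 4 == 2 { gsub(/\r/, ""); bases += length($0) }
        END {
            if (len > 0) printf "%.1f\n", bases / len
            else print "NA"
        }' "$reads"
}
```

A value well above 100 suggests that a reference from the same taxonomic order may be enough. With lower values, a reference from the same genus or family is the safer choice.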
The script depends on the following programs:
- BioPerl -- (used to process sequences)
- Seqtk -- (used to change formats fastq/fasta)
- BlastN -- (used for the assembly, to find the origin and to check circularity)
- BlasR -- (used to get the organelle related reads)
- Samtools -- (used to process BlasR output for coverage)
- Bedtools -- (used to calculate coverage for the repeat analysis)
- Sprai -- (used for de novo assembly)
- WGS-Assembler -- (used for de novo assembly by Sprai)
- SSPACE-Long -- (used for the rescaffolding)
Note: SSPACE-Long uses getopt, which is not present in the Perl 5 core libraries. To fix this problem, install Perl4::CoreLibs from CPAN (cpan Perl4::CoreLibs).
Most of these programs can be installed from package repositories (e.g. Blast).
To install the program, clone the repository:
git clone https://github.com/aubombarely/Organelle_PBA.git
Once the repository is cloned, set the following environment variables if the binaries of these programs are not in your PATH:
export BLASR_PATH=<path_to_BlasR_binary>;
export SAMTOOLS_PATH=<path_to_samtools_binaries>;
export SPRAI_PATH=<path_to_Sprai_scripts>;
export BLAST_PATH=<path_to_blast_binaries>;
export CA_PATH=<path_to_WGS-assembler_binaries>;
export SSPACELONG_PATH=<path_to_SSPACE-Long.pl_script>;
export BEDTOOLS_PATH=<path_to_bedtools_binaries>;
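Before exporting these variables, it can help to see which programs are already reachable through the PATH. The following is a minimal sketch, not something shipped with OrganelleRef_PBA: report_missing_tools is a hypothetical helper, and the command names in the usage comment (blasr, samtools, blastn, bedtools, seqtk) are the usual binary names, which may differ on your system.

```shell
# Hypothetical helper (not part of OrganelleRef_PBA).
# Usage: report_missing_tools blasr samtools blastn bedtools seqtk
# Prints every command name that is not found in PATH.
# Returns 0 if all commands were found, 1 otherwise.
report_missing_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not in PATH: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For every program reported as missing, set the matching variable from the list above (for example BLASR_PATH for BlasR).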
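To run the script, create an output directory and pass the PacBio reads (-i), the reference organelle sequence (-r) and the output directory (-o):
mkdir chloro_out
OrganelleRef_PBA -i MySpeciesPacBio.fastq -r MyReferenceCHL.fasta -o chloro_out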
You can test the script with the test data, a subset of the Arabidopsis thaliana PacBio data publicly available at SRA under the accession SRR1284093.
gunzip artha_pacbioSRR1284093_c025k.fastq.gz
gunzip artha_refchl01_artha.fa.gz
mkdir artha_chl
OrganelleRef_PBA -i artha_pacbioSRR1284093_c025k.fastq -r artha_refchl01_artha.fa -o artha_chl
Note: To speed up the process, you can use multiple threads by passing options such as -b '-nproc=40' -s 'num_threads=40'.
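To keep the two thread settings in sync, the command line can be built from a single thread count. This is only a sketch: organelle_pba_cmd is a hypothetical helper that prints the command instead of running it, and the -b and -s values are copied verbatim from the note above, so check their exact meaning against the script's own help.

```shell
# Hypothetical helper (not part of OrganelleRef_PBA).
# Usage: organelle_pba_cmd READS.fastq REFERENCE.fasta OUTDIR [THREADS]
# Prints (does not run) a command line that passes the same thread count
# to both -b and -s. THREADS defaults to 4, an arbitrary choice.
organelle_pba_cmd() {
    reads=$1
    ref=$2
    outdir=$3
    threads=${4:-4}
    printf "OrganelleRef_PBA -i %s -r %s -o %s -b '-nproc=%s' -s 'num_threads=%s'\n" \
        "$reads" "$ref" "$outdir" "$threads" "$threads"
}
```

For the test data, organelle_pba_cmd artha_pacbioSRR1284093_c025k.fastq artha_refchl01_artha.fa artha_chl 8 prints the test command with 8 threads for both options. Review the printed line before running it.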