Updated : 11/04/2017
Version : 1.26
Release : https://github.com/atulkakrana/sPARTA/releases
Citation : Kakrana, A., Hammond, R., Patel, P., Nakano, M. & Meyers, B. C. sPARTA: a parallelized pipeline for integrated analysis of plant miRNA and cleaved mRNA data sets, including new miRNA target-identification software. Nucleic Acids Res. gku693– (2014). doi:10.1093/nar/gku693
Note: miRferno is merged with sPARTA. Target predictions can be performed directly through sPARTA.
small RNA-PARE Target Analyzer (sPARTA) is a tool that uses high-throughput sequencing data to profile genome-wide cleavage products. sPARTA includes a built-in, parallelized, genome-wide target prediction module for plant miRNAs called `miRferno`. sPARTA as a whole uses multi-core servers to achieve two-dimensional parallelization in order to maintain a low memory footprint, which is imperative for a full-genome analysis.
sPARTA requires bowtie2 in the PATH variable of the user account executing sPARTA.
bowtie2 - http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
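A quick way to confirm the PATH requirement before a long run is sketched below. It is not part of sPARTA; the helper name `find_on_path` is hypothetical, and the check relies only on the standard-library `shutil.which`.

```python
import shutil


def find_on_path(program):
    """Return the full path of `program` if it is on PATH, else None."""
    return shutil.which(program)


if __name__ == "__main__":
    location = find_on_path("bowtie2")
    if location is None:
        print("bowtie2 was not found in PATH; sPARTA needs it to be there.")
    else:
        print("bowtie2 found at", location)
```

Run it as the same user account that will execute sPARTA, since PATH can differ between accounts.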
sPARTA requires the following python3 packages to perform properly:
numpy - http://www.numpy.org/
scipy - http://www.scipy.org/
These may easily be installed using (python3) PIP. Instructions to install PIP - https://pip.pypa.io/en/stable/installing.html
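The package requirement can be checked from the python3 interpreter that will run sPARTA. The following is a minimal sketch, assuming the packages are importable under the names `numpy` and `scipy`; the helper `missing_packages` is hypothetical and not part of sPARTA.

```python
import importlib.util

# Packages named in the requirements above.
REQUIRED_PACKAGES = ("numpy", "scipy")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names that this interpreter cannot locate for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```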
1. If your PARE/Degradome libraries need pre-processing, i.e. adapter removal and trimming, use our pre-processing script from here: https://github.com/atulkakrana/preprocess.seq
2. sPARTA uses file extensions for identifying file types, naming meta-data and selectively cleaning up temp files. Therefore, it is recommended to use appropriate file extensions. For example, a genome/cDNA FASTA file should have the .fa extension. Please see the 'Arguments' section (below) for recommended file extensions.
3. Make sure that input FASTA files do not have integers in their names, for example test.1.fa or arabidopsis.new.2.4.fa. Files with such names are sometimes deleted during the cleanup operation.
4. All the input files, 1) miRNAs, 2) FASTA file for genome or transcripts and 3) degradome/PARE in tag-count format, should be in the same directory as the sPARTA script.
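Notes 3 and 4 can be checked before launching a run. The sketch below is not part of sPARTA, and both helper names are hypothetical. It assumes that "integers in name" means a dot-separated part of the file name made up only of digits, as in the two examples above; sPARTA's actual cleanup rule may be broader.

```python
from pathlib import Path


def has_integer_component(filename):
    """True if any dot-separated part of the file name is purely digits."""
    return any(part.isdigit() for part in Path(filename).name.split("."))


def check_inputs(paths):
    """Return a list of human-readable problems found in the input paths."""
    problems = []
    paths = [Path(p) for p in paths]
    for p in paths:
        if has_integer_component(p.name):
            problems.append(f"{p.name}: integer in file name")
    # All inputs are expected to sit in one directory (note 4).
    parents = {p.resolve().parent for p in paths}
    if len(parents) > 1:
        problems.append("input files are not all in the same directory")
    return problems


if __name__ == "__main__":
    for problem in check_inputs(["genome.fa", "miRNA.fa", "test.1.fa"]):
        print(problem)
```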
There are command line arguments that are to be used by sPARTA for proper execution. For the first execution, all steps must be performed, but once this has been completed, provided the miRNAs and genome are the same, the entire analysis will not need to be repeated. Examples of such cases are provided below.
-annoFile | GFF3 file for the species being analyzed corresponding to the genome assembly being used. Recommended file extension - '.gff3' |
-annoType | The annotation file format. Currently GFF and GTF formats are supported. This option is used with and corresponds to the annoFile option |
-genomeFile | Genome file in FASTA format that will be used to extract features (genic or intergenic regions) using GFF3 file. Recommended file extension - '.fa' |
-featureFile | FASTA file containing sequences of interest (CDS, transcript, intergenic regions etc.) if the user already has a set of sequences. This option is mutually exclusive with genomeFile and annoFile: either genomeFile along with annoFile is used, or the feature set is supplied directly. Recommended file extension - '.fa' |
-genomeFeature | 0 if prediction is to be done in genic region. 1 if prediction is to be done in intergenic region |
-miRNAFile | FASTA format of miRNA sequences. Recommended file extension - '.fa' |
-tarPred | Mode of target prediction. H for heuristic. E for exhaustive. H is default if no mode is specified |
-tarScore | Scoring mode for target prediction. S for seedless. N for normal. S is default if no mode is specified |
-libs | List of PARE library files in tag count format. Data can be easily converted into tag count format using tally |
-minTagLen | Minimum length of PARE tag. Tags shorter than minTagLen will be discarded. 20 is default |
-maxTagLen | Maximum length of PARE tag. Tags longer than maxTagLen will be chopped to the specified length. 30 is default |
-accel | Y to use balanced multiple process scheme or else specify the number of processors to be used. Y is default |
--tag2FASTA | Convert tag count file for PARE libraries to FASTA files for mapping |
--map2DD | Map the PARE reads to feature set |
--validate | Flag to perform the validation of the potential cleave sites from miRferno |
--repeats | Flag to include PARE reads from repetitive regions |
--noiseFilter | Flag to include all PARE validations with p-value of <=.5, irrespective of the noise to signal ratio at cleave site and category of PARE read. |
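The effect of -minTagLen and -maxTagLen on a single PARE tag can be illustrated as follows. This is a sketch of the behaviour described in the table, not sPARTA's own code; the function name is hypothetical, and keeping the start of a tag when chopping it is an assumption, since the table does not say which end is retained.

```python
def apply_tag_length_limits(tag, min_tag_len=20, max_tag_len=30):
    """Return the tag as it would be kept, or None if it is discarded.

    Defaults mirror the documented defaults of -minTagLen and -maxTagLen.
    """
    if len(tag) < min_tag_len:
        return None  # tags shorter than minTagLen are discarded
    # Assumption: chopping keeps the first max_tag_len bases.
    return tag[:max_tag_len]


if __name__ == "__main__":
    for tag in ("ACGT" * 4, "ACGT" * 6, "ACGT" * 10):
        print(len(tag), apply_tag_length_limits(tag))
```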
Both the `GFF3` file and corresponding genome `FASTA` file can be downloaded from Phytozome [http://www.phytozome.net/]
1. Execution on new genome/entirely new dataset
This execution should be performed any time a new genome file (along with the corresponding GFF file) is being analyzed:
python3 sPARTA.py -genomeFile <genomeFile.fa> -annoType <GTF/GFF> -annoFile <annotationfile> -genomeFeature <0/1> -miRNAFile <miRNAFile.fa> -libs <Lib_A.txt Lib_B.txt> -tarPred -tarScore --tag2FASTA --map2DD --validate
or
a user provided feature set (FASTA file with genes, intergenic regions or any other sequences of interest) is being analyzed:
python3 sPARTA.py -featureFile <featureFile.fa> -genomeFeature <0/1> -miRNAFile <miRNAFile.fa> -libs <Lib_A.txt Lib_B.txt> -tarPred -tarScore --tag2FASTA --map2DD --validate
2. Execution on a genome that has already been processed
This execution should be performed if a genome file has been processed previously but the miRNAs for which targets need to be predicted are new:
python3 sPARTA.py -genomeFeature <0/1> -miRNAFile <miRNAFile.fa> -libs <Lib_A.txt Lib_B.txt> -tarPred -tarScore --tag2FASTA --map2DD --validate
3. Execution on data in which genome and miRNA files have been previously processed
This execution should be performed if targets for a genome file have already been predicted using a miRNA file, but new PARE libraries need to be used for validation of earlier predicted targets:
python3 sPARTA.py -genomeFeature <0/1> -libs <Lib_C.txt Lib_D.txt> --map2DD --validate
4. Execution of miRferno, just for target prediction
This execution should be performed in case only predicted targets are required or PARE libraries are not available:
python3 sPARTA.py -genomeFile <genomeFile.fa> -annoType <GTF/GFF> -annoFile <annotationfile> -genomeFeature <0/1> -miRNAFile <miRNAFile.fa> -tarPred -tarScore
or
a user provided feature set (FASTA file with genes, intergenic regions or other sequences of interest) is being analyzed:
python3 sPARTA.py -featureFile <featureFile.fa> -genomeFeature <0/1> -miRNAFile <miRNAFile.fa> -tarPred -tarScore
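When the commands above are launched from a script, it can help to assemble them programmatically. The sketch below is a hypothetical helper, not something shipped with sPARTA. It uses only the options listed in the arguments table and passes -tarPred and -tarScore without values, as the examples above do.

```python
KNOWN_STEPS = {"tag2FASTA", "map2DD", "validate", "repeats", "noiseFilter"}


def build_sparta_command(*, genome_feature, genome_file=None, anno_type=None,
                         anno_file=None, feature_file=None, mirna_file=None,
                         libs=(), predict=False, steps=()):
    """Build an argument list in the style of the example commands above."""
    if feature_file and (genome_file or anno_file):
        raise ValueError("-featureFile excludes -genomeFile/-annoFile")
    if genome_file and not (anno_type and anno_file):
        raise ValueError("-genomeFile needs -annoType and -annoFile")
    unknown = set(steps) - KNOWN_STEPS
    if unknown:
        raise ValueError(f"unknown step flags: {sorted(unknown)}")

    cmd = ["python3", "sPARTA.py"]
    if feature_file:
        cmd += ["-featureFile", feature_file]
    elif genome_file:
        cmd += ["-genomeFile", genome_file,
                "-annoType", anno_type, "-annoFile", anno_file]
    cmd += ["-genomeFeature", str(genome_feature)]
    if mirna_file:
        cmd += ["-miRNAFile", mirna_file]
    if libs:
        cmd += ["-libs", *libs]
    if predict:
        cmd += ["-tarPred", "-tarScore"]  # defaults apply when no mode is given
    cmd += [f"--{step}" for step in steps]
    return cmd


if __name__ == "__main__":
    # Case 3: validate earlier predictions against new PARE libraries.
    cmd = build_sparta_command(genome_feature=0,
                               libs=["Lib_C.txt", "Lib_D.txt"],
                               steps=["map2DD", "validate"])
    print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the directory that holds sPARTA.py and the input files.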
1. PARE validation results for each library can be found in the output folder under its corresponding library name. The output folder also contains a combined result file, All.libs.validated.uniq.csv, from all the libraries. Results from all libraries were combined by removing redundant miRNA-target interactions with cleavage at the same site.
2. Target prediction results can be found in the 'predicted' folder under the name All.targs.parsed.csv
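The idea behind the combined file, one record per miRNA-target interaction and cleavage site across libraries, can be sketched as below. This is an illustration only, not sPARTA's implementation: the record keys `miRNA`, `target` and `cleave_site` are hypothetical names rather than the actual CSV column headers, and keeping the first record seen is an assumption.

```python
def combine_validated(per_library_results):
    """Merge per-library records, keeping one per (miRNA, target, site)."""
    seen = set()
    combined = []
    for records in per_library_results:
        for rec in records:
            key = (rec["miRNA"], rec["target"], rec["cleave_site"])
            if key not in seen:  # assumption: the first occurrence is kept
                seen.add(key)
                combined.append(rec)
    return combined


if __name__ == "__main__":
    lib_a = [{"miRNA": "miR-x", "target": "gene1", "cleave_site": 100}]
    lib_b = [{"miRNA": "miR-x", "target": "gene1", "cleave_site": 100},
             {"miRNA": "miR-x", "target": "gene1", "cleave_site": 250}]
    for rec in combine_validated([lib_a, lib_b]):
        print(rec)
```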
revFernoMap.py: This script generates a new file with genomic co-ordinates for predicted targets, i.e. targets in the All.targs.parsed.csv file under the predicted folder. It is neither part of sPARTA nor required for prediction and/or validation of targets. Instead, it might be useful for specific studies that need genomic co-ordinates for predicted targets.
Predicted targets can be huge in number, depending upon the size of the genome and the number of sRNAs used as query, so the resulting file is usually large. This script, which reverse maps predicted targets, makes use of parallel processing to return results (with genomic co-ordinates) in a reasonable time. To use the revFernoMap.py script, simply copy it into the "predicted" folder generated by sPARTA or miRferno during the target prediction step, and execute:
python3 revFernoMap.py
A successful run will create a new subfolder "revMapped" inside the "predicted" folder. This subfolder will contain the new file with predicted targets and their genomic co-ordinates.
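The copy-and-execute procedure for revFernoMap.py can be wrapped in a few lines of Python. The function below is a hypothetical convenience wrapper, not part of sPARTA. It assumes that a prior target prediction run has left All.targs.parsed.csv in the "predicted" folder, and it runs the script with the current interpreter (`sys.executable`) in place of a literal `python3`.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def run_rev_ferno_map(script_path, predicted_dir):
    """Copy revFernoMap.py into the 'predicted' folder and run it there."""
    predicted_dir = Path(predicted_dir)
    if not (predicted_dir / "All.targs.parsed.csv").is_file():
        raise FileNotFoundError(
            f"All.targs.parsed.csv not found in {predicted_dir}")
    copied = shutil.copy(script_path, predicted_dir)
    # Run from inside the folder, as the instructions above describe.
    subprocess.run([sys.executable, Path(copied).name],
                   cwd=predicted_dir, check=True)
    return predicted_dir / "revMapped"
```

A call such as `run_rev_ferno_map("revFernoMap.py", "predicted")` returns the path where the "revMapped" subfolder is expected after a successful run.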
Kakrana, A., Hammond, R., Patel, P., Nakano, M. & Meyers, B. C. sPARTA: a parallelized pipeline for integrated analysis of plant miRNA and cleaved mRNA data sets, including new miRNA target-identification software. Nucleic Acids Res. gku693– (2014). doi:10.1093/nar/gku693
Arikit, S., Xia, R., Kakrana, A., Huang, K., Zhai, J., Yan, Z., … Meyers, B. C. (2014). An Atlas of Soybean Small RNAs Identifies Phased siRNAs from Hundreds of Coding Genes. The Plant Cell. http://doi.org/10.1105/tpc.114.131847
Atul Kakrana
[email protected]
Reza Hammond
[email protected]