sPARTA: small RNA-PARE Target Analyzer
Version 1.20, updated 09/30/2016
======== Description ========
small RNA-PARE Target Analyzer (sPARTA) is a tool that uses
high-throughput sequencing data to profile genome-wide cleavage products.
sPARTA begins with 'miRferno', a built-in, parallelized target prediction
module for plant miRNAs. sPARTA as a whole uses multi-core servers to
achieve two-dimensional parallelization in order to maintain a low memory
footprint, which is essential for a full-genome analysis.
======== Dependencies ========
sPARTA requires bowtie2 to be in the PATH variable of the user account executing sPARTA.
bowtie2 may be downloaded here: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
sPARTA requires the following Python 3 packages:
numpy - http://www.numpy.org/
scipy - http://www.scipy.org/
These can be installed using pip. Instructions to install pip: https://pip.pypa.io/en/stable/installing.html
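The dependency requirements above can be checked before a run. The following is a minimal sketch, not part of sPARTA; the function name missing_dependencies is hypothetical, and it only tests that bowtie2 is visible in PATH and that numpy and scipy can be located by Python.

```python
import importlib.util
import shutil


def missing_dependencies(executables=("bowtie2",), packages=("numpy", "scipy")):
    """Return a list of names that could not be found on this system."""
    missing = []
    for exe in executables:
        # shutil.which searches the directories listed in PATH
        if shutil.which(exe) is None:
            missing.append(exe)
    for pkg in packages:
        # find_spec returns None when a top-level package is not installed
        if importlib.util.find_spec(pkg) is None:
            missing.append(pkg)
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing dependencies: " + ", ".join(missing))
    else:
        print("bowtie2, numpy and scipy were all found.")
```

The check does not verify versions; it only reports names that are absent.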
========= Note ===========
1. sPARTA uses file extensions to identify file types, to name meta-data and to
selectively clean up temp files. Therefore, it is recommended to use appropriate file extensions.
For example, a genome/cDNA FASTA file should have the '.fa' extension.
Please see the 'Arguments' section (below) for recommended file extensions.
2. Make sure that input FASTA file names do not contain integers, e.g. test.1.fa or
arabidopsis.new.2.4.fa. Files with such names are sometimes deleted during the cleanup operation.
3. All the input files, i.e. 1) miRNAs, 2) FASTA file for genome or transcripts and
3) degradome/PARE libraries in tag-count format, should be in the same directory as the sPARTA script.
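Notes 2 and 3 can be checked with a short script before running sPARTA. This is a sketch under assumptions: based on the two example names given above, "integers in name" is read here as a dot-separated part made only of digits, and the helper names has_integer_part and check_inputs are hypothetical and not part of sPARTA.

```python
from pathlib import Path


def has_integer_part(filename):
    """True if any dot-separated part of the file name is purely digits."""
    return any(part.isdigit() for part in Path(filename).name.split("."))


def check_inputs(paths, script_dir="."):
    """Return a list of human-readable problems found in the input paths."""
    problems = []
    script_dir = Path(script_dir).resolve()
    for p in map(Path, paths):
        # Assumption: only all-digit name parts (as in test.1.fa) are risky
        if has_integer_part(p):
            problems.append(f"{p.name}: contains an integer-only name part")
        # Note 3: inputs should sit next to the sPARTA script
        if p.resolve().parent != script_dir:
            problems.append(f"{p.name}: not in the sPARTA script directory")
    return problems


if __name__ == "__main__":
    for problem in check_inputs(["test.1.fa", "genome.fa"]):
        print(problem)
```

An empty list means no problem was detected under this reading of the notes; it is not a guarantee that cleanup will leave every file untouched.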
======== Execution ========
sPARTA is controlled through command line arguments. For the first execution,
all steps must be performed. Once this has been completed, and provided the
miRNAs and genome are the same, the entire analysis does not need to be
repeated. Examples of such cases may be seen below.
======= Arguments ========
annoFile        GFF3 or GTF file for the species being analyzed, corresponding
                to the genome assembly being used. Recommended file
                extension - '.gff3' or '.gtf'
annoType        The annotation file format. Currently GFF and GTF are
                supported. This option is used with and corresponds to
                the annoFile option
genomeFile      Genome file in FASTA format that will be used to extract
                features (genic or intergenic regions) using the annotation file.
                Recommended file extension - '.fa'
featureFile     FASTA file containing sequences of interest (CDS, transcript,
                intergenic regions etc.) if the user already has a set of
                sequences. This option is mutually exclusive with genomeFile and
                annoFile: either genomeFile along with annoFile is used, or the
                feature set is supplied directly. Recommended file extension - '.fa'
genomeFeature   0 if prediction is to be done in genic regions. 1 if prediction
                is to be done in intergenic regions
miRNAFile       FASTA file of miRNA sequences. Recommended file extension - '.fa'
tarPred         Mode of target prediction. H for heuristic. E for exhaustive.
                H is default if no mode is specified
tarScore        Scoring mode for target prediction. S for seedless. N for
                normal. S is default if no mode is specified
libs            List of PARE library files in tag count format. Data can be
                easily converted into tag count format using *********
tagLen          Minimum length of PARE tag; tags longer than tagLen will be
                chopped to the specified length. 20 is default
--tag2FASTA     Convert tag count files for PARE libraries to FASTA files for
                mapping
--map2DD        Map the PARE reads to the feature set
--validate      Flag to perform the validation of the potential cleave sites
                from miRferno
--repeats       Flag to include PARE reads from repetitive regions
--noiseFilter   Flag to include all PARE validations with p-value of <=.5,
                irrespective of the noise to signal ratio at the cleave site and
                the category of PARE read
accel           Y to use the balanced multiple process scheme, or else specify the
                number of processors to be used. Y is default
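The tagLen behaviour described above can be illustrated with a small sketch. It is not taken from the sPARTA source: the function name chop_tag is hypothetical, trimming is assumed to keep the first tagLen bases, and tags shorter than tagLen are assumed to be skipped because tagLen is described as a minimum length.

```python
def chop_tag(tag, tag_len=20):
    """Return the tag trimmed to tag_len, or None if it is shorter than tag_len."""
    if len(tag) < tag_len:
        # Assumption: tags below the minimum length are skipped
        return None
    # Assumption: longer tags keep their first tag_len bases
    return tag[:tag_len]


if __name__ == "__main__":
    for tag in ("ACGTACGTACGTACGTACGTACGT", "ACGTACGT"):
        print(tag, "->", chop_tag(tag))
```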
======== Genome and Annotation Data ========
Both the GFF3 file and corresponding genome FASTA file can be downloaded from
Phytozome [http://www.phytozome.net/]
======== Examples ========
1. Execution on new genome/entirely new dataset
This execution should be performed any time a new genome file (along with corresponding GFF3 or GTF file) is being analyzed:
python3 sPARTA.py -genomeFile <genomeFile.fa> -annoType <GTF/GFF> -annoFile <annotationfile> -genomeFeature <0/1> -miRNAFile <miRNAFile.fa> -libs <Lib_A.txt Lib_B.txt> -tarPred -tarScore --tag2FASTA --map2DD --validate
OR
a user provided feature set (FASTA file with sequences of interest) is being analyzed:
python3 sPARTA.py -featureFile <featureFile.fa> -genomeFeature <0/1> -miRNAFile <miRNAFile.fa> -libs <Lib_A.txt Lib_B.txt> -tarPred -tarScore --tag2FASTA --map2DD --validate
2. Execution on a genome that has already been processed
This execution should be performed if a genome file has been processed previously but the miRNAs for which targets need to be predicted are new:
python3 sPARTA.py -genomeFeature <0/1> -miRNAFile <miRNAFile.fa> -libs <Lib_A.txt Lib_B.txt> -tarPred -tarScore --tag2FASTA --map2DD --validate
3. Execution on data in which genome and miRNA files have been previously processed
This execution should be performed if targets for a genome file have already been predicted using a miRNA file, but new PARE libraries need to be used for validation of earlier predicted targets:
python3 sPARTA.py -genomeFeature <0/1> -libs <Lib_C.txt Lib_D.txt> --map2DD --validate
4. Execution of 'miRferno', just for target prediction
This execution should be performed in case only predicted targets are required or PARE libraries are not available:
python3 sPARTA.py -genomeFile <genomeFile.fa> -annoType <GTF/GFF> -annoFile <annotationfile> -genomeFeature <0/1> -miRNAFile <miRNAFile.fa> -tarPred -tarScore
OR
a user provided feature set (FASTA file with sequences of interest) is being analyzed:
python3 sPARTA.py -featureFile <featureFile.fa> -genomeFeature <0/1> -miRNAFile <miRNAFile.fa> -tarPred -tarScore
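When sPARTA is launched from another Python script, the command line for a full first-time run (example 1) can be assembled as a list. This is a minimal sketch that only uses the arguments shown in the examples above; the function name build_sparta_command and the file names in the usage section are hypothetical placeholders.

```python
def build_sparta_command(genome_file, anno_type, anno_file, genome_feature,
                         mirna_file, libs, script="sPARTA.py"):
    """Assemble the argument list for a first-time, full sPARTA run."""
    return ["python3", script,
            "-genomeFile", genome_file,
            "-annoType", anno_type,
            "-annoFile", anno_file,
            "-genomeFeature", str(genome_feature),
            "-miRNAFile", mirna_file,
            "-libs", *libs,              # one entry per PARE library
            "-tarPred", "-tarScore",     # no value given: defaults are used
            "--tag2FASTA", "--map2DD", "--validate"]


if __name__ == "__main__":
    # Placeholder file names for illustration only
    cmd = build_sparta_command("genome.fa", "GFF", "genes.gff3", 0,
                               "miRNAs.fa", ["Lib_A.txt", "Lib_B.txt"])
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that holds sPARTA.py and the input files.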
======== Output ==========
1. PARE validation results for each library can be found in the 'output' folder
under its corresponding library name. The 'output' folder also contains a combined result file (AllLibValidatedUniq.csv) from all the libraries.
Results from all libraries are combined by removing redundant miRNA-target interactions with cleavage at the same site.
2. Target prediction results can be found in the 'predicted' folder under the name
'All.targs.parsed.csv'
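The combined validation file can be loaded for downstream analysis with the standard csv module. This is a sketch under assumptions: the file is assumed to be comma-separated with a header row, the column names are not documented here and are therefore not used, and the function name load_validated is hypothetical.

```python
import csv
from pathlib import Path


def load_validated(output_dir="output", name="AllLibValidatedUniq.csv"):
    """Read the combined validation file into a list of dictionaries."""
    path = Path(output_dir) / name
    with path.open(newline="") as handle:
        # DictReader takes the column names from the first line of the file
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    rows = load_validated()
    print(f"{len(rows)} validated interactions loaded")
    if rows:
        print("Columns:", ", ".join(rows[0].keys()))
```

Printing the column names first is a quick way to see which fields are available before filtering.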
===== Other scripts ======
revFernoMap.py : This script generates a new file with genomic co-ordinates for predicted targets, i.e. targets in the 'All.targs.parsed.csv' file under the 'predicted' folder. It is neither part of sPARTA nor required for prediction and/or validation of targets. Instead, it may be useful for specific studies that need genomic co-ordinates for predicted targets.
The number of predicted targets can be huge, depending upon the size of the genome and the number of sRNAs used as query, so the resulting file is usually large. The script uses parallel processing to reverse map predicted targets and return results (with genomic co-ordinates) in a reasonable time. To use the "revFernoMap.py" script, simply copy it into the "predicted" folder generated by sPARTA or miRferno during the target prediction step, and execute:
python3 revFernoMap.py
A successful run will create a new subfolder "revMapped" inside the "predicted" folder. The new file in this subfolder will have predicted targets with genomic co-ordinates.
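The copy-and-execute step can also be scripted. The following is a minimal sketch, not part of sPARTA; the function name run_rev_map and its dry_run option are hypothetical, and it assumes revFernoMap.py is not already inside the "predicted" folder.

```python
import shutil
import subprocess
from pathlib import Path


def run_rev_map(script="revFernoMap.py", predicted_dir="predicted", dry_run=False):
    """Copy the script into the 'predicted' folder and run it from there."""
    predicted = Path(predicted_dir)
    if not predicted.is_dir():
        raise FileNotFoundError(f"{predicted} not found; run target prediction first")
    target = predicted / Path(script).name
    shutil.copy(script, target)
    cmd = ["python3", target.name]
    if not dry_run:
        # cwd makes the script run inside the 'predicted' folder
        subprocess.run(cmd, cwd=predicted, check=True)
    return cmd


if __name__ == "__main__":
    run_rev_map()
```

With dry_run=True the script is copied but not executed, which is useful for checking paths first.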
======= Contact ===========
Atul Kakrana
Reza Hammond
===== END of README =======