-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
294 lines (274 loc) · 16.4 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
============ CLUSTER PROTEINS
0. Annotate Enterobacter using prokka
prokka --outdir ~/Desktop/prokka --locustag ENC --genus Enterobacter --species cloacae --kingdom Bacteria --gram neg ~/EnTrI/data/fasta-genome/chromosome/ENC.fa
Convert the gbf file to embl using seqret:
seqret
Read and write (return) sequences
Input (gapped) sequence(s): ~/Desktop/prokka/ENC_02162017.gbf
output sequence(s) [fp929040.fasta]: embl::ENC_02162017.embl
1. Convert gff + fasta files to embl using seqret tool from EMBOSS package (If the fasta and gff files contain chromosome and plasmid sequences, separate them)
seqret -sequence CS17p1.fna -feature -fformat gff -fopenfile CS17p1.gff -osformat embl -auto
2. If the CDS IDs are not standard (like in CS17p1-4 in our case):
sed -i -e 's/CS17p1|SC|contig000001://g' CS17p1.embl
3. Remove locus tag "ERS227112_02466:gene" from "Klebsiella_pneumoniae_RH201207_v0.3.embl" embl file. This gene has two locus tags: "ERS227112_02466" and "ERS227112_02466:gene", which probably is a mistake in annotation, and the ":gene" at the end of this tag causes jackhmmer name this gene "gene" which is impossible to track.
4. python correct-annotations.py and manually correct the added genes.
5. Convert embl files to protein fasta files
cd EnTrI/bin
mkdir ../data/fasta-protein
mkdir ../data/fasta-protein/chromosome
for filename in ../data/embl/chromosome/*.embl; do embl2peptides.py $filename ../data/fasta-protein/chromosome/"$(basename $filename | cut -d. -f1).fasta"; done
6. Add SL3261 to the sequences by copying SL1344 and removing aroA, ycaL and cmk genes from it (Lars' NAR paper) and renaming sequences.
7. Make a fasta file with all sequences in it.
cd EnTrI/data/fasta-protein/chromosome
cat * > seqdb.fasta
8. Run homclust to cluster homologous proteins
cd EnTrI/bin
homclust.py ../data/fasta-protein/chromosome/Citrobacter_rodentium_ICC168_FN543502_v1.fasta ../data/fasta-protein/chromosome/seqdb.fasta ../results/homclust -ns 14 -s 5
9. Convert embl files to DNA fasta files
cd EnTrI/bin
mkdir ../data/fasta-dna
mkdir ../data/fasta-dna/chromosome
for filename in ../data/embl/chromosome/*.embl; do embl2dna.py $filename ../data/fasta-dna/chromosome/"$(basename $filename | cut -d. -f1).fasta"; done
10. Add SL3261 to the sequences by copying SL1344 and removing aroA, ycaL and cmk genes from it (Lars' NAR paper) and renaming sequences.
11. Make a fasta file with all sequences in it.
cd EnTrI/data/fasta-dna/chromosome
cat * > seqdb.fasta
============ RUN PHYLOSIFT & RAXML
1. cd EnTrI/results/phylosift
2. find ../../data/fasta-genome/chromosome/ -maxdepth 1 -name "*.fa" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \;
3. find ../../data/fasta-genome/chromosome/ -maxdepth 1 -name "*.fa" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \;
4.a. for f in PS_temp/*/alignDir/concat.codon.updated.1.fasta; do cat "$f" >> codon_alignment.fa; done
4.b. for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done
5. Run RaXML
a. raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100
b. raxmlHPC -s codon_alignment.fa -n phylosift-n.raxmlbootstrap -m GTRCAT -p 1234 -f a -x 1234 -# 100
============ RUN HIERANOID
1. Copy all the protein fasta sequences to program-bank/hieranoid/data/sequences and replace all '<>s with _prime().
2. Copy the species tree to program-bank/hieranoid/data/tree, root it, and add node names to it.
3. Edit program-bank/hieranoid/Configurations/Configuration.pm
4. cd program-bank/hieranoid/
5. perl hieranoid.pl
6. Copy n14.OGTree.txt to results/hieranoid and rename it to hieranoid-result.txt
7. python edit-hieranoid.py
8. python edit-hieranoid-with-inparalogs.py
============ FIND ESSENTIAL GENES
1. cd EnTrI/bin
2. Rscript calculate-insertion-index.R
3. Rscript calculate-insertion-index_not-trimmed.R
============ BENCHMARK DIFFERENT METHODS
1. Download the list of E.coli K-12 essential genes from ecogenes and save it as ecogene-k12.txt in results directory.
2. python find-k12-counterparts.py and put the results in results/ecogenecounterparts
3. Make monte-carlo directory in results and run Rscript montecarlo.R
4. Rscript benchmark-essentiality-calls.R
============ STUDY BIASES
1. Make logos for top 100 most frequent insertion sites:
mkdir ../results/logos
python3 make-logos-top-100.py
2. Generate a logo for the sequences resulted from the last step.
cd EnTrI/results/logos
weblogo -F pdf -A dna -f 100logos.txt -o ../../figures/100logo-prob.pdf -s large -U probability
Get nucleotide compositions from count-nucleotides.py
weblogo -F pdf -A dna -f 100logos.txt -o ../../figures/100logo-bits.pdf -s large --composition "{'A':23, 'C':27, 'G':27, 'T':23}"
3. python3 insertion-position-bias.py
4. Rscript insertion-position-bias.R
5. python3 check-biases.py
6. python3 check-biases_not-trimmed.py
7. Rscript check-biases-ii.R
8. Rscript compare-dbscan-gamma.R
9. Rscript compare-bias-predictions.R
============ EVOLUTION OF ESSENTIALITY
1. python3 define-core-accessory-hieranoid.py
python3 define-core-accessory-hieranoid-fasta.py (edit to change thresholds to 80,90, and 100)
python3 define-core-accessory-hieranoid-fasta-2.py --> used for generating table for important genes and testing wether they are core essential or not.
2. Make phylogenetic trees for the subsets of species (in code) and place them in bin/speciestrees, and bin/speciestrees-no-k12
# 3. python3 define-core-accessory-hieranoid-fdollop.py
# 4. python3 define-core-accessory-hieranoid-ancestralii.py
5. python3 define-core-accessory-hieranoid-fitch.py
cd EnTrI/results/define-core-accessory-hieranoid-fitch-cores/
mkdir alignments
for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done
mkdir profiles
cd alignments
for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done
cd ../profiles/
cat * > corefam.hmm
hmmpress corefam.hmm
Go to homology folder on cluster
qsub embl2peptide.sh
qsub run_hmmsearch.sh
qsub run_con-ess.sh
scp fas31@abacus:/data/people/fas31/homology/homologs.count /home/fatemeh/EnTrI/results/homologs.count
6. Rscript upsetr-hieranoid.R
7. Annotate the phylogenetic tree
8. Correct tree-data.R and run it
9. Generate a table for core essential and ancestrally essential genes
python3 table-core-genes.py
============ STUDY CLUSTERS
1. Plot cluster size distribution (This is not a general script and can only be used for this case)
cd EnTrI/bin
Rscript orfan-single-multiple.R
2. Run merge-clust-plot to add insertion indices to genes in clusters
cd EnTrI/bin
merge-clust-plot.py ../results/homclust/EFam-clusters/ ../results/biases/dbscan/ ../results/ecogene-k12.txt ../results/merge-clust-plot
3. Plot the insertion index distribution in different clusters (This is not a general script and can only be used for this case)
cd EnTrI/bin
Rscript clust2plot.R
4. Count multicopy genes that are essential when single copy:
Rscript multicopy_ess-when-sing-copy.R
============ ENRICHMENT
1. Word enrichment in beneficial losses
cd EnTrI/bin
python study-beneficial-losses.py
Rscript study-beneficial-losses.R
# 2. Word enrichment in low GC genes:
cd EnTrI/bin
python study-low-gc.py
Rscript study-low-gc.R
# 3. Codon enrichment in 5' end of essential genes:
cd EnTrI/bin
python essential5prime.py
Rscript essential5prime.R
4. KEGG analysis
cd EnTrI/bin
Rscript buildpathwayDB.R -o KEGG_organism_code -d path_to_save_database
Set pathanalysis.R parameters: essentiality class, pdf file name, plot main.
Rscript pathanalysis.R
Rscript pathanalysis-coregenes.R -> change between core genes and ancestral genes.
# 5. Difference between ancestralii, fdollop, and intersection
cd EnTrI/bin
python3 core-definition-method-difference.py
Rscript core-definition-method-difference.R
============ STUDY INTERESTING GENES
1. python interesting_genes.py
2. python mark-duplications.py run for both universally-conserved_always-essential.tsv and sometimes-essential.tsv
3. Rscript interestinggenes_heatmap.R
# 4. change the first bit of interestinggenes_heatmap (up to write.table) to add pathways to genes.
# ============ ARE CONSERVED GENES ESSENTIAL?
# 1. Download bacterial and archaeal core genes from:
# https://figshare.com/articles/Systematically_identify_phylogenetic_markers_at_different_taxonomic_levels_for_bacteria_and_archaea/722713
# paper:
# http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0077033
# 2. cd results/core-bacteria_and_archaea-genes
# for filename in *.hmm; do hmmpress $filename; done
# 3. for filename in *.hmm; do hmmscan --domtblout $filename.domtblout $filename ~/EnTrI/data/fasta-protein/chromosome/U00096.fasta; done
# 4. make a file containing all genes.
============ Ideal insertion density?
Run density-SL1344.R
============ ARE CONSERVED GENES MORE LIKELY TO BE ESSENTIAL?
1. Make HMM profiles for core genes and core essential genes
cd EnTrI/results/define-core-accessory-hieranoid-fitch-cores
mkdir alignments && mkdir profiles
for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done
cd alignments
for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done
cd ../profiles
cat * > corefam.hmm
hmmpress corefam.hmm
cd ..
mkdir hhmake
for filename in alignments/*; do hhmake -i $filename -a hhmake/hhmake.out -M 50; done
Do the same for EnTrI/results/define-core-accessory-hieranoid-fitch-core-essentials
2. Write cluster name, gene name, and pathways in the same file for ancestrally essential genes.
python which-genes.py
3. Follow what we did on cluster: /data/people/fas31/homology/readme.txt
4.
mkdir EnTrI/results/walking-hypergeometric-test
Make essentiality.txt
cd EnTrI/bin
Rscript gsea.R -d essentiality.txt -l conservation.out -o ~/EnTrI/results/walking-hypergeometric-test
5. Do the same for endosymbionts and DEG.
============ COMPARISON OF ANCESTRALLY ESSENTIAL GENES AND ENDOSYMBIONTS
1. Download Buchnera, Sodalis, Wigglesworthia, Baumannia Cicadellinicola, Blochmania fasta and embl files.
2. for filename in ~/EnTrI/results/endosymbionts/embl/*.txt; do embl2peptides.py $filename ~/EnTrI/results/endosymbionts/fasta-protein/"$(basename $filename | cut -d. -f1).fasta"; done
3. find /home/fatemeh/EnTrI/results/endosymbionts/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \;
4. find /home/fatemeh/EnTrI/results/endosymbionts/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \;
5. for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done
6. raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100
7. Get essential genes from DEG and divide the file into different species and then run Hieranoid after rooting and editing the species tree. Follow the steps for running hieranoid and editing output from above.
8. python3 define-core-accessory-hieranoid-endosymbionts.py
9.
cd EnTrI/results/endosymbionts/define-core-accessory-hieranoid/core-essential-genomes/
mkdir alignments && mkdir profiles
for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done
cd alignments
for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done
cd ../profiles
cat * > corefam.hmm
cd ..
mkdir hhmake
for filename in alignments/*; do hhmake -i $filename -o hhmake/"endo-$(basename $filename | cut -d. -f1).hmm" -M 50; done
============ COMPARISON OF ANCESTRALLY ESSENTIAL GENES AND DEG
1. Download DEG embl files and genomes (-enterobacteriaceae) and concat chromosomes.
2. find /home/fatemeh/EnTrI/results/deg/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \;
3. find /home/fatemeh/EnTrI/results/deg/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \;
4. for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done
5. raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100
6. Get essential genes from DEG and divide the file into different species and then run Hieranoid after rooting (for rooting separate gram negatives from gram positives in DEG) and editing the species tree. Follow the steps for running hieranoid and editing output from above.
7. python define-core-accessory-hieranoid-fitch-deg.py
python3 define-core-accessory-hieranoid-deg.py
8.
cd EnTrI/results/deg/define-core-accessory-hieranoid/core-essential-genomes/
mkdir alignments && mkdir profiles
for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done
cd alignments
for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done
cd ../profiles
cat * > corefam.hmm
cd ..
mkdir hhmake
for filename in alignments/*; do hhmake -i $filename -o hhmake/"deg-$(basename $filename | cut -d. -f1).hmm" -M 50; done
9. python venn-entero-deg-endo.py 1-3
10. python pathways-for-clusts.py
11. python venn-entero-deg-endo-4.py
12. python entero-deg-endo-heatmap-1.py
13. annotate the output with eggnog mapper
14. python entero-deg-endo-heatmap-2.py
15. Rscript entero-deg-endo-heatmap-3.R (use partition-seqdb.py in results/deg/fasta-proteins beforehand)
16. python entero-deg-endo-heatmap-4.py , 5, 6
17. Manually correct NAs from these files:
heatmap.tsv, U00096.fasta.mapper.annotations, final-clusters-plus-single-groups.fasta.emapper.annotations, final-clusters-plus-single-groups, and /home/fatemeh/EnTrI/results/define-core-accessory-hieranoid/all/core-essential-genomes
============ COMPARISON OF ACCESSORY ESSENTIAL GENES
accessory-essential-enterobacteriaceae_1..5
============ WHY DO WE HAVE INSERTIONS IN 5' END?
1. python 5prime-start-codon.py
============ COMPARE CORE ESSENTIAL AND ANCESTRALLY ESSENTIAL
1. ancestral-not-core.py
2. ancestral-not-core.R
============ RUN DCGO
1. hmmscan --cut_ga --domtblout ~/EnTrI/results/dcgo/seqdb_Pfam-A.domtblout -o /dev/null ~/data/pfam/Pfam-A.hmm ~/EnTrI/data/fasta-protein/chromosome/seqdb.fasta
2. dcgo.py and dcgo.R
============ ARE BENEFICIAL LOSSES COPIED IN PLASMIDS?
1. for filename in ../data/embl/new-plasmid/*.embl; do embl2peptides.py $filename ../data/fasta-plasmid/"$(basename $filename | cut -d. -f1).fasta"; done
2. change locus tags so that they are similar to chromosomal genomes, concat genes of plasmids for the same chromosome, and rename files.
3. cat ~/EnTrI/results/homclust/EFam-HMMs*.hmm > ~/EnTrI/results/beneficialloss-plasmid/all.hmm
4. hmmpress all.hmm
5. hmmscan -o /dev/null --domtblout ~/EnTrI/results/beneficialloss-plasmid/hmmscan.domtblout -E 1e-10 ~/EnTrI/results/beneficialloss-plasmid/all.hmm ~/EnTrI/data/fasta-plasmid/seqdb.fasta
6. python beneficialloss-plasmid.py
============ Generate giant table
#1. python giant-tab-1-del.py
#2.
cd results/giant-tab/clusters/
mkdir alignments && mkdir profiles
for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done
cd alignments
for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done
cd ../profiles
cat * > corefam.hmm
cd ..
mkdir hhmake
for filename in alignments/*; do hhmake -i $filename -o hhmake/"entcore-$(basename $filename | cut -d. -f1).hmm" -M 50; done
#1. python giant-tab-1.py
1. cd EnTrI/results/all-hieranoid/
find fasta-genome/ -maxdepth 1 -name "*.fa*" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \;
find fasta-genome/ -maxdepth 1 -name "*.fa*" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \;
for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done
raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100
2. Copy all the protein fasta sequences to program-bank/hieranoid/data/sequences and replace all '<>s with _prime().
3. Copy the species tree to program-bank/hieranoid/data/tree, root it, and add node names to it.
4. Edit program-bank/hieranoid/Configurations/Configuration.pm
5. cd program-bank/hieranoid/
6. perl hieranoid.pl
7. for all protein sequences run cat * > seqdb.fasta to have all sequences together
8. edit the addresses and run python3 edit-hieranoid.py
9. run annotate-clust.py and run eggnog-mapper on the output (cluster-representatives.txt) to find gene names for clusters.
10. run giant-tab-1.py, giant-tab-2.py and giant-tab-3.R