-
Notifications
You must be signed in to change notification settings - Fork 0
Gardner-BinfLab/EnTrI
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
============ CLUSTER PROTEINS 0. Annotate Enterobacter using prokka prokka --outdir ~/Desktop/prokka --locustag ENC --genus Enterobacter --species cloacae --kingdom Bacteria --gram neg ~/EnTrI/data/fasta-genome/chromosome/ENC.fa Convert the gbf file to embl using seqret: seqret Read and write (return) sequences Input (gapped) sequence(s): ~/Desktop/prokka/ENC_02162017.gbf output sequence(s) [fp929040.fasta]: embl::ENC_02162017.embl 1. Convert gff + fasta files to embl using seqret tool from EMBOSS package (If the fasta and gff files contain chromosome and plasmid sequences, separate them) seqret -sequence CS17p1.fna -feature -fformat gff -fopenfile CS17p1.gff -osformat embl -auto 2. If the CDS IDs are not standard (like in CS17p1-4 in our case): sed -i -e 's/CS17p1|SC|contig000001://g' CS17p1.embl 3. Remove locus tag "ERS227112_02466:gene" from "Klebsiella_pneumoniae_RH201207_v0.3.embl" embl file. This gene has two locus tags: "ERS227112_02466" and "ERS227112_02466:gene", which probably is a mistake in annotation, and the ":gene" at the end of this tag causes jackhmmer name this gene "gene" which is impossible to track. 4. python correct-annotations.py and manually correct the added genes. 5. Convert embl files to protein fasta files cd EnTrI/bin mkdir ../data/fasta-protein mkdir ../data/fasta-protein/chromosome for filename in ../data/embl/chromosome/*.embl; do embl2peptides.py $filename ../data/fasta-protein/chromosome/"$(basename $filename | cut -d. -f1).fasta"; done 6. Add SL3261 to the sequences by copying SL1344 and removing aroA, ycaL and cmk genes from it (Lars' NAR paper) and renaming sequences. 7. Make a fasta file with all sequences in it. cd EnTrI/data/fasta-protein/chromosome cat * > seqdb.fasta 8. Run homclust to cluster homologous proteins cd EnTrI/bin homclust.py ../data/fasta-protein/chromosome/Citrobacter_rodentium_ICC168_FN543502_v1.fasta ../data/fasta-protein/chromosome/seqdb.fasta ../results/homclust -ns 14 -s 5 9. Convert embl files to DNA fasta files cd EnTrI/bin mkdir ../data/fasta-dna mkdir ../data/fasta-dna/chromosome for filename in ../data/embl/chromosome/*.embl; do embl2dna.py $filename ../data/fasta-dna/chromosome/"$(basename $filename | cut -d. -f1).fasta"; done 10. Add SL3261 to the sequences by copying SL1344 and removing aroA, ycaL and cmk genes from it (Lars' NAR paper) and renaming sequences. 11. Make a fasta file with all sequences in it. cd EnTrI/data/fasta-dna/chromosome cat * > seqdb.fasta ============ RUN PHYLOSIFT & RAXML 1. cd EnTrI/results/phylosift 2. find ../../data/fasta-genome/chromosome/ -maxdepth 1 -name "*.fa" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \; 3. find ../../data/fasta-genome/chromosome/ -maxdepth 1 -name "*.fa" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \; 4.a. for f in PS_temp/*/alignDir/concat.codon.updated.1.fasta; do cat "$f" >> codon_alignment.fa; done 4.b. for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done 5. Run RaXML a. raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100 b. raxmlHPC -s codon_alignment.fa -n phylosift-n.raxmlbootstrap -m GTRCAT -p 1234 -f a -x 1234 -# 100 ============ RUN HIERANOID 1. Copy all the protein fasta sequences to program-bank/hieranoid/data/sequences and replace all '<>s with _prime(). 2. Copy the species tree to program-bank/hieranoid/data/tree, root it, and add node names to it. 3. Edit program-bank/hieranoid/Configurations/Configuration.pm 4. cd program-bank/hieranoid/ 5. perl hieranoid.pl 6. Copy n14.OGTree.txt to results/hieranoid and rename it to hieranoid-result.txt 7. python edit-hieranoid.py 8. python edit-hieranoid-with-inparalogs.py ============ FIND ESSENTIAL GENES 1. cd EnTrI/bin 2. Rscript calculate-insertion-index.R 3. Rscript calculate-insertion-index_not-trimmed.R ============ BENCHMARK DIFFERENT METHODS 1. Download the list of E.coli K-12 essential genes from ecogenes and save it as ecogene-k12.txt in results directory. 2. python find-k12-counterparts.py and put the results in results/ecogenecounterparts 3. Make monte-carlo directory in results and run Rscript montecarlo.R 4. Rscript benchmark-essentiality-calls.R ============ STUDY BIASES 1. Make logos for top 100 most frequent insertion sites: mkdir ../results/logos python3 make-logos-top-100.py 2. Generate a logo for the sequences resulted from the last step. cd EnTrI/results/logos weblogo -F pdf -A dna -f 100logos.txt -o ../../figures/100logo-prob.pdf -s large -U probability Get nucleotide compositions from count-nucleotides.py weblogo -F pdf -A dna -f 100logos.txt -o ../../figures/100logo-bits.pdf -s large --composition "{'A':23, 'C':27, 'G':27, 'T':23}" 3. python3 insertion-position-bias.py 4. Rscript insertion-position-bias.R 5. python3 check-biases.py 6. python3 check-biases_not-trimmed.py 7. Rscript check-biases-ii.R 8. Rscript compare-dbscan-gamma.R 9. Rscript compare-bias-predictions.R ============ EVOLUTION OF ESSENTIALITY 1. python3 define-core-accessory-hieranoid.py python3 define-core-accessory-hieranoid-fasta.py (edit to change thresholds to 80,90, and 100) python3 define-core-accessory-hieranoid-fasta-2.py --> used for generating table for important genes and testing wether they are core essential or not. 2. Make phylogenetic trees for the subsets of species (in code) and place them in bin/speciestrees, and bin/speciestrees-no-k12 # 3. python3 define-core-accessory-hieranoid-fdollop.py # 4. python3 define-core-accessory-hieranoid-ancestralii.py 5. python3 define-core-accessory-hieranoid-fitch.py cd EnTrI/results/define-core-accessory-hieranoid-fitch-cores/ mkdir alignments for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done mkdir profiles cd alignments for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done cd ../profiles/ cat * > corefam.hmm hmmpress corefam.hmm Go to homology folder on cluster qsub embl2peptide.sh qsub run_hmmsearch.sh qsub run_con-ess.sh scp fas31@abacus:/data/people/fas31/homology/homologs.count /home/fatemeh/EnTrI/results/homologs.count 6. Rscript upsetr-hieranoid.R 7. Annotate the phylogenetic tree 8. Correct tree-data.R and run it 9. Generate a table for core essential and ancestrally essential genes python3 table-core-genes.py ============ STUDY CLUSTERS 1. Plot cluster size distribution (This is not a general script and can only be used for this case) cd EnTrI/bin Rscript orfan-single-multiple.R 2. Run merge-clust-plot to add insertion indices to genes in clusters cd EnTrI/bin merge-clust-plot.py ../results/homclust/EFam-clusters/ ../results/biases/dbscan/ ../results/ecogene-k12.txt ../results/merge-clust-plot 3. Plot the insertion index distribution in different clusters (This is not a general script and can only be used for this case) cd EnTrI/bin Rscript clust2plot.R 4. Count multicopy genes that are essential when single copy: Rscript multicopy_ess-when-sing-copy.R ============ ENRICHMENT 1. Word enrichment in beneficial losses cd EnTrI/bin python study-beneficial-losses.py Rscript study-beneficial-losses.R # 2. Word enrichment in low GC genes: cd EnTrI/bin python study-low-gc.py Rscript study-low-gc.R # 3. Codon enrichment in 5' end of essential genes: cd EnTrI/bin python essential5prime.py Rscript essential5prime.R 4. KEGG analysis cd EnTrI/bin Rscript buildpathwayDB.R -o KEGG_organism_code -d path_to_save_database Set pathanalysis.R parameters: essentiality class, pdf file name, plot main. Rscript pathanalysis.R Rscript pathanalysis-coregenes.R -> change between core genes and ancestral genes. # 5. Difference between ancestralii, fdollop, and intersection cd EnTrI/bin python3 core-definition-method-difference.py Rscript core-definition-method-difference.R ============ STUDY INTERESTING GENES 1. python interesting_genes.py 2. python mark-duplications.py run for both universally-conserved_always-essential.tsv and sometimes-essential.tsv 3. Rscript interestinggenes_heatmap.R # 4. change the first bit of interestinggenes_heatmap (up to write.table) to add pathways to genes. # ============ ARE CONSERVED GENES ESSENTIAL? # 1. Download bacterial and archaeal core genes from: # https://figshare.com/articles/Systematically_identify_phylogenetic_markers_at_different_taxonomic_levels_for_bacteria_and_archaea/722713 # paper: # http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0077033 # 2. cd results/core-bacteria_and_archaea-genes # for filename in *.hmm; do hmmpress $filename; done # 3. for filename in *.hmm; do hmmscan --domtblout $filename.domtblout $filename ~/EnTrI/data/fasta-protein/chromosome/U00096.fasta; done # 4. make a file containing all genes. ============ Ideal insertion density? Run density-SL1344.R ============ ARE CONSERVED GENES MORE LIKELY TO BE ESSENTIAL? 1. Make HMM profiles for core genes and core essential genes cd EnTrI/results/define-core-accessory-hieranoid-fitch-cores mkdir alignments && mkdir profiles for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done cd alignments for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done cd ../profiles cat * > corefam.hmm hmmpress corefam.hmm cd .. mkdir hhmake for filename in alignments/*; do hhmake -i $filename -a hhmake/hhmake.out -M 50; done Do the same for EnTrI/results/define-core-accessory-hieranoid-fitch-core-essentials 2. Write cluster name, gene name, and pathways in the same file for ancestrally essential genes. python which-genes.py 3. Follow what we did on cluster: /data/people/fas31/homology/readme.txt 4. mkdir EnTrI/results/walking-hypergeometric-test Make essentiality.txt cd EnTrI/bin Rscript gsea.R -d essentiality.txt -l conservation.out -o ~/EnTrI/results/walking-hypergeometric-test 5. Do the same for endosymbionts and DEG. ============ COMPARISON OF ANCESTRALLY ESSENTIAL GENES AND ENDOSYMBIONTS 1. Download Buchnera, Sodalis, Wigglesworthia, Baumannia Cicadellinicola, Blochmania fasta and embl files. 2. for filename in ~/EnTrI/results/endosymbionts/embl/*.txt; do embl2peptides.py $filename ~/EnTrI/results/endosymbionts/fasta-protein/"$(basename $filename | cut -d. -f1).fasta"; done 3. find /home/fatemeh/EnTrI/results/endosymbionts/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \; 4. find /home/fatemeh/EnTrI/results/endosymbionts/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \; 5. for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done 6. raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100 7. Get essential genes from DEG and divide the file into different species and then run Hieranoid after rooting and editing the species tree. Follow the steps for running hieranoid and editing output from above. 8. python3 define-core-accessory-hieranoid-endosymbionts.py 9. cd EnTrI/results/endosymbionts/define-core-accessory-hieranoid/core-essential-genomes/ mkdir alignments && mkdir profiles for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done cd alignments for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done cd ../profiles cat * > corefam.hmm cd .. mkdir hhmake for filename in alignments/*; do hhmake -i $filename -o hhmake/"endo-$(basename $filename | cut -d. -f1).hmm" -M 50; done ============ COMPARISON OF ANCESTRALLY ESSENTIAL GENES AND DEG 1. Download DEG embl files and genomes (-enterobacteriaceae) and concat chromosomes. 2. find /home/fatemeh/EnTrI/results/deg/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \; 3. find /home/fatemeh/EnTrI/results/deg/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \; 4. for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done 5. raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100 6. Get essential genes from DEG and divide the file into different species and then run Hieranoid after rooting (for rooting separate gram negatives from gram positives in DEG) and editing the species tree. Follow the steps for running hieranoid and editing output from above. 7. python define-core-accessory-hieranoid-fitch-deg.py python3 define-core-accessory-hieranoid-deg.py 8. cd EnTrI/results/deg/define-core-accessory-hieranoid/core-essential-genomes/ mkdir alignments && mkdir profiles for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done cd alignments for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done cd ../profiles cat * > corefam.hmm cd .. mkdir hhmake for filename in alignments/*; do hhmake -i $filename -o hhmake/"deg-$(basename $filename | cut -d. -f1).hmm" -M 50; done 9. python venn-entero-deg-endo.py 1-3 10. python pathways-for-clusts.py 11. python venn-entero-deg-endo-4.py 12. python entero-deg-endo-heatmap-1.py 13. annotate the output with eggnog mapper 14. python entero-deg-endo-heatmap-2.py 15. Rscript entero-deg-endo-heatmap-3.R (use partition-seqdb.py in results/deg/fasta-proteins beforehand) 16. python entero-deg-endo-heatmap-4.py , 5, 6 17. Manually correct NAs from these files: heatmap.tsv, U00096.fasta.mapper.annotations, final-clusters-plus-single-groups.fasta.emapper.annotations, final-clusters-plus-single-groups, and /home/fatemeh/EnTrI/results/define-core-accessory-hieranoid/all/core-essential-genomes ============ COMPARISON OF ACCESSORY ESSENTIAL GENES accessory-essential-enterobacteriaceae_1..5 ============ WHY DO WE HAVE INSERTIONS IN 5' END? 1. python 5prime-start-codon.py ============ COMPARE CORE ESSENTIAL AND ANCESTRALLY ESSENTIAL 1. ancestral-not-core.py 2. ancestral-not-core.R ============ RUN DCGO 1. hmmscan --cut_ga --domtblout ~/EnTrI/results/dcgo/seqdb_Pfam-A.domtblout -o /dev/null ~/data/pfam/Pfam-A.hmm ~/EnTrI/data/fasta-protein/chromosome/seqdb.fasta 2. dcgo.py and dcgo.R ============ ARE BENEFICIAL LOSSES COPIED IN PLASMIDS? 1. for filename in ../data/embl/new-plasmid/*.embl; do embl2peptides.py $filename ../data/fasta-plasmid/"$(basename $filename | cut -d. -f1).fasta"; done 2. change locus tags so that they are similar to chromosomal genomes, concat genes of plasmids for the same chromosome, and rename files. 3. cat ~/EnTrI/results/homclust/EFam-HMMs*.hmm > ~/EnTrI/results/beneficialloss-plasmid/all.hmm 4. hmmpress all.hmm 5. hmmscan -o /dev/null --domtblout ~/EnTrI/results/beneficialloss-plasmid/hmmscan.domtblout -E 1e-10 ~/EnTrI/results/beneficialloss-plasmid/all.hmm ~/EnTrI/data/fasta-plasmid/seqdb.fasta 6. python beneficialloss-plasmid.py ============ Generate giant table #1. python giant-tab-1-del.py #2. cd results/giant-tab/clusters/ mkdir alignments && mkdir profiles for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done cd alignments for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done cd ../profiles cat * > corefam.hmm cd .. mkdir hhmake for filename in alignments/*; do hhmake -i $filename -o hhmake/"entcore-$(basename $filename | cut -d. -f1).hmm" -M 50; done #1. python giant-tab-1.py 1. cd EnTrI/results/all-hieranoid/ find fasta-genome/ -maxdepth 1 -name "*.fa*" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \; find fasta-genome/ -maxdepth 1 -name "*.fa*" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \; for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100 2. Copy all the protein fasta sequences to program-bank/hieranoid/data/sequences and replace all '<>s with _prime(). 3. Copy the species tree to program-bank/hieranoid/data/tree, root it, and add node names to it. 4. Edit program-bank/hieranoid/Configurations/Configuration.pm 5. cd program-bank/hieranoid/ 6. perl hieranoid.pl 7. for all protein sequences run cat * > seqdb.fasta to have all sequences together 8. edit the addresses and run python3 edit-hieranoid.py 9. run annotate-clust.py and run eggnog-mapper on the output (cluster-representatives.txt) to find gene names for clusters. 10. run giant-tab-1.py, giant-tab-2.py and giant-tab-3.R
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published