Association between niche adaptation and evolution of carbohydrate active enzymes in Pectobacteriaceae
This repository contains supplementary information for analyses reported in Hobbs et al. (2023), exploring the diversity in the Carbohydrate Active enZyme (CAZyme) complement and association with the plant host range of Pectobacteriaceae.
You can find the full report, exploring the CAZomes here.
A citation for this work will be added once available. At present, please cite this repository as the source, using the DOI:10.5281/zenodo.7699655 and the authors (in order): Emma E. M. Hobbs1,2,3, Tracey M. Gloster1, Leighton Pritchard2.
- School of Biology and Biomedical Sciences Research Complex, University of St Andrews, North Haugh, St Andrews, Fife, KY16 9ST, UK
- Strathclyde Institute of Pharmacy and Biomedical Sciences, University of Strathclyde, Glasgow G4 0RE, UK
- Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
@misc{Hobbs2023,
author = {Emma E. M. Hobbs and Tracey M. Gloster and Leighton Pritchard},
title = {Association between niche adaptation and evolution of carbohydrate active enzymes in Pectobacteriaceae},
howpublished = {\url{https://hobnobmancer.github.io/SI_Hobbs_et_al_2024_Pecto/}},
year = {2023},
note = {Version 1. DOI:10.5281/zenodo.7699655}
}
To repeat analyses, run all commands provided in the walkthrough from the root of this directory.
All raw figure files presented in the complete report in the manuscript can be found in the results/ directory.
Owing to the size of the data sets used, the figures are compressed in the final manuscript. This remote repository contains the original full-size, high-resolution figures.
Additionally, some analyses are only briefly mentioned in the manuscript. The full method and results of these analyses are stored in this repository.
The complete analysis of the CAZyme complements (i.e. the CAZomes) is available in the Jupyter notebook.
Find a full list of the results here.
You can use this repository like a website, to browse and see how we performed the analysis, or you can download it to inspect, verify, reproduce and build on our analysis.
You can use git to clone this repository to your local hard drive:
git clone git@github.com:HobnobMancer/SI_Hobbs_et_al_2023_Pecto.git
Or you can download it as a compressed .zip archive from this link.
Please raise an issue at the corresponding GitHub page:
The structure of this repository:
.
├── LICENSE
├── README.md
├── _config.yml
├── data
│ ├── README.md
│ ├── cazomes
│ │ ├── coinfinder_pecto_fam_genomes
│ │ ├── coinfinder_pecto_fam_genomes_taxs
│ │ ├── pecto_fam_genomes
│ │ ├── pecto_fam_genomes_proteins
│ │ ├── pecto_fam_genomes_proteins_taxs
│ │ └── pecto_fam_genomes_taxs
│ ├── genomes
│ │ ├── classes.txt
│ │ └── labels.txt
│ ├── genomic_accessions
│ │ ├── genomes_for_coinfinder.txt
│ │ └── pectobact_accessions
│ ├── missing_genomes
│ └── tree
│   └── ani_tree
│     ├── anim_matrix.tab
│     ├── genomes.tab
│     ├── logs
│     ├── matrix_aln_lengths_4.tab
│     ├── matrix_aln_lengths_run4.pdf
│     ├── matrix_coverage_4.tab
│     ├── matrix_coverage_run4.pdf
│     ├── matrix_hadamard_4.tab
│     ├── matrix_hadamard_run4.pdf
│     ├── matrix_identity_4.tab
│     ├── matrix_identity_run4.pdf
│     ├── matrix_sim_errors_4.tab
│     ├── matrix_sim_errors_run4.pdf
│     ├── pyani_ani_tree.new
│     ├── pyani_ani_tree_taxs.new
│     └── scatter_identity_vs_coverage_run4.pdf
├── notebooks
│ ├── explore_pecto_dic_cazomes.html
│ ├── explore_pectobact_cazomes.html
│ └── explore_pectobact_cazomes.ipynb
├── requirements.txt
├── results
│ ├── cazome_size
│ │ └── cazome_sizes.csv
│ ├── cazy_classes
│ │ └── cazy_class_sizes.csv
│ ├── cazy_families
│ │ ├── cazy_fam_freqs.csv
│ │ ├── fam_freq_clustermap.svg
│ │ ├── paper_fam_freq_clustermap.png
│ │ ├── paper_fam_freq_clustermap_FILTERED.svg
│ │ ├── paper_genus_species_fam_freq_clustermap.png
│ │ ├── paper_pheno_genus_fam_freq_clustermap.png
│ │ └── unique_grp_fams.tsv
│ ├── cooccurring_families
│ │ ├── cooccurring_fams_freqs.csv
│ │ ├── fam_corr_M_filled.csv
│ │ ├── paper-cooccurring_fams_freqs.csv
│ │ ├── paper-pecto-cooccurring-families.svg
│ │ └── pecto-cooccurring-families.svg
│ ├── core_cazome
│ │ ├── core_cazome_freqs.csv
│ │ ├── genera_core_cazome.svg
│ │ └── genera_soft_hard_core_cazome.svg
│ └── pca
│   ├── PC1-vs-PC2
│   │ ├── pca_pc1_vs_pc2-genus.png
│   │ ├── pca_pc1_vs_pc2-loadings_plot.png
│   │ └── pca_pc1_vs_pc2-species.png
│   ├── PC1-vs-PC3
│   │ ├── pca_pc1_vs_pc3-genus.png
│   │ ├── pca_pc1_vs_pc3-loadings_plot.png
│   │ └── pca_pc1_vs_pc3-species.png
│   ├── PC1-vs-PC4
│   │ ├── pca_pc1_vs_pc4-genus.png
│   │ ├── pca_pc1_vs_pc4-loadings_plot.png
│   │ └── pca_pc1_vs_pc4-species.png
│   ├── PC2-vs-PC3
│   │ ├── pca_pc2_vs_pc3-genus.png
│   │ ├── pca_pc2_vs_pc3-loadings_plot.png
│   │ └── pca_pc2_vs_pc3-species.png
│   ├── PC2-vs-PC4
│   │ ├── pca_pc2_vs_pc4-genus.png
│   │ ├── pca_pc2_vs_pc4-loadings_plot.png
│   │ └── pca_pc2_vs_pc4-species.png
│   ├── PC3-vs-PC4
│   │ ├── pca_pc3_vs_pc4-genus.png
│   │ ├── pca_pc3_vs_pc4-loadings_plot.png
│   │ └── pca_pc3_vs_pc4-species.png
│   ├── pca_explained_variance.png
│   ├── pca_pc_screen_genus.svg
│   ├── pca_pc_screen_species.svg
│   └── pectobact_pca_scree.png
├── scripts
│ ├── README.md
│ ├── annotate_cazome
│ │ ├── get_cazy_cazymes.sh
│ │ ├── get_dbcan_cazymes.sh
│ │ └── run_dbcan.sh
│ ├── coevolution
│ │ ├── find_coevolving_pectobact.sh
│ │ ├── find_coevolving_pectobact_with_tax.sh
│ │ ├── pectobact_circular_network.R
│ │ └── pectobact_taxs_rectangular_network.R
│ ├── download
│ │ ├── annotate_genomes.sh
│ │ ├── build_cazyme_database.sh
│ │ ├── download_genomes.sh
│ │ ├── download_ms_genomes.sh
│ │ └── ident_missing_proteomes.py
│ ├── taxs
│ │ ├── add_ani_tax.py
│ │ └── add_taxs.sh
│ └── tree
│   ├── README.md
│   └── ani
│     ├── build_anim_tree.R
│     ├── build_anim_tree.sh
│     ├── parse_anim_tab.py
│     └── run_anim.sh
└── structure
28 directories, 94 files
You can use this archive to browse, validate, reproduce, or build on the phylogenomics analysis for the Hobbs et al. (2023) manuscript.
We recommend creating a conda environment specific for this activity, for example using the commands:
conda create -n pectobacteriaceae python=3.9 -y
conda activate pectobacteriaceae
conda install --file requirements.txt -y -c bioconda -c conda-forge -c predector
To use pyani in this analysis, version 0.3+ must be installed. At the time of development, pyani v0.3+ had to be installed from source; this can be done using the bash script install_pyani_v0-3x.sh (run from the root of this repository):
scripts/download/install_pyani_v0-3x.sh
The installation instructions for dbCAN v2.0.11 can be found here and were followed to install dbCAN for the analysis presented in the manuscript.
- Download datasets
  - download_genomes.sh - Search and download all Pectobacteriaceae genomes in NCBI
  - download_ms_genomes.sh - Download the genomes used in the manuscript
  - ident_missing_proteomes.py - Identify genomes where a .faa file was not available
  - annotate_genomes.sh - Predict proteomes using Prodigal
  - build_cazyme_database.sh - Build a local CAZyme db
- Annotate CAZomes
  - get_cazy_cazymes.sh - Retrieve CAZy family annotations from the local CAZyme db for Pectobacteriaceae
  - run_dbcan.sh - Run dbCAN version 2 on Pectobacteriaceae proteomes
  - get_dbcan_cazymes.sh - Parse dbCAN output
- Run ANI analysis and build dendrogram
  - run_anim.sh
  - build_anim_tree.sh
  - build_anim_tree.R
  - parse_anim_tab.py
- Add taxonomic classifications
  - add_taxs.sh
  - add_ani_tax.py
- Explore CAZome composition
  - explore_pectobact_cazomes.ipynb
- Compare trees
  - build_tanglegrams.R
- Identify networks of co-evolving CAZy families
  - find_coevolving_pectobact.sh
  - find_coevolving_pectobact_with_tax.sh
The original full-size, high-resolution figures are found in the results directory, and are also contained within the Jupyter notebooks used to run the analyses, which can be found here (the raw notebooks are for downloading and re-running locally; the website versions are for viewing the results):
Several of the data files required to repeat the analyses presented in the manuscript are available in this repository, in the data/ directory:
Use cazy_webscraper (Hobbs et al., 2022) to download all data from the CAZy database and compile the data into a local CAZyme database.
cazy_webscraper: local compilation and interrogation of comprehensive CAZyme datasets Emma E. M. Hobbs, Tracey M. Gloster, Leighton Pritchard bioRxiv 2022.12.02.518825; doi: https://doi.org/10.1101/2022.12.02.518825
# create a local CAZyme database
scripts/download/build_cazyme_database.sh <email>
This generated the local CAZyme database data/cazy/cazy_db.
cazomevolve was used to download all complete Pectobacteriaceae genomic assemblies (in genome sequence and protein sequence FASTA file format) from NCBI Assembly, by querying the NCBI Taxonomy database and retrieving all genomic assemblies linked to the Pectobacteriaceae (NCBI:txid1903410). To repeat this method, run the following command from the root of this directory:
# download Pectobacteriaceae genomes from GenBank
scripts/download/download_genomes.sh <email>
Note: With the continual addition of new genomic assemblies to the NCBI Assembly database, repeating the download of Pectobacteriaceae genomes may generate a different dataset to that presented in Hobbs et al. To repeat the analysis presented in the manuscript, run the following command from the root of the directory to configure ncbi-genome-download to download the 660 genomic assemblies of the genomes used in the manuscript:
scripts/download/download_ms_genomes.sh
In both cases, the downloaded genomic sequence files were written to the directory data/genomes, and the downloaded protein FASTA files were written to data/proteomes.
Not all genomic assemblies in NCBI are annotated, i.e. a proteome FASTA file (.faa file) is not available for all genomic sequences in NCBI.
To identify those genomes where a proteome FASTA file was not available, and thus was not downloaded, the Python script ident_missing_proteomes.py was run.
scripts/download/ident_missing_proteomes.py
The script generated a text file listing the genomic accession of each assembly for which a proteome FASTA file (.faa file) was not downloaded. The file was written to data/missing_genomes.
If using the 717 assemblies presented in the manuscript, proteome FASTA files were not available for 107 assemblies.
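As a rough illustration of this step (not the code in ident_missing_proteomes.py itself), the sketch below compares the accessions found in a genome directory with those found in a proteome directory. The file suffixes (.fna and .faa for the genome side), the helper names, and the assumption that each file name starts with an NCBI assembly accession are all hypothetical.

```python
from pathlib import Path
import re

# Assumption: file names start with an NCBI assembly accession, e.g. GCA_000000000.1
ACCESSION = re.compile(r"^(GC[AF]_\d+\.\d+)")


def accessions_in(directory, suffix):
    """Return the set of assembly accessions for files ending in `suffix`."""
    found = set()
    for path in Path(directory).glob(f"*{suffix}"):
        match = ACCESSION.match(path.name)
        if match:
            found.add(match.group(1))
    return found


def missing_proteomes(genome_dir, proteome_dir):
    """List accessions that have a genome FASTA file but no proteome FASTA file."""
    genomes = accessions_in(genome_dir, ".fna")
    proteomes = accessions_in(proteome_dir, ".faa")
    return sorted(genomes - proteomes)
```

Calling `missing_proteomes("data/genomes", "data/proteomes")` would then return the accessions that still need a predicted proteome, assuming the files follow the naming pattern above.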
The script annotate_genomes.sh coordinates running prodigal on all genome sequences where a proteome FASTA file could not be retrieved, and copies the predicted proteome FASTA file to the data/proteome directory.
scripts/download/annotate_genomes.sh
Use cazomevolve to identify CAZymes classified in the local CAZyme database for the Pectobacteriaceae genomes.
scripts/annotate_cazome/get_cazy_cazymes.sh
Two tab-delimited lists were created:
- Listing the CAZy family accession and genomic accession per line: data/cazomes/pecto_fam_genomes
- Listing the CAZy family, genomic accession and protein accession per line: data/cazomes/pecto_fam_genomes_proteins
Proteins in the downloaded protein FASTA files that were not listed in the local CAZyme database were written to data/cazomes/dbcan_input for Pectobacteriaceae.
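For readers who want to inspect these lists directly, here is a minimal sketch of loading the family and genome list into a per-genome CAZome. It assumes two tab-separated columns per line (CAZy family, then genomic accession), as described above; the function name and the toy accessions are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_cazomes(handle):
    """Map each genomic accession to the set of CAZy families found in it.

    Assumes two tab-separated columns per line: family, then genome.
    """
    cazomes = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        family, genome = row[0], row[1]
        cazomes[genome].add(family)
    return dict(cazomes)


# Toy data in the assumed layout; the accessions are invented.
example = io.StringIO("GH1\tGCA_000001.1\nPL1\tGCA_000001.1\nGH1\tGCA_000002.1\n")
cazomes = load_cazomes(example)
```

Using a set records only presence or absence of each family; to keep per-family counts (one line per CAZyme), a `collections.Counter` per genome would be the natural substitute.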
To retrieve the most comprehensive CAZome for each genome, protein sequences not found in the local CAZyme database were parsed by the CAZyme classifier dbCAN (Zhang et al. 2018), configured using cazomevolve.
Han Zhang, Tanner Yohe, Le Huang, Sarah Entwistle, Peizhi Wu, Zhenglu Yang, Peter K Busk, Ying Xu, Yanbin Yin; dbCAN2: a meta server for automated carbohydrate-active enzyme annotation, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W95–W101, https://doi.org/10.1093/nar/gky418
Run the following command from the root of this directory. Note: depending on the computer resources, this may take multiple days to complete.
Note: The following commands MUST be run from the directory containing the db directory created when installing dbCAN. The following commands presume the db directory is located in the root of this repository.
scripts/annotate_cazome/run_dbcan.sh
After running dbCAN, use the following commands to parse the output from dbCAN and add the predicted CAZy family annotations, protein accessions and genomic accessions to the tab delimited lists created above.
The command runs the cazomevolve command cazevolve_get_dbcan, which can be used to parse the output from dbCAN version 2 and version 3.
scripts/annotate_cazome/get_dbcan_cazymes.sh
At the end, two plain text files containing tab-separated data are generated.
The Pectobacteriaceae lists were written to:
data/cazomes/pecto_fam_genomes
data/cazomes/pecto_fam_genomes_proteins
The software package pyani (Pritchard et al., 2016) was used to perform an average nucleotide identity (ANI) comparison between all pairs of Pectobacteriaceae genomes, using the ANIm method.
Pritchard et al. (2016) "Genomics and taxonomy in diagnostics for food security: soft-rotting enterobacterial plant pathogens" Anal. Methods 8, 12-24
scripts/tree/ani/run_anim.sh
This created a pyani database in data/tree. Graphical outputs summarising the pyani analysis were written to results/tree/anim.
May need to run to double check what output is created, and how to get the tsv file for generating the dendrogram.
A dendrogram was reconstructed from the ANIm analysis using the bash script build_anim_tree.sh, which coordinated three steps: extracting the calculated ANI values from the local pyani database; replacing the pyani genome IDs with the NCBI genomic version accessions using the Python script parse_anim_tab.py; and running the R script build_anim_tree.R, which built a distance matrix and used hierarchical clustering (using the 'single' method) to build a dendrogram that was written in Newick format to data/tree/pyani_ani_tree.new:
scripts/tree/ani/build_anim_tree.sh
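To make the clustering step concrete, the sketch below shows single-linkage clustering on a small identity matrix, written in plain Python rather than R. It is not a port of build_anim_tree.R: the function name is hypothetical, the distance is assumed to be 1 minus the ANI identity, the matrix is assumed to be symmetric, and the Newick output carries topology only (no branch lengths).

```python
def single_linkage_newick(names, identity):
    """Cluster genomes by single linkage on distance = 1 - identity.

    `identity` is a square list of lists of ANI identities in [0, 1],
    assumed symmetric. Returns a Newick string without branch lengths.
    """
    # Each cluster is a pair: (Newick label, list of member indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(1.0 - identity[i][j] for i in a[1] for j in b[1])

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = (
            f"({clusters[x][0]},{clusters[y][0]})",
            clusters[x][1] + clusters[y][1],
        )
        # Delete y first (y > x) so that index x stays valid.
        del clusters[y]
        del clusters[x]
        clusters.append(merged)
    return clusters[0][0] + ";"
```

This naive version rescans all cluster pairs on every merge, which is fine for a toy example but slow for hundreds of genomes; a library routine for hierarchical clustering would be the practical choice at that scale.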
Download the GTDB database dump from the GTDB repository. Release 202.0 was used in the manuscript Hobbs et al. Save the database dump (TSV file) to the data/gtdb/ directory.
The bash script add_taxs.sh was used to coordinate running cazomevolve to add taxonomic information to each genomic accession, in every tab-delimited list of (i) CAZy family and genomic accession, and (ii) CAZy family, genomic accession and protein accession that was generated.
scripts/taxs/add_taxs.sh <user email address> <path to gtdb tsv file>
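The idea behind this step can be sketched as follows. This is an illustrative example only, not how cazomevolve implements it: the helper names, the output layout (genus and species appended as extra columns), and the 'unknown' placeholder are assumptions. The only format relied upon is the GTDB-style lineage string, in which ranks are separated by semicolons and prefixed with `g__` for genus and `s__` for species.

```python
def parse_lineage(lineage):
    """Extract genus and species from a GTDB-style lineage string."""
    ranks = dict(
        part.split("__", 1) for part in lineage.split(";") if "__" in part
    )
    return ranks.get("g", ""), ranks.get("s", "")


def add_taxonomy(rows, lineages):
    """Append genus and species columns to each (family, genome, ...) row.

    `lineages` maps genomic accession -> lineage string. Rows whose genome
    has no known lineage are tagged 'unknown' (a hypothetical convention).
    """
    annotated = []
    for row in rows:
        genome = row[1]
        genus, species = parse_lineage(lineages.get(genome, ""))
        annotated.append(list(row) + [genus or "unknown", species or "unknown"])
    return annotated
```

Keeping the lineage lookup as a plain dictionary makes it easy to fall back to another source, such as NCBI Taxonomy, for genomes that are absent from the GTDB release.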
Use the Python script add_ani_tax.py to add the taxonomic information to the reconstructed ANI trees.
scripts/taxs/add_ani_tax.py
Exploration of the CAZomes in the data set was performed within a Jupyter notebook, which is available in this repository (the raw notebook is for downloading and re-running locally; the website version is for viewing the results):
Specifically, the analyses in the notebook were executed using the module cazomevolve.cazome.explore, which contains functions for exploring the CAZome annotated by cazomevolve.
The R script build_rarefaction_plots.R was used to estimate the degree of diversity and completeness of the CAZome annotations in the dataset, specifically using the R package vegan (Dixon, 2003).
Dixon, P. (2003), VEGAN, a package of R functions for community ecology. Journal of Vegetation Science, 14: 927-930. https://doi.org/10.1111/j.1654-1103.2003.tb02228.x
To repeat the analysis, run the following command from the root of the repository after having run the Jupyter notebook (otherwise the script will be unable to find the necessary input files):
scripts/rare_factions/build_rarefaction_plots.R
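For intuition about what such a curve measures, the sketch below computes a permutation-based accumulation curve in Python: genomes are added in random order and the number of distinct CAZy families seen so far is averaged over many shuffles. It is neither a port of build_rarefaction_plots.R nor vegan's implementation; the function name, the default of 100 permutations, and the input layout (genome accession mapped to a set of families) are assumptions.

```python
import random


def rarefaction_curve(cazomes, permutations=100, seed=0):
    """Mean number of distinct families seen after sampling 1..N genomes.

    `cazomes` maps genome accession -> set of CAZy families.
    """
    rng = random.Random(seed)  # fixed seed so the curve is reproducible
    genomes = sorted(cazomes)
    totals = [0] * len(genomes)
    for _ in range(permutations):
        order = genomes[:]
        rng.shuffle(order)
        seen = set()
        for k, genome in enumerate(order):
            seen |= cazomes[genome]
            totals[k] += len(seen)
    return [total / permutations for total in totals]
```

A curve that flattens as more genomes are added suggests that most of the family diversity has been sampled, whereas a curve that is still climbing suggests that further genomes would add new families.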
Use the tool coinfinder (Whelan et al.) to identify CAZy families that are present in the genome together more often than expected by chance and lineage.
Fiona J. Whelan, Martin Rusilowicz, & James O. McInerney. "Coinfinder: detecting significant associations and dissociations in pangenomes." doi: https://doi.org/10.1099/mgen.0.000338
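As a much simpler companion to coinfinder, the sketch below counts how often each pair of CAZy families occurs in the same genome and lists pairs that are never observed apart. It performs no statistical test and makes no correction for lineage, so it does not reproduce coinfinder's method; the function names and the input layout (genome accession mapped to a set of families) are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(cazomes):
    """Count the genomes in which each pair of CAZy families appears together."""
    pairs = Counter()
    for families in cazomes.values():
        for pair in combinations(sorted(families), 2):
            pairs[pair] += 1
    return pairs


def always_together(cazomes):
    """Return family pairs that occur in exactly the same set of genomes."""
    family_counts = Counter(f for fams in cazomes.values() for f in fams)
    return sorted(
        pair
        for pair, n in cooccurrence_counts(cazomes).items()
        if n == family_counts[pair[0]] == family_counts[pair[1]]
    )
```

Raw co-occurrence counts are strongly shaped by shared ancestry, which is why a lineage-aware tool is used for the analysis itself.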
Generate circular trees and heatmaps:
To reproduce the output from coinfinder in the same structure as presented in the manuscript (i.e. a circular tree surrounded by a heatmap), overwrite the file network.R in coinfinder with the respective R script in scripts/coevolution, and use the corresponding bash script:
- network.R: scripts/coevolution/pectobact_circular_network.R
- bash: scripts/coevolution/find_coevolving_pectobact.sh
Generate linear trees and heatmaps, with taxonomy information:
The circular heatmap annotates each leaf of the tree with only the respective genomic version accession. To also list the taxonomic information on each leaf of the tree, overwrite the contents of the file network.R in coinfinder with the respective R script in scripts/coevolution, and use the respective bash script to configure coinfinder:
- network.R: scripts/coevolution/pectobact_taxs_rectangular_network.R
- bash: scripts/coevolution/find_coevolving_pectobact_with_tax.sh