This repository contains the Python implementation of the methodology described in:
Heller, D., Szklarczyk, D. and von Mering, C.: Tree reconciliation combined with subsampling improves large scale inference of orthologous group hierarchies (2018) manuscript in preparation
A preprint of the article can be found on bioRxiv at https://doi.org/10.1101/417840
The current version of the pipeline (v0.4) is a Snakemake workflow written in Python 3, which relies on the Python tree library etetoolkit and the progress bar tqdm. By default the following software is used to compute and reconcile gene trees with species trees:
The binaries of the three tools are downloaded automatically using the snakemake rules specified in rules/tools.smk.
Input files are specified through the configuration file config.yaml, with parameters explained therein. As a small example we provide a dataset from the eggNOG database in the release section under data.tar.gz.
The software has been developed and tested on Linux (Ubuntu 12/16/18.04). Other Unix systems might be suitable as well but binaries will have to be adapted accordingly.
NOTE: If you cloned the repository prior to 13.11.2018, please make a fresh copy, as we applied BFG Repo-Cleaner to remove the example data from the repository history (it is now found under the release section).
The easiest way to use the pipeline is to create a python3 environment with the Anaconda/Miniconda distribution (see its installation instructions). Assuming that the distribution has been installed, the following commands create a new environment and install all the required dependencies:
# create a new environment named "smk"
conda create -n smk python=3.6
# activate the environment
source activate smk
# install the dependencies (snakemake, ete3, tqdm)
conda install -c bioconda -c conda-forge snakemake
conda install -c etetoolkit ete3 ete_toolchain
conda install -c conda-forge tqdm
Alternatively, the dependencies can be installed natively using pip or compiled from source by following the guides in their respective documentation.
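Whichever installation route is used, it can help to confirm that the three dependencies are importable from the active environment before running the workflow. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of this repository.

```python
from importlib.util import find_spec

# Import names of the dependencies listed above (snakemake, ete3, tqdm).
REQUIRED_MODULES = ("snakemake", "ete3", "tqdm")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be located in the active environment."""
    return [name for name in names if find_spec(name) is None]


# Report the result for the current interpreter.
print("missing modules:", missing_modules() or "none")
```

An empty result only shows that the modules can be found, not that their versions are compatible with the pipeline.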
The configuration file config.yaml is predefined with the input parameters for the small example included in data.tar.gz. The archive contains information regarding the Primates level of eggNOG and its two sublevels, Hominidae and Cercopithecoidea:
                               /-314294[prNOG-1][superfamily:Cercopithecoidea]
-9443[prNOG][order:Primates]--
                               \-9604[homNOG][family:Hominidae]
For the 15 member species of the Primates level (see data/9443.primates.species.tsv), the data directory includes FASTA sequences (in data/fastafiles) and orthologous group mappings (in data/orthologous_groups) as well as the clades (in data/clades).
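After extracting the archive, the layout described above can be checked with a few lines of Python. This is a sketch under the assumption that the archive unpacks into a `data` directory relative to the current working directory; the helper name `missing_paths` is hypothetical.

```python
from pathlib import Path

# Paths named in the description above, relative to the extraction directory.
EXPECTED_PATHS = (
    "data/9443.primates.species.tsv",
    "data/fastafiles",
    "data/orthologous_groups",
    "data/clades",
)


def missing_paths(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist below `root`."""
    root = Path(root)
    return [path for path in expected if not (root / path).exists()]
```

Calling `missing_paths()` from the repository root should return an empty list once the example dataset has been expanded.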
To run the Snakemake workflow:
- download the example dataset data.tar.gz from the release section
- expand the example dataset with tar -xzf data.tar.gz
- (opt) list the outstanding tasks with snakemake -n, or visualize them as an SVG graph with snakemake --dag | dot -Tsvg > dag.svg
- execute the tasks with snakemake
- (opt) create a snakemake report with snakemake --report report.html
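The steps above can also be scripted. The sketch below wraps them in Python with the standard library only; the function names are hypothetical, the snakemake flags are the ones quoted in this README, and the archive is assumed to be in the working directory.

```python
import subprocess
import tarfile


def extract_dataset(archive="data.tar.gz", dest="."):
    """Expand the example dataset (equivalent to `tar -xzf data.tar.gz`)."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)


def workflow_commands(cores=1, report="report.html"):
    """Return the dry run, the execution and the report command, in order."""
    return [
        ["snakemake", "-n"],
        ["snakemake", "--cores", str(cores)],
        ["snakemake", "--report", report],
    ]


def run_workflow(cores=1):
    """Run each command in turn and stop at the first failure."""
    for command in workflow_commands(cores):
        subprocess.run(command, check=True)
```

Calling `extract_dataset()` followed by `run_workflow(cores=1)` would reproduce the list above, assuming `snakemake` is on the PATH of the active environment.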
The software will read the test dataset with 100 OGs from data/orthologous_groups and resolve the hierarchical inconsistencies. After workflow completion (~2 min on a single core) the consistent OG definition can be found in test_output/consistent_ogs. To run a larger example with the complete clustering of the 15 species, change the input parameter in the config.yaml file to point at data/orthologous_groups_full. Be aware that this will require much more time, and multi-core execution is strongly recommended (~1h using 10 cores, i.e. snakemake --cores 10).
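To illustrate what a hierarchical inconsistency can look like, the sketch below uses a simplified working definition: an orthologous group (OG) at a sublevel is treated as consistent if all of its members belong to exactly one OG at the parent level. This definition, the function name and the toy identifiers are assumptions made for illustration; the code only detects such cases and is not the tree reconciliation approach that the pipeline uses to resolve them.

```python
def find_inconsistencies(parent_ogs, child_ogs):
    """Return sublevel OGs whose members are not nested in exactly one parent OG.

    Both arguments map an OG identifier to a set of protein identifiers.
    The result maps each inconsistent sublevel OG to the sorted parent labels
    its members fall into ("unassigned" marks proteins without a parent OG).
    """
    protein_to_parent = {}
    for og_id, members in parent_ogs.items():
        for protein in members:
            protein_to_parent[protein] = og_id

    inconsistent = {}
    for og_id, members in child_ogs.items():
        parents = {protein_to_parent.get(p, "unassigned") for p in members}
        if len(parents) > 1 or "unassigned" in parents:
            inconsistent[og_id] = sorted(parents)
    return inconsistent


# Toy data (hypothetical identifiers): C2 spans two parent-level OGs.
parent_level = {"P1": {"a", "b", "c"}, "P2": {"d", "e"}}
sub_level = {"C1": {"a", "b"}, "C2": {"c", "d"}}
print(find_inconsistencies(parent_level, sub_level))  # {'C2': ['P1', 'P2']}
```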
Feedback is always welcome. Feel free to write to [email protected]