ZairaChem is the first library of Ersilia's family of tools devoted to providing out-of-the-box machine learning solutions for biomedical problems. In this case, we have focused on (Q)SAR models. (Q)SAR models take chemical structures as input and give as output predicted properties, typically pharmacological properties such as bioactivity against a certain target.
Both Ersilia and Zaira are cities described in Italo Calvino's book 'Invisible Cities' (1972). Ersilia is a "trading city" where inhabitants stretch strings from the corners of the houses to establish the relationships that sustain the life of the city. When the strings become too numerous, they rebuild Ersilia elsewhere, and their network of relationships remains. Zaira is a "city of memories". It contains its own past written in every corner, scratched in every pole, window and bannister.
Clone the repository to your local system:
git clone https://github.com/ersilia-os/zaira-chem.git
cd zaira-chem
From the terminal, run the installation script:
bash install_linux.sh
By default, a Conda environment named zairachem will be created. Activate it:
conda activate zairachem
ZairaChem runs as a command-line interface. To learn more about the available commands, see the help command:
zairachem --help
ZairaChem expects a comma- or tab-separated file containing two columns: a "smiles" column with the molecules in SMILES format and an "activity" column with the activity values.
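As a minimal sketch of the expected layout, the snippet below builds a small comma-separated table with the two columns named above. The molecules and labels are made up for illustration, and the helper name `to_csv_text` is hypothetical; it is not part of ZairaChem.

```python
import csv
import io

# Hypothetical example molecules with binary activity labels (illustrative only).
rows = [
    {"smiles": "CCO", "activity": 0},
    {"smiles": "c1ccccc1", "activity": 1},
    {"smiles": "CC(=O)Oc1ccccc1C(=O)O", "activity": 0},
]


def to_csv_text(rows):
    """Serialize rows to comma-separated text with 'smiles' and 'activity' columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["smiles", "activity"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)
```

Saving `csv_text` to a file such as input.csv would give a file in the two-column shape described above.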
To get started, let's load an example classification task from Therapeutics Data Commons:
zairachem example --file_name input.csv
This file can be split into train and test sets.
zairachem split -i input.csv
The command above will generate two files in your working directory, named train.csv and test.csv. By default, the train:test ratio is 80:20.
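To make the 80:20 ratio concrete, here is a rough sketch of a shuffled split in plain Python. It assumes a simple random split; the strategy that `zairachem split` actually uses is not described here and may differ. The function name `split_rows` and the toy rows are hypothetical.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows form the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: ten placeholder molecule identifiers with alternating labels.
rows = [("mol_%d" % i, i % 2) for i in range(10)]
train, test = split_rows(rows)
print(len(train), len(test))
```

With ten rows and the default fraction, eight rows go to the training set and two to the test set.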
You can train a model as follows:
zairachem fit -i train.csv -m model
This command will run the full ZairaChem pipeline and produce a model folder with processed data, model checkpoints, and reports. If no cut-off is specified for the classification, ZairaChem will establish an internal cut-off to separate Category 0 from Category 1. The output results will always provide the probability of a molecule being Category 1. Alternatively, you can set your preferred cut-off with the following command:
zairachem fit -i train.csv -c 0.1 -d low -m model
Here, '-c' sets the cut-off applied to the activity values and '-d' specifies the direction. If the direction is 'low', values <= c are labelled 1; if it is 'high', values >= c are labelled 1.
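The cut-off rule can be illustrated with a few lines of Python. This is only a sketch of the logic as stated above, not ZairaChem's internal code; the function name `binarize` and the sample values are assumptions made for the example.

```python
def binarize(values, cutoff, direction):
    """Label each activity value 1 or 0 given a cut-off and a direction.

    direction="low":  values <= cutoff are labelled 1
    direction="high": values >= cutoff are labelled 1
    """
    if direction == "low":
        return [1 if v <= cutoff else 0 for v in values]
    if direction == "high":
        return [1 if v >= cutoff else 0 for v in values]
    raise ValueError("direction must be 'low' or 'high'")


# Hypothetical activity values, e.g. potencies where lower means more active.
activities = [0.05, 0.1, 0.5, 2.0]
low_labels = binarize(activities, cutoff=0.1, direction="low")
high_labels = binarize(activities, cutoff=0.1, direction="high")
print(low_labels)   # [1, 1, 0, 0]
print(high_labels)  # [0, 1, 1, 1]
```

Note that a value exactly equal to the cut-off is labelled 1 in both directions, which follows from the inclusive comparisons in the rule.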
You can then run predictions on the test set:
zairachem predict -i test.csv -m model -o test
ZairaChem will run predictions using the checkpoints stored in model and store results in the test directory. Several performance plots will be generated alongside prediction outputs.
You can distill a more compact version of the model with the built-in Olinda (https://github.com/ersilia-os/olinda) pipeline:
zairachem distill -m path_to_zairachem_model -o model.onnx
You can then run predictions through the new Olinda ONNX model with the same ZairaChem CLI command:
zairachem predict -i test.csv -m model.onnx -o test
For further technical details, please read the ZairaChem page of the Ersilia gitbook, which describes each major step in the ZairaChem pipeline. The corresponding publication for the ZairaChem pipeline is available at https://doi.org/10.1038/s41467-023-41512-2.
If you use ZairaChem, please cite us:
@article{Turon2023,
author = {Turon, G. and Hlozek, J. and Woodland, J.G. and others},
title = {First fully-automated AI/ML virtual screening cascade implemented at a drug discovery centre in Africa},
journal = {Nat Commun},
volume = {14},
pages = {5736},
year = {2023},
doi = {10.1038/s41467-023-41512-2},
url = {https://doi.org/10.1038/s41467-023-41512-2}
}
Learn about the Ersilia Open Source Initiative!