
Distilled ZairaChem models in ONNX format #32

Open
2 tasks
miquelduranfrigola opened this issue Dec 5, 2023 · 20 comments
@miquelduranfrigola
Member

miquelduranfrigola commented Dec 5, 2023

Motivation

ZairaChem models are large and will always be large, since ZairaChem uses an ensemble-based approach. Nonetheless, we would like to offer the opportunity to distill ZairaChem models for easier deployment, especially in online inference. We'd like to do it in an interoperable format such as ONNX.

The Olinda package

Our colleague @leoank already contributed a fantastic package named Olinda that we could, in principle, use for this purpose. Olinda takes an arbitrary model (in this case, a ZairaChem model) and produces a much simpler model, stored in ONNX format. Olinda uses a reference library to do the teacher/student training and is nicely coupled with other tools that @leoank developed such as ChemXOR for privacy-preserving AI and Ersilia Compound Embedding which provides dense 1024-dimensional embeddings.
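As a rough illustration of the teacher/student idea (not Olinda's actual API; all names below are made up for the sketch): the teacher labels a featurized reference library with soft predictions, and a much smaller student model is fit to reproduce those labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a large ensemble teacher (e.g. a ZairaChem model):
# here just a fixed nonlinear function of the input features.
def teacher_predict(X):
    return 1 / (1 + np.exp(-(X @ np.array([0.8, -0.5, 0.3]))))

# Reference library featurized into a dense matrix
# (e.g. Ersilia Compound Embedding vectors in the real pipeline).
X_ref = rng.normal(size=(1000, 3))
y_soft = teacher_predict(X_ref)  # soft labels from the teacher

# Student: a single linear map fit by least squares on the teacher's
# soft labels -- tiny, fast, and straightforward to export to ONNX.
w, *_ = np.linalg.lstsq(X_ref, y_soft, rcond=None)

# The student now approximates the teacher on new inputs.
X_new = rng.normal(size=(5, 3))
approx = X_new @ w
```

In the real pipeline the student is of course a richer network trained with Keras, but the data flow (teacher soft labels in, compact student out) is the same.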

Roadmap

  • Proof of principle: We first need to show that model distillation can be done based on ZairaChem's checkpoints.
  • Distillation module in ZairaChem: Then, we need to add a distillation module in ZairaChem, to be run at the end of model training, that performs the distillation procedure as part of the training pipeline.
@miquelduranfrigola miquelduranfrigola converted this from a draft issue Dec 5, 2023
@GemmaTuron GemmaTuron added this to AI2050 May 8, 2024
@GemmaTuron GemmaTuron moved this to Todo in AI2050 May 8, 2024
@GemmaTuron
Member

We will start by testing Olinda again @JHlozek (see: ersilia-os/olinda#3)

@GemmaTuron GemmaTuron moved this from Todo to In Progress in AI2050 May 10, 2024
@JHlozek
Collaborator

JHlozek commented Jun 12, 2024

I've been working on this and currently have Olinda installed in the ZairaChem environment (which required the usual dependency conflict resolution).
I have a version of the code that can invoke the ZairaChem pipeline and collect its output, so in principle this works.

There are still many improvements to work on next, including:

  • suppression of ZairaChem output
  • use of pre-calculated descriptors
  • merge the model training set with ChEMBL data for the Olinda training set

@miquelduranfrigola
Member Author

Thanks @JHlozek this is great.

@JHlozek
Collaborator

JHlozek commented Jul 4, 2024

Olinda updates:
The ZairaChem distillation process runs successfully, now with pre-calculated descriptors too.
This includes the above points of suppressing the extensive output produced by ZairaChem and merging the model training set with the pre-calculated reference descriptors.

As a test, I trained a model on H3D data up to June 2023, including 1k pre-calculated reference descriptors, and then predicted prospective compounds from the following 6 months. Here is a scatter plot comparing the distilled and ZairaChem model predictions, and a ROC curve for the distilled model on prospective data.

[Image: scatter plot of distilled vs. ZairaChem model predictions]

[Image: ROC curve for the distilled model on prospective data]

To facilitate testing, I have written code that will prepare a folder of pre-calculated descriptors for 1k molecules, which can be run in the first cell in the demo notebook.
For testing, perform the following steps:

  • Install ZairaChem with the following dependency changes:
    install_linux.txt (change to .sh file)
    requirements.txt
  • Install Olinda into the ZairaChem conda environment with:
    python3 -m pip install git+https://github.com/JHlozek/olinda.git
    It seems there is an issue in Docker with mixed git+https links, which I fixed by downgrading to requests==2.29.0
  • Install jupyter:
    conda install -c conda-forge jupyterlab
  • Open olinda/notebooks/demo_quick.ipynb and run the first cell to produce the pre-calculated descriptors for 1k molecules
  • Update the paths in the Distillation section for the ZairaChem model to be distilled and the save path for the ONNX model
  • Run all the Distillation cells

I suggest testing this and then closing #3 to keep the conversation centralized here.
I'll post next steps following this.

@JHlozek
Collaborator

JHlozek commented Jul 4, 2024

Next steps:

  • Performance testing for different sized pre-calculated sets (1k, 10k, 100k) both with and without the ZairaChem training set
  • Finish calculating 100k descriptor sets (just eos4u6p to go)
  • The base pipeline's output is a heavily nested list structure, which could be tweaked to improve usability
  • Investigate speed improvements. The following currently still need to be calculated at runtime:
    - mellody-tuner
    - treated descriptors (only minor)
    - manifolds
    - tabpfn
    - eos59rr
    - eos6m4j

@miquelduranfrigola
Member Author

This is very interesting and promising, @JHlozek !
There seems to be a tendency towards false negatives (upper-left triangle in your plot). This is interesting and hopefully can be ameliorated with (a) more data and/or (b) including the training set.
Great progress here! Exciting

@GemmaTuron
Member

Summary of the weekly meeting: the distilled models look good but there seems to be a bit of underfitting as we add external data, so we need to make the ONNX model a bit more complex.
In addition, we will look for data to validate the generalizability of the model - from H3D data (@JHlozek) and ChEMBL (@GemmaTuron)

@GemmaTuron
Member

Hi @JHlozek

I have a dataset that contains IC50 data for P. falciparum: over 17K molecules with Active (1) and Inactive (0) defined at two cut-offs (hc = high cut-off, 2.5 uM; lc = low cut-off, 10 uM). They are curated from ChEMBL - all public data.
I do not have the strain (it is a pool) but we can assume most of it will be in sensitive strains, and likely NF54.
Let me know if these are useful!

pfalciparum_IC50_hc.csv

pfalciparum_IC50_lc.csv

@miquelduranfrigola
Member Author

This looks pretty good @GemmaTuron - many thanks.

@JHlozek
Collaborator

JHlozek commented Jul 18, 2024

Some updates for Olinda that we spoke about yesterday. I have been working on improving the speed of the Olinda pipeline by addressing the list of steps above that need to be run at runtime. I am concurrently writing the script that can convert a given reference list of smiles into the expected directory structure.

  • eos59rr and eos6m4j: these are now pre-calculated and I've updated my fork of ZairaChem to check if the pre-calculated .np files exist before calculating them. The Ersilia models output vectors with a single dimension but the descriptors are natively two-dimensional. ZairaChem converts them in a post-processing step that I copy but we discussed considering changing the Ersilia model outputs in future.
  • Mellody-tuner: is run as part of the ZairaChem's setup pipeline. The steps of this pipeline that run mellody-tuner can be pre-calculated, which is now done within Olinda.
  • tabpfn: N_ensemble_configurations is now set to 1 within ZairaChem, which reduces this step's runtime to a quarter of what it was.
  • treated descriptors and manifolds do not take much time and I don't think are worth investing time in.
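The "use pre-calculated files if they exist" pattern behind the first bullet can be sketched as follows (the path and descriptor function are illustrative, not ZairaChem's actual code):

```python
import os

import numpy as np

def load_or_calculate(path, calculate_fn):
    """Return the descriptor matrix from `path` if it was pre-calculated,
    otherwise compute it and cache it for the next run."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return np.load(f)
    desc = calculate_fn()
    # Save through a file handle so the exact path (e.g. a .np
    # extension) is kept verbatim; np.save on a string path would
    # append ".npy" and the existence check above would then miss it.
    with open(path, "wb") as f:
        np.save(f, desc)
    return desc
```

On the second call for the same path, the (potentially expensive) `calculate_fn` is never invoked.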

Overall, the pipeline has gone from >10 hours for 50k reference molecules to ~45 minutes. Half an hour of this is still due to the tabpfn step, which we may want to discuss addressing further in future.

Next, I am working on implementing the sample weights to weight the original model's training set higher than the general reference smiles.

@miquelduranfrigola
Member Author

miquelduranfrigola commented Jul 22, 2024

Fantastic @JHlozek thanks for the updates.

RE:

  • Exactly, ZairaChem can reshape the descriptors on-the-fly. This is the easiest at the moment but we can consider embedding this step within Ersilia once we have figured out how to pass parameters (e.g. "flat" or "square").
  • About MELLODDY Tuner, sounds great. @GemmaTuron we definitely need to refactor MELLODDY Tuner as an Ersilia model.
  • Great. As an FYI, I am now in direct contact with the TabPFN developers and I am hopeful that we will have access to a faster version in the near future.
  • Good insight. Let's not invest time in manifolds and treated descriptors.

Let's address TabPFN in our meeting.

About the weighting scheme - does it seem difficult?

@JHlozek
Collaborator

JHlozek commented Jul 25, 2024

Thanks @miquelduranfrigola. Some more updates:

The weighting is now implemented and wasn't too difficult - the generators just need to return a third value which KerasTuner automatically treats as the weight. At the moment I find the proportion of training compounds to the reference library and use the inverse as the weight. I'm exploring extending this weighting scheme to account for the large difference between low-scoring and high-scoring compounds.
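The inverse-proportion weighting described here can be sketched like this (a simplified stand-in, not the actual Olinda code); a Keras-style generator would then yield these weights as the third element of each batch tuple:

```python
import numpy as np

def sample_weights(n_train, n_reference):
    """Weight the original training compounds by the inverse of their
    proportion in the combined set, so they are not swamped by the much
    larger reference library; reference compounds keep weight 1."""
    proportion = n_train / (n_train + n_reference)
    w_train = 1.0 / proportion
    return np.concatenate([
        np.full(n_train, w_train),   # original training set
        np.ones(n_reference),        # general reference smiles
    ])

# A generator yielding (X_batch, y_batch, weights_batch) is picked up
# by Keras/KerasTuner as per-sample weights during training.
```

For example, 100 training compounds mixed with 900 reference compounds gives each training compound a weight of 10.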

I now have 200k compounds pre-calculated. We should maybe start thinking about how we store and serve these (like from an S3 bucket).

@miquelduranfrigola
Member Author

Very interesting. Thanks @JHlozek

100% agree that we need to have a data store for ZairaChem descriptors, and the right place to put this is S3. In principle, it should not be too difficult - they are in HDF5 format, correct?

Tagging @DhanshreeA so she is in the loop.

@JHlozek
Collaborator

JHlozek commented Jul 25, 2024

@miquelduranfrigola

Most of the descriptors are .h5. The two bidd-molmap files are .np array files and then there are some txt files in the formats that ZairaChem expects. We might want to zip each fold to a single file.

The folder structure for each 50k fold of data is as follows:
reference_library.csv
/data/data.csv
/data/data_schema.json
/data/mapping.csv
/descriptors/cc-signaturizer/raw.h5
/descriptors/grover-embedding/raw.h5
/descriptors/molfeat-chemgpt/raw.h5
/descriptors/mordred/raw.h5
/descriptors/rdkit-fingerprint/raw.h5
/descriptors/eosce.h5
/descriptors/reference.h5
/descriptors/bidd-molmap_desc.np
/descriptors/bidd-molmap_fps.np

I'm going to remove the duplication of grover embedding by pointing the manifolds to /grover-embedding/raw.h5 instead of the separate reference.h5 file.

@miquelduranfrigola
Member Author

Fantastic. Definitely, we need to keep these as zip files in S3, and perhaps write a short script to fetch those files easily?
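A fetch script along those lines could look like the sketch below; the bucket URL and key layout are placeholders, since the real S3 location has not been decided yet.

```python
import os
import urllib.request

# Hypothetical layout: one zip archive per 50k-compound fold in a
# public S3 bucket. The bucket name and key pattern are placeholders.
BASE_URL = "https://example-bucket.s3.amazonaws.com"

def fold_url(fold_index):
    """Build the download URL for one pre-calculated descriptor fold."""
    return f"{BASE_URL}/zairachem-descriptors/fold_{fold_index:03d}.zip"

def fetch_fold(fold_index, dest_dir="."):
    """Download a fold's zip archive if it is not already present locally."""
    url = fold_url(fold_index)
    dest = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest
```

Unzipping the archive into the folder structure above would then reproduce the layout ZairaChem expects.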

@GemmaTuron
Member

Hi @JHlozek

As we near the completion of Olinda for ZairaChem, can you summarise in this issue the current status and performance of Olinda so we keep all information up to date and are able to close the issue once the tasks are completed?

Related to this issue, we are working on #46 and also on olinda issue 3 and olinda issue 7

@JHlozek
Collaborator

JHlozek commented Oct 17, 2024

The core of the Olinda pipeline is now complete and has been tested under various training configurations to identify a good initial setup for a v1 of the package (#46). Currently, the pipeline focuses on distilling ZairaChem models but can be extended to Ersilia Model Hub models once an adapter is developed to process the variable model outputs.

The resulting ONNX surrogate models are lightweight (<5 MB) and very fast while maintaining the majority of the ZairaChem model performance. A simple wrapper API is available for the ONNX models (https://github.com/JHlozek/olinda_api) to facilitate programmatic use; it will also be incorporated into ZairaChem CLI commands.

Next steps:

@miquelduranfrigola
Member Author

Thanks @JHlozek

@GemmaTuron
Member

I think this issue is ready to be closed once @JHlozek's forks have been merged?

@JHlozek
Collaborator

JHlozek commented Nov 15, 2024

I see that the Olinda README has a small merge conflict but I don't want to mess with the PR now.
The highlighted section should just be removed though as it is no longer relevant.

I'll get back to any questions/comments/suggestions that you raise when I am back in office on Tuesday or whenever you get to it (no pressure from my side).

@GemmaTuron GemmaTuron reopened this Nov 28, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in ZairaChem Nov 28, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in AI2050 Nov 28, 2024