Test Olinda #3

GemmaTuron · 2024-05-08T09:30:08Z

Let's test Olinda as it was originally developed and see what modifications we might need

JHlozek · 2024-05-15T06:53:47Z

I have managed to get the pipeline working for both the example in the notebook and for an Ersilia model.

The following dependencies needed updates:

tensorflow to version = "^2.9, <2.16.0"
ersilia to "^0.1.31"
pandas to "v1.3.5"

For the example notebook, I kept getting a runtime error that I still do not understand:
tuner_error.log
The error in the KerasTuner class whenever there is a second model trained (i.e. for the final model after the search process has completed).

After a lot of trial and error, I could workaround it by writing over the first model instantiation but this is maybe something to ask Ank if he has any insight for.

For an Ersilia model, I needed to make a few code modifications to correctly process the output format. For now it is specific to the eos97yu (PAMPA) model but will need an output adapter in future to standardise prediction output format across models.

JHlozek · 2024-05-27T06:37:52Z

I've implemented a few quality-of-life features:

Traditional Morgan fingerprints (without flat2grid)
Usage of a 100 molecule test set to speedily run through the whole pipeline
Conversion of the final model to ONNX with code for saving, loading and running models

miquelduranfrigola · 2024-05-28T06:30:13Z

This is fantastic @JHlozek . @GemmaTuron I am happy with merging the PR (if we haven't yet). Perhaps let's just tag the current version (i.e. the one developed by @leoank ) and then we build on top of this with the new additions, mostly focused on ZairaChem integration.

GemmaTuron · 2024-05-28T13:21:00Z

I've named v1 Ank's version so we can move on with merging the new code when it is ready

leoank · 2024-05-28T21:07:51Z

Let me know if you need any help with this!

JHlozek · 2024-06-05T06:51:50Z

I have updated Olinda to be able to specify the training set size (not just 100 or everything anymore) and worked through a bug around usage of the MorganFingerprints. I've also updated the demo notebook accordingly.

As discussed, I'll update the readme and the create the PR before moving onto ZairaChem.

Thanks, @leoank. I encountered a confusing issue during the final model training step which I'm hoping you may have some insight. I'll create a separate issue with some more information and tag you.

JHlozek · 2024-06-09T21:23:17Z

Hi @miquelduranfrigola, I've created the pull request for Olinda: #5
Once approved, I'll open two issues for Ank's insight.

I'm going to move onto the ZairaChem model implementation with 1k pre-calculated descriptors.

GemmaTuron · 2024-06-10T05:57:22Z

Hi @JHlozek
As we discussed in the last meeting, can you add a few simple test instructions to check that everything works?
thanks!

miquelduranfrigola · 2024-06-10T23:53:53Z

Thanks @GemmaTuron and @JHlozek 👌

GemmaTuron · 2024-06-12T07:08:55Z

test this notebook: https://github.com/ersilia-os/olinda/tree/main/notebooks

GemmaTuron · 2024-07-10T12:51:49Z

Hi @JHlozek

I am unable to install Olinda, the problem seems the Lap package but I've tried several options and I have not succeded. I've also revised the dependencies specified in pyproject and there are a few things that should be probably updated:

LAP not working (attached log)
RDKIT installing twice? We probably do not need the rdkit-pypi distribution
Pandas is commented out, do we need it or not?
The version does not match, we released v1.0.0 and now we are working in v2?
Let me know when you can have a look at this

olinda.txt

GemmaTuron · 2024-07-16T14:10:51Z

Hi,
I have made progress on the installation of Olinda (independent of ZairaChem at the moment). The Lap package was giving issues so I have switched to Lapx. To that end, I have also had to update Griddify, which is now available as a pip package using lapx as well instead of Lap

I am still running into issues with Keras and Tensorflow (see attached log) olinda.log
It seems we need Keras > 3 but it cannot go with our TensorFlow version (2.15) - I have not yet found a combination that works here. @JHlozek can you share the list of dependencies actually installed in your env? (with and without ZairaChem, if possible)

Thanks!

JHlozek · 2024-07-16T15:00:19Z

Hi @GemmaTuron,

Here is the conda env list for my Olinda_only install and then with Zairachem (where I am now doing all the development):
olinda_env.txt
zairachem_env.txt

To the previous points above:

RDKit: The double install is from the original code. I'm not sure why it is there though and I didn't want to change too much in the troubleshooting phase. Maybe @leoank can provide some insight if both are needed or if either are safe to remove?
Pandas is required for Olinda but it was conflicting with the model hub and it currently uses pandas from that installation. If Ersilia wants Olinda to eventually be a standalone package, then we'll need to specify the version here still.
Quite right for the version tags. I haven't worked with these in github before though. I believe I need to merge and rebase my local branch on the latest commit in Ersilia/olinda to have that in my local. I just need to figure this out and might ask for some pointers here to not lose my many new local commits.

GemmaTuron · 2024-07-16T15:14:34Z

Hi @JHlozek

That is very helpful. I have updated the pyproject.toml:

One single rdkit version
Pandas will no longer cause problems as Ersilia does not require pandas any longer, we can specify the version (not done yet)
Lap requirement substituted for Lapx (it was still installing Lap through Griddify, which I have now removed)
Version in pyproject now 0.2.0 - we can chat about best versioning tags

When I try to run your demo notebook, though, I run into issues with Keras (apparently, needs keras v3 to by pass this error: cannot import name 'ops' from 'keras') and Keras v3 does not work with Tensorflow 2.15.
But in your envs you do have keras AND tensorflow both on versions 2.15. You do not run into the same issue as I do?

GemmaTuron · 2024-07-17T10:31:50Z

Hi @JHlozek

Good news, I have been able to figure out the dependency hell. The latest problem was being caused by keras-nlp, which was installing the latest version as it was not specified. v0.14 only works with Keras v3, hence the clash. I have specified it in the pyproject.toml file as keras-nlp=0.12.0 and it now works fine.

Regarding the actual testing of olinda, I am able to run the demo notebook ONLY if I remove the clean=True flag from the distill function: student_model = distill(model, clean=True, num_data=10000)
I assume the clean is to remove files after the test, but it is actually deleting files before the model uses them, and hence it crashes. Is this a bug, or I am getting something wrong?

Thanks

Gemma

JHlozek · 2024-07-17T19:55:29Z

Hi @GemmaTuron

That is good news. I'm guessing keras-nlp has been updated in the last month but my laptop had a older version cached that it was drawing on.

For the cleaning issue, that is a bug. I noticed it when I started implementing ZairaChem compatibility so I've fixed the original code in a later commit.
Line 303 in distillation.py will check if the files exist before attempting to clean them.

Maybe next week we aim to test Olinda with ZairaChem on your side? I'll let you know when the code is more stable after I've implemented the sample_weights. I'm hoping this might also help prevent the model collapsing to the mean, otherwise I'll open the separate issue for further input.

leoank · 2024-07-17T21:26:16Z

Hey @JHlozek,

There were some issues with RDKit on Apple M1 chips when I started the project.
So, I defined two versions of RDKit in the dependecies with constraints. The constrainsts would only allow one version to get installed on Apple M1 and the other version to install on other systems. It's safe to remove the current version of rdkit works on all platforms.

miquelduranfrigola · 2024-07-17T23:07:32Z

Thanks @leoank this is good to know. Generally, pip installation of rdkit works in all platforms now.

GemmaTuron · 2024-07-18T10:23:49Z

Ok @JHlozek , sounds good!

Maybe we need to figure out how to start collaborating more closely, the changes you are making in your branch should probably be merged soon into the main code so we work from the same one. Let's see if we manage to discuss this next week.

GemmaTuron · 2024-08-19T09:17:31Z

Hello @JHlozek

What is the status of this? When will the code be merged into the main branch?

GemmaTuron · 2024-09-18T07:20:03Z

Olinda is almost ready, the next steps to close this as a production-ready tool:

Document usage of Olinda in the README files (@JHlozek)
Test Olinda and ZairaChem from Jason's fork (@GemmaTuron)
Benchmark Olinda from TDC models & inc. in gitbook (@JHlozek)

JHlozek · 2024-09-21T09:39:37Z

I have updated the README file for Olinda. I think we should move the more advanced customization section somewhere else, perhaps the gitbook? For now it is in olinda_customization.md

To test the pipeline: install zairachem from the JHlozek fork, train a fresh ZairaChem model, then follow the 'Usage' steps in the README to distill it.

Ongoing work:
I've been trying out different training setups, playing with the original zairachem training data and the ChEMBL reference library:

Different number of training epochs: 3 vs 30
No original ZairaChem training data
Validation using only ZairaChem data vs a random shuffle including ChEMBL data
Modifying the original ZairaChem predictions to correct the scores based on the experimental ground truth
Training directly on the original ZairaChem binary inputs

For each of these setups, I've varied the number of ChEMBL reference training points (1k, 10k, 50k, 100k) and how the training scores are weighted for Olinda (unweighted, weighting_by_predicted_class, weighting_by_zaira_vs_chembl_origin, combined_weighting_of_class_and_origin)

The above testing is complete for the high data scenario of the H3D plasmodium_NF54 model at a 0.5 uM cutoff. I'm working on condensing the metrics to a single plot and I'm now testing the same configs for a low-data caco model.

This data will be used to inform the choice of the default Olinda setup, which will then be benchmarked on TDC datasets.

miquelduranfrigola · 2024-09-23T21:58:42Z

Thanks @JHlozek!

GemmaTuron · 2024-10-03T11:51:23Z

Hi @JHlozek

I have tried from your forks but I am basically stuck at installing ZairaChem. I recall you saying you made some edits and it works for you, so please could you look at this issue?
Also can you confirm that you are able to install ZairaChem and in which machine? I'll pick this up when I am back on the 14th.

Thanks!

JHlozek · 2024-10-11T15:51:54Z

Hi @GemmaTuron

I've updated the ZairaChem issue you linked. I had the same problem with AutoGluon but I managed to find a temporary workaround for now.

I'm able to install my updated ZairaChem fork into both a Rocky Linux H3D workstation and my own Ubuntu laptop and run the fit/predict commands.
Let me know if it'll be helpful to jump into a call at some point (15th onwards) to tackle this further.

GemmaTuron · 2024-10-15T13:48:13Z

Hi @JHlozek

I get this message when I follow the instructions in your olinda fork:

from olinda import distill

student_model = distill("path_to_model")
student_model.save("path_to_save.onnx")

I am trying to distill a zairachem model I just trained, using the same zairachem environment that in principle has olinda installed as they should work together:

Traceback (most recent call last):
  File "/home/gturon/github/olinda.py", line 1, in <module>
    from olinda import distill
  File "/home/gturon/github/olinda.py", line 1, in <module>
    from olinda import distill
ImportError: cannot import name 'distill' from partially initialized module 'olinda' (most likely due to a circular import) (/home/gturon/github/olinda.py)

let me know what am I doing wrong if you have found this before! thx

JHlozek · 2024-10-15T17:56:14Z

Hi @GemmaTuron

Unfortunately, it doesn't look familiar. :/
I've reinstalled from your install_linux.sh script and the import worked on my laptop.

The only thing I can think of is that maybe our requirements.txt might be different if you're not installing directly from my ZairaChem fork?
Maybe also try to just 'import olinda' and see what happens?

GemmaTuron · 2024-10-16T06:55:55Z

Hi @JHlozek

When I try Olinda in an isolated environment this happens: indeed there seems to be a circular import between Olinda and ZairaChem which needs to be fixed. Ideally, Zairachem would install Olinda, but Olinda itself should remain independent from ZairaChem, as we want to distill all kinds of models, not only ZairaChem ones. Let's chat about this now

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gturon/github/olinda/src/olinda/__init__.py", line 6, in <module>
    from olinda.distillation import distill  # noqa: F401
  File "/home/gturon/github/olinda/src/olinda/distillation.py", line 25, in <module>
    from olinda.data import ReferenceSmilesDM, FeaturizedSmilesDM, GenericOutputDM
  File "/home/gturon/github/olinda/src/olinda/data/__init__.py", line 3, in <module>
    from olinda.data.featurized_smiles import FeaturizedSmilesDM  # noqa: F401
  File "/home/gturon/github/olinda/src/olinda/data/featurized_smiles.py", line 10, in <module>
    from olinda.featurizer import Featurizer, MorganFeaturizer
  File "/home/gturon/github/olinda/src/olinda/featurizer.py", line 15, in <module>
    from olinda.utils import get_package_root_path
  File "/home/gturon/github/olinda/src/olinda/utils.py", line 16, in <module>
    from olinda.models import zairachem
  File "/home/gturon/github/olinda/src/olinda/models/zairachem.py", line 2, in <module>
    import zairachem
ModuleNotFoundError: No module named 'zairachem'

miquelduranfrigola · 2024-10-16T09:15:03Z

Hi, as agreed, @GemmaTuron and I will look into it and report back.

GemmaTuron · 2024-10-16T16:00:42Z

ok the circular import was my fault, I have solved it.

I cannot run the pipeline because my models do not have Mordred descriptors (a whole another realm of reasons) and Olinda does not have modularity, which it should (see #12 ). On my end I am working to solve the Mordred issue meanwhile

GemmaTuron · 2024-11-05T17:56:11Z

Update here:

@JHlozek has improved Olinda's code which now checks which descriptors are being calculated. We will merge @JHlozek fork and take it from there to solve dependency issues with ZairaChem and Ersilia. I will close this issue which is too general and open issues as I start working on the code if there is need.

GemmaTuron added this to AI2050 May 8, 2024

GemmaTuron converted this from a draft issue May 8, 2024

GemmaTuron assigned JHlozek May 8, 2024

GemmaTuron mentioned this issue May 8, 2024

Distilled ZairaChem models in ONNX format ersilia-os/zaira-chem#32

Open

2 tasks

JHlozek moved this from Todo to In Progress in AI2050 May 9, 2024

JHlozek mentioned this issue May 9, 2024

Pip install fails #1

Closed

JHlozek moved this from In Progress to Done in AI2050 Jun 12, 2024

GemmaTuron self-assigned this Jun 12, 2024

GemmaTuron moved this from Done to In Progress in AI2050 Jun 12, 2024

GemmaTuron closed this as completed Nov 5, 2024

github-project-automation bot moved this from In Progress to Done in AI2050 Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test Olinda #3

Test Olinda #3

GemmaTuron commented May 8, 2024 •

edited

Loading

JHlozek commented May 15, 2024

JHlozek commented May 27, 2024

miquelduranfrigola commented May 28, 2024

GemmaTuron commented May 28, 2024

leoank commented May 28, 2024

JHlozek commented Jun 5, 2024

JHlozek commented Jun 9, 2024 •

edited

Loading

GemmaTuron commented Jun 10, 2024

miquelduranfrigola commented Jun 10, 2024

GemmaTuron commented Jun 12, 2024

GemmaTuron commented Jul 10, 2024

GemmaTuron commented Jul 16, 2024

JHlozek commented Jul 16, 2024

GemmaTuron commented Jul 16, 2024

GemmaTuron commented Jul 17, 2024

JHlozek commented Jul 17, 2024

leoank commented Jul 17, 2024

miquelduranfrigola commented Jul 17, 2024

GemmaTuron commented Jul 18, 2024 •

edited

Loading

GemmaTuron commented Aug 19, 2024

GemmaTuron commented Sep 18, 2024

JHlozek commented Sep 21, 2024

miquelduranfrigola commented Sep 23, 2024

GemmaTuron commented Oct 3, 2024

JHlozek commented Oct 11, 2024

GemmaTuron commented Oct 15, 2024

JHlozek commented Oct 15, 2024

GemmaTuron commented Oct 16, 2024

miquelduranfrigola commented Oct 16, 2024

GemmaTuron commented Oct 16, 2024

GemmaTuron commented Nov 5, 2024

Test Olinda #3

Test Olinda #3

Comments

GemmaTuron commented May 8, 2024 • edited Loading

JHlozek commented May 15, 2024

JHlozek commented May 27, 2024

miquelduranfrigola commented May 28, 2024

GemmaTuron commented May 28, 2024

leoank commented May 28, 2024

JHlozek commented Jun 5, 2024

JHlozek commented Jun 9, 2024 • edited Loading

GemmaTuron commented Jun 10, 2024

miquelduranfrigola commented Jun 10, 2024

GemmaTuron commented Jun 12, 2024

GemmaTuron commented Jul 10, 2024

GemmaTuron commented Jul 16, 2024

JHlozek commented Jul 16, 2024

GemmaTuron commented Jul 16, 2024

GemmaTuron commented Jul 17, 2024

JHlozek commented Jul 17, 2024

leoank commented Jul 17, 2024

miquelduranfrigola commented Jul 17, 2024

GemmaTuron commented Jul 18, 2024 • edited Loading

GemmaTuron commented Aug 19, 2024

GemmaTuron commented Sep 18, 2024

JHlozek commented Sep 21, 2024

miquelduranfrigola commented Sep 23, 2024

GemmaTuron commented Oct 3, 2024

JHlozek commented Oct 11, 2024

GemmaTuron commented Oct 15, 2024

JHlozek commented Oct 15, 2024

GemmaTuron commented Oct 16, 2024

miquelduranfrigola commented Oct 16, 2024

GemmaTuron commented Oct 16, 2024

GemmaTuron commented Nov 5, 2024

GemmaTuron commented May 8, 2024 •

edited

Loading

JHlozek commented Jun 9, 2024 •

edited

Loading

GemmaTuron commented Jul 18, 2024 •

edited

Loading