Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test Olinda #3

Closed
GemmaTuron opened this issue May 8, 2024 · 31 comments
Closed

Test Olinda #3

GemmaTuron opened this issue May 8, 2024 · 31 comments
Assignees

Comments

@GemmaTuron
Copy link
Member

GemmaTuron commented May 8, 2024

Let's test Olinda as it was originally developed and see what modifications we might need

@GemmaTuron GemmaTuron added this to AI2050 May 8, 2024
@GemmaTuron GemmaTuron converted this from a draft issue May 8, 2024
@JHlozek JHlozek moved this from Todo to In Progress in AI2050 May 9, 2024
@JHlozek JHlozek mentioned this issue May 9, 2024
@JHlozek
Copy link
Contributor

JHlozek commented May 15, 2024

I have managed to get the pipeline working for both the example in the notebook and for an Ersilia model.

  1. The following dependencies needed updates:
  • tensorflow to version = "^2.9, <2.16.0"
  • ersilia to "^0.1.31"
  • pandas to "v1.3.5"
  1. For the example notebook, I kept getting a runtime error that I still do not understand:
    tuner_error.log
    The error in the KerasTuner class whenever there is a second model trained (i.e. for the final model after the search process has completed).

After a lot of trial and error, I could workaround it by writing over the first model instantiation but this is maybe something to ask Ank if he has any insight for.

  1. For an Ersilia model, I needed to make a few code modifications to correctly process the output format. For now it is specific to the eos97yu (PAMPA) model but will need an output adapter in future to standardise prediction output format across models.

@JHlozek
Copy link
Contributor

JHlozek commented May 27, 2024

I've implemented a few quality-of-life features:

  1. Traditional Morgan fingerprints (without flat2grid)
  2. Usage of a 100 molecule test set to speedily run through the whole pipeline
  3. Conversion of the final model to ONNX with code for saving, loading and running models

@miquelduranfrigola
Copy link
Member

This is fantastic @JHlozek . @GemmaTuron I am happy with merging the PR (if we haven't yet). Perhaps let's just tag the current version (i.e. the one developed by @leoank ) and then we build on top of this with the new additions, mostly focused on ZairaChem integration.

@GemmaTuron
Copy link
Member Author

I've named v1 Ank's version so we can move on with merging the new code when it is ready

@leoank
Copy link
Collaborator

leoank commented May 28, 2024

Let me know if you need any help with this!

@JHlozek
Copy link
Contributor

JHlozek commented Jun 5, 2024

I have updated Olinda to be able to specify the training set size (not just 100 or everything anymore) and worked through a bug around usage of the MorganFingerprints. I've also updated the demo notebook accordingly.

As discussed, I'll update the readme and the create the PR before moving onto ZairaChem.

Thanks, @leoank. I encountered a confusing issue during the final model training step which I'm hoping you may have some insight. I'll create a separate issue with some more information and tag you.

@JHlozek
Copy link
Contributor

JHlozek commented Jun 9, 2024

Hi @miquelduranfrigola, I've created the pull request for Olinda: #5
Once approved, I'll open two issues for Ank's insight.

I'm going to move onto the ZairaChem model implementation with 1k pre-calculated descriptors.

@GemmaTuron
Copy link
Member Author

Hi @JHlozek
As we discussed in the last meeting, can you add a few simple test instructions to check that everything works?
thanks!

@miquelduranfrigola
Copy link
Member

Thanks @GemmaTuron and @JHlozek 👌

@JHlozek JHlozek moved this from In Progress to Done in AI2050 Jun 12, 2024
@GemmaTuron
Copy link
Member Author

test this notebook: https://github.com/ersilia-os/olinda/tree/main/notebooks

@GemmaTuron GemmaTuron self-assigned this Jun 12, 2024
@GemmaTuron GemmaTuron moved this from Done to In Progress in AI2050 Jun 12, 2024
@GemmaTuron
Copy link
Member Author

Hi @JHlozek

I am unable to install Olinda, the problem seems the Lap package but I've tried several options and I have not succeded. I've also revised the dependencies specified in pyproject and there are a few things that should be probably updated:

  • LAP not working (attached log)
  • RDKIT installing twice? We probably do not need the rdkit-pypi distribution
  • Pandas is commented out, do we need it or not?
  • The version does not match, we released v1.0.0 and now we are working in v2?
    Let me know when you can have a look at this

olinda.txt

@GemmaTuron
Copy link
Member Author

Hi,
I have made progress on the installation of Olinda (independent of ZairaChem at the moment). The Lap package was giving issues so I have switched to Lapx. To that end, I have also had to update Griddify, which is now available as a pip package using lapx as well instead of Lap

I am still running into issues with Keras and Tensorflow (see attached log) olinda.log
It seems we need Keras > 3 but it cannot go with our TensorFlow version (2.15) - I have not yet found a combination that works here. @JHlozek can you share the list of dependencies actually installed in your env? (with and without ZairaChem, if possible)

Thanks!

@JHlozek
Copy link
Contributor

JHlozek commented Jul 16, 2024

Hi @GemmaTuron,

Here is the conda env list for my Olinda_only install and then with Zairachem (where I am now doing all the development):
olinda_env.txt
zairachem_env.txt

To the previous points above:

  • RDKit: The double install is from the original code. I'm not sure why it is there though and I didn't want to change too much in the troubleshooting phase. Maybe @leoank can provide some insight if both are needed or if either are safe to remove?
  • Pandas is required for Olinda but it was conflicting with the model hub and it currently uses pandas from that installation. If Ersilia wants Olinda to eventually be a standalone package, then we'll need to specify the version here still.
  • Quite right for the version tags. I haven't worked with these in github before though. I believe I need to merge and rebase my local branch on the latest commit in Ersilia/olinda to have that in my local. I just need to figure this out and might ask for some pointers here to not lose my many new local commits.

@GemmaTuron
Copy link
Member Author

Hi @JHlozek

That is very helpful. I have updated the pyproject.toml:

  • One single rdkit version
  • Pandas will no longer cause problems as Ersilia does not require pandas any longer, we can specify the version (not done yet)
  • Lap requirement substituted for Lapx (it was still installing Lap through Griddify, which I have now removed)
  • Version in pyproject now 0.2.0 - we can chat about best versioning tags

When I try to run your demo notebook, though, I run into issues with Keras (apparently, needs keras v3 to by pass this error: cannot import name 'ops' from 'keras') and Keras v3 does not work with Tensorflow 2.15.
But in your envs you do have keras AND tensorflow both on versions 2.15. You do not run into the same issue as I do?

@GemmaTuron
Copy link
Member Author

Hi @JHlozek

Good news, I have been able to figure out the dependency hell. The latest problem was being caused by keras-nlp, which was installing the latest version as it was not specified. v0.14 only works with Keras v3, hence the clash. I have specified it in the pyproject.toml file as keras-nlp=0.12.0 and it now works fine.

Regarding the actual testing of olinda, I am able to run the demo notebook ONLY if I remove the clean=True flag from the distill function: student_model = distill(model, clean=True, num_data=10000)
I assume the clean is to remove files after the test, but it is actually deleting files before the model uses them, and hence it crashes. Is this a bug, or I am getting something wrong?

Thanks

Gemma

@JHlozek
Copy link
Contributor

JHlozek commented Jul 17, 2024

Hi @GemmaTuron

That is good news. I'm guessing keras-nlp has been updated in the last month but my laptop had a older version cached that it was drawing on.

For the cleaning issue, that is a bug. I noticed it when I started implementing ZairaChem compatibility so I've fixed the original code in a later commit.
Line 303 in distillation.py will check if the files exist before attempting to clean them.

Maybe next week we aim to test Olinda with ZairaChem on your side? I'll let you know when the code is more stable after I've implemented the sample_weights. I'm hoping this might also help prevent the model collapsing to the mean, otherwise I'll open the separate issue for further input.

@leoank
Copy link
Collaborator

leoank commented Jul 17, 2024

Hey @JHlozek,

There were some issues with RDKit on Apple M1 chips when I started the project.
So, I defined two versions of RDKit in the dependecies with constraints. The constrainsts would only allow one version to get installed on Apple M1 and the other version to install on other systems. It's safe to remove the current version of rdkit works on all platforms.

@miquelduranfrigola
Copy link
Member

Thanks @leoank this is good to know. Generally, pip installation of rdkit works in all platforms now.

@GemmaTuron
Copy link
Member Author

GemmaTuron commented Jul 18, 2024

Ok @JHlozek , sounds good!

Maybe we need to figure out how to start collaborating more closely, the changes you are making in your branch should probably be merged soon into the main code so we work from the same one. Let's see if we manage to discuss this next week.

@GemmaTuron
Copy link
Member Author

Hello @JHlozek

What is the status of this? When will the code be merged into the main branch?

@GemmaTuron
Copy link
Member Author

Olinda is almost ready, the next steps to close this as a production-ready tool:

  • Document usage of Olinda in the README files (@JHlozek)
  • Test Olinda and ZairaChem from Jason's fork (@GemmaTuron)
  • Benchmark Olinda from TDC models & inc. in gitbook (@JHlozek)

@JHlozek
Copy link
Contributor

JHlozek commented Sep 21, 2024

I have updated the README file for Olinda. I think we should move the more advanced customization section somewhere else, perhaps the gitbook? For now it is in olinda_customization.md

To test the pipeline: install zairachem from the JHlozek fork, train a fresh ZairaChem model, then follow the 'Usage' steps in the README to distill it.

Ongoing work:
I've been trying out different training setups, playing with the original zairachem training data and the ChEMBL reference library:

  • Different number of training epochs: 3 vs 30
  • No original ZairaChem training data
  • Validation using only ZairaChem data vs a random shuffle including ChEMBL data
  • Modifying the original ZairaChem predictions to correct the scores based on the experimental ground truth
  • Training directly on the original ZairaChem binary inputs

For each of these setups, I've varied the number of ChEMBL reference training points (1k, 10k, 50k, 100k) and how the training scores are weighted for Olinda (unweighted, weighting_by_predicted_class, weighting_by_zaira_vs_chembl_origin, combined_weighting_of_class_and_origin)

The above testing is complete for the high data scenario of the H3D plasmodium_NF54 model at a 0.5 uM cutoff. I'm working on condensing the metrics to a single plot and I'm now testing the same configs for a low-data caco model.

This data will be used to inform the choice of the default Olinda setup, which will then be benchmarked on TDC datasets.

@miquelduranfrigola
Copy link
Member

Thanks @JHlozek!

@GemmaTuron
Copy link
Member Author

Hi @JHlozek

I have tried from your forks but I am basically stuck at installing ZairaChem. I recall you saying you made some edits and it works for you, so please could you look at this issue?
Also can you confirm that you are able to install ZairaChem and in which machine? I'll pick this up when I am back on the 14th.

Thanks!

@JHlozek
Copy link
Contributor

JHlozek commented Oct 11, 2024

Hi @GemmaTuron

I've updated the ZairaChem issue you linked. I had the same problem with AutoGluon but I managed to find a temporary workaround for now.

I'm able to install my updated ZairaChem fork into both a Rocky Linux H3D workstation and my own Ubuntu laptop and run the fit/predict commands.
Let me know if it'll be helpful to jump into a call at some point (15th onwards) to tackle this further.

@GemmaTuron
Copy link
Member Author

Hi @JHlozek

I get this message when I follow the instructions in your olinda fork:

from olinda import distill

student_model = distill("path_to_model")
student_model.save("path_to_save.onnx")

I am trying to distill a zairachem model I just trained, using the same zairachem environment that in principle has olinda installed as they should work together:

Traceback (most recent call last):
  File "/home/gturon/github/olinda.py", line 1, in <module>
    from olinda import distill
  File "/home/gturon/github/olinda.py", line 1, in <module>
    from olinda import distill
ImportError: cannot import name 'distill' from partially initialized module 'olinda' (most likely due to a circular import) (/home/gturon/github/olinda.py)

let me know what am I doing wrong if you have found this before! thx

@JHlozek
Copy link
Contributor

JHlozek commented Oct 15, 2024

Hi @GemmaTuron

Unfortunately, it doesn't look familiar. :/
I've reinstalled from your install_linux.sh script and the import worked on my laptop.

The only thing I can think of is that maybe our requirements.txt might be different if you're not installing directly from my ZairaChem fork?
Maybe also try to just 'import olinda' and see what happens?

@GemmaTuron
Copy link
Member Author

Hi @JHlozek

When I try Olinda in an isolated environment this happens: indeed there seems to be a circular import between Olinda and ZairaChem which needs to be fixed. Ideally, Zairachem would install Olinda, but Olinda itself should remain independent from ZairaChem, as we want to distill all kinds of models, not only ZairaChem ones. Let's chat about this now

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gturon/github/olinda/src/olinda/__init__.py", line 6, in <module>
    from olinda.distillation import distill  # noqa: F401
  File "/home/gturon/github/olinda/src/olinda/distillation.py", line 25, in <module>
    from olinda.data import ReferenceSmilesDM, FeaturizedSmilesDM, GenericOutputDM
  File "/home/gturon/github/olinda/src/olinda/data/__init__.py", line 3, in <module>
    from olinda.data.featurized_smiles import FeaturizedSmilesDM  # noqa: F401
  File "/home/gturon/github/olinda/src/olinda/data/featurized_smiles.py", line 10, in <module>
    from olinda.featurizer import Featurizer, MorganFeaturizer
  File "/home/gturon/github/olinda/src/olinda/featurizer.py", line 15, in <module>
    from olinda.utils import get_package_root_path
  File "/home/gturon/github/olinda/src/olinda/utils.py", line 16, in <module>
    from olinda.models import zairachem
  File "/home/gturon/github/olinda/src/olinda/models/zairachem.py", line 2, in <module>
    import zairachem
ModuleNotFoundError: No module named 'zairachem'

@miquelduranfrigola
Copy link
Member

Hi, as agreed, @GemmaTuron and I will look into it and report back.

@GemmaTuron
Copy link
Member Author

ok the circular import was my fault, I have solved it.

I cannot run the pipeline because my models do not have Mordred descriptors (a whole another realm of reasons) and Olinda does not have modularity, which it should (see #12 ). On my end I am working to solve the Mordred issue meanwhile

@GemmaTuron
Copy link
Member Author

Update here:

@JHlozek has improved Olinda's code which now checks which descriptors are being calculated. We will merge @JHlozek fork and take it from there to solve dependency issues with ZairaChem and Ersilia. I will close this issue which is too general and open issues as I start working on the code if there is need.

@github-project-automation github-project-automation bot moved this from In Progress to Done in AI2050 Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

4 participants