
💡 Ideas for the interpretability module of ZairaChem #49

Open
miquelduranfrigola opened this issue Oct 27, 2024 · 3 comments

@miquelduranfrigola
Member

Hi @JHlozek!

I am sharing below some ideas for the interpretability module of ZairaChem. This is a relatively unstructured issue; let's use it as the starting point for a more organized sub-project.

Note that we have a preliminary project called XAI4Chem developed by @HellenNamulinda that can be helpful, especially in terms of API design.

General location within ZairaChem

In my opinion, interpretability should not be done on existing ZairaChem models. Instead, we need to train models dedicated to this particular task. In that case, performance is obviously important, but it is OK if it is inferior to the ensemble ZairaChem performance.

Therefore, the interpretability module can go at the end of the pipeline, before or after Olinda. Placing it after would allow us to train XAI models on precalculated data, which can actually be interesting. At the moment, we don't know if it is better to train XAI models on original data or on precalculated data. Both options are appealing.

We'll have to produce dedicated plots for interpretability.

Shapley values

We need to apply Shapley value analysis. For this, I would go straight to XGBoost (tuned with Optuna). The real question is what descriptors to use, and I don't think there is a single good answer. I would recommend that we have different "interpreters": for example, a physicochemical interpreter, a substructure interpreter, etc. Generally, there is no point in having columns that are not human-readable/understandable or that cannot be mapped onto structures visually.

Based on my limited experience, we definitely need the following:

  • A short physchem descriptors list of about 10-20 descriptors like logP, MW, etc. I think Datamol has those.
  • A mid-size physchem descriptors list (~200), for example, all the RDKit descriptors. I think Datamol has those too.
  • A long physchem descriptors list, like PaDEL or Mordred. I am not a huge fan of those (not very interpretable), but they will help us know if we are missing something compared to the short and mid-size ones.
  • A Morgan fingerprint or similar that we can use to map bits to substructures. I personally don't find this very useful, but some people do.
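
As a concrete starting point for the short list, here is a sketch. The descriptor names are illustrative choices on my part, not a fixed spec; all of them exist as functions in RDKit's `Descriptors` module, and Datamol wraps the same calculations:

```python
# A hypothetical short "physchem interpreter": ~10 human-readable descriptors.
# Names below are RDKit Descriptors function names (Datamol exposes the same ones).
SHORT_PHYSCHEM = [
    "MolWt", "MolLogP", "TPSA", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "RingCount", "FractionCSP3", "NumAromaticRings",
    "HeavyAtomCount",
]

def descriptor_row(mol, names=SHORT_PHYSCHEM):
    # Assumes RDKit is installed; each name resolves to a function in Descriptors.
    from rdkit.Chem import Descriptors
    return [getattr(Descriptors, name)(mol) for name in names]
```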

Shapley analysis is pretty straightforward and the SHAP library is awesome.
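
For orientation, the end-to-end recipe is only a few lines. The XGBoost/SHAP calls in the comments follow their standard APIs, and `top_features` is a hypothetical helper for pulling out the globally most important descriptors:

```python
def top_features(shap_rows, feature_names, k=5):
    """Rank features by mean absolute SHAP value across molecules."""
    n = len(feature_names)
    mean_abs = [sum(abs(row[j]) for row in shap_rows) / len(shap_rows)
                for j in range(n)]
    order = sorted(range(n), key=lambda j: -mean_abs[j])[:k]
    return [feature_names[j] for j in order]

# With a descriptor matrix X and labels y in hand, the SHAP part is just:
#   import shap, xgboost as xgb
#   model = xgb.XGBClassifier().fit(X, y)          # Optuna would tune this step
#   shap_values = shap.TreeExplainer(model).shap_values(X)
#   print(top_features(shap_values, descriptor_names))
```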

Finally, it is perhaps worth considering the "feature permutation" method provided by AutoGluon. I've used it previously in other contexts and it is quite robust.
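
The idea behind feature permutation (what AutoGluon's `TabularPredictor.feature_importance` implements, if I remember its API correctly) is simple enough to sketch directly: shuffle one column and measure how much the score drops.

```python
import random

def permutation_importance(predict, score, X, y, col, n_repeats=5, seed=0):
    """Mean drop in score when column `col` is shuffled; larger = more important."""
    rng = random.Random(seed)
    base = score(predict(X), y)
    drops = []
    for _ in range(n_repeats):
        perm = [row[:] for row in X]           # copy so X is untouched
        shuffled = [row[col] for row in perm]
        rng.shuffle(shuffled)
        for row, value in zip(perm, shuffled):
            row[col] = value
        drops.append(base - score(predict(perm), y))
    return sum(drops) / n_repeats
```

An uninformative column shuffles to roughly zero importance, while a column the model relies on shows a clear drop.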

Counterfactuals

The only counterfactuals library I know of is ExMol, but I am sure there are more. In any case, ExMol is straightforward to use, both for physchem and fingerprint descriptors. I do like the idea behind ExMol although, in my hands, I've never been able to translate it into actionable feedback for chemists. I haven't tried much, though.

Attention

People are using attention layers on chemical language models or graph networks to assign attention scores to SMILES characters or nodes/edges in the graph. It makes a lot of sense, although I've never done it myself. I would definitely try it, especially if we find a good paper with code. However, I would not do it in the first iteration.

Large language models

Lately, I have been experimenting a lot with LLMs and I am quite impressed with their summarizing abilities. One thing to think about is the following: given a set of "important" features detected for a particular molecule (for example, certain physchem descriptors and certain substructures), can we generate a summary for the chemist that explains "why" the molecule is predicted to be active? I am quite sure that, with good prompt engineering, we should be able to do this. If we combine this with a good RAG system explaining exactly what each descriptor means, then I anticipate that the LLM can be quite powerful. Again, not the first thing to try, but this could be quite cool.
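
A minimal sketch of the prompt-assembly step. Everything here is hypothetical: the feature triples would come from the SHAP module, and the per-descriptor descriptions from the RAG lookup.

```python
def explain_prompt(smiles, features):
    """features: (name, shap_value, description) triples for the top descriptors."""
    lines = [
        f"Molecule: {smiles}",
        "Most influential features (positive values push toward 'active'):",
    ]
    for name, value, description in features:
        lines.append(f"  - {name} (SHAP {value:+.2f}): {description}")
    lines.append(
        "In 2-3 sentences aimed at a medicinal chemist, explain why the "
        "model predicts this molecule to be active."
    )
    return "\n".join(lines)
```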

Visualization

One thing we need to do for sure is effective visualization. For the physicochemical descriptors, I think SHAP has it very well covered and there is no need for new types of plots.

For substructure interpretation, the only good way is to map results onto chemical structures, which many people have done already. ExMol also offers an interesting PCA-based way of interpreting chemical space.

The bottom line is: I would not come up with new types of plots unless we find a strong need. I did try something new in this paper (based on 2D maps) and it did not add much value, so I would keep it simple.

Finally, let's explore the literature. I am not up to speed with it and I am sure there is interesting progress in the last year. What is clear to me is that, following the same philosophy of ZairaChem, there is no single good method and it will be better if we apply a few representative options. Choosing descriptors carefully will be key.

@JHlozek
Collaborator

JHlozek commented Dec 11, 2024

Updates on where this stands:

  • The base version of the XAI4Chem package highlights atoms in the chemical structure if they appear in any of the top 5 most important Morgan fingerprint substructures. This is usually the majority of the compound and not very informative. I've made a fork of the XAI4Chem repo and refactored the code so that the average SHAP value at each atom is calculated instead, with the colour intensity of the highlighting scaled accordingly.
  • I explored a recent method from the literature that Miquel suggested, called MolAnchors. After experimenting with the code, the approach only seemed to find a molecular anchor for 10-20% of the compounds queried, so I don't find this method very practical/insightful.
  • I started looking at PubChem fingerprints and MACCS keys, but neither seems to be readily reversible back to chemical structure. The ExMol package does seem to have this implemented for MACCS keys, though, so we can look there if we want to re-implement it.
  • I'll still be exploring the Shapley analysis module with the short/medium/long interpretable descriptors and will see how informative chemists find each option in the new year.
  • The interpretability module for highlighting substructures is on pause until the new year, when we can brainstorm some more. Miquel will also tinker on the side.
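
For reference, the per-atom averaging described in the first bullet can be isolated as a small pure function. The bit-to-atoms mapping would come from RDKit's `bitInfo` argument to `GetMorganFingerprintAsBitVect` (each entry lists the atom environments that set a bit); the names here are illustrative:

```python
def atom_shap(bit_atoms, bit_shap):
    """Average, at each atom, the SHAP values of the fingerprint bits covering it.

    bit_atoms: {bit: [atom_idx, ...]} derived from the Morgan bitInfo environments.
    bit_shap:  {bit: shap_value} for the fingerprint features.
    """
    sums, counts = {}, {}
    for bit, atoms in bit_atoms.items():
        for atom in atoms:
            sums[atom] = sums.get(atom, 0.0) + bit_shap.get(bit, 0.0)
            counts[atom] = counts.get(atom, 0) + 1
    # Highlight colour intensity when drawing can then be scaled by these averages.
    return {atom: sums[atom] / counts[atom] for atom in sums}
```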

@miquelduranfrigola
Member Author

Hi @JHlozek

Adding to the discussion a few articles/repos that may be useful for the structural interpretation of model outputs:

I hope this helps!

@JHlozek
Collaborator

JHlozek commented Dec 18, 2024

I've explored the Shapley analysis module using the different-length descriptor sets: Datamol (short), RDKit descriptors (medium), Mordred (long). Only the Datamol descriptors are readily interpretable. The plan is to largely stick with this short set, and I'll just check whether we may want to extend it with any individual features from the other packages.

I will also look further into the substructure aspect based on the suggested links in the previous post.
