
💡 Ideas for the interpretability module of ZairaChem #49

Open
miquelduranfrigola opened this issue Oct 27, 2024 · 3 comments

@miquelduranfrigola
Member

Hi @JHlozek!

I am sharing below some ideas for the interpretability module of ZairaChem. This is a relatively unstructured issue; let's use it as the starting point for a more organized sub-project.

Note that we have a preliminary project called XAI4Chem developed by @HellenNamulinda that can be helpful, especially in terms of API design.

General location within ZairaChem

In my opinion, interpretability should not be done on existing ZairaChem models. Instead, we need to train models dedicated to this particular task. In that case, performance is obviously important, but it is OK if it is inferior to the ensemble ZairaChem performance.

Therefore, the interpretability module can go at the end of the pipeline, before or after Olinda. Placing it after would allow us to train XAI models on precalculated data, which can actually be interesting. At the moment, we don't know if it is better to train XAI models on original data or on precalculated data. Both options are appealing.

We'll have to produce dedicated plots for interpretability.

Shapley values

We need to apply Shapley value analysis. For this, I would go straight to XGBoost (tuned with Optuna). The real question is what descriptors to use, and I don't think there is a single good answer. I would recommend that we have different "interpreters": for example, a physicochemical interpreter, a substructure interpreter, etc. Generally, there is no point in having columns that are not human-readable/understandable or that cannot be mapped onto structures visually.

Based on my limited experience, we definitely need the following:

  • A short physchem descriptors list of about 10-20 descriptors like logP, MW, etc. I think Datamol has those.
  • A mid-size physchem descriptors list (~200), for example, all the RDKit descriptors. I think Datamol has those too.
  • A long physchem descriptors list, like PaDEL or Mordred. I am not a huge fan of those (not very interpretable), but they will help us know if we are missing something compared to the short and mid-size ones.
  • A Morgan fingerprint or similar that we can use to map bits to substructures. I personally don't find this very useful, but some people do.
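
As a concrete starting point for the short list, here is a sketch. The descriptor names are illustrative choices on my part, not a fixed spec; all of them exist as functions in RDKit's `Descriptors` module, and Datamol wraps the same calculations:

```python
# A hypothetical short "physchem interpreter": ~10 human-readable descriptors.
# Names below are RDKit Descriptors function names (Datamol exposes the same ones).
SHORT_PHYSCHEM = [
    "MolWt", "MolLogP", "TPSA", "NumHDonors", "NumHAcceptors",
    "NumRotatableBonds", "RingCount", "FractionCSP3", "NumAromaticRings",
    "HeavyAtomCount",
]

def descriptor_row(mol, names=SHORT_PHYSCHEM):
    # Assumes RDKit is installed; each name resolves to a function in Descriptors.
    from rdkit.Chem import Descriptors
    return [getattr(Descriptors, name)(mol) for name in names]
```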

Shapley analysis is pretty straightforward and the SHAP library is awesome.
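
For orientation, the end-to-end recipe is only a few lines. The XGBoost/SHAP calls in the comments follow their standard APIs, and `top_features` is a hypothetical helper for pulling out the globally most important descriptors:

```python
def top_features(shap_rows, feature_names, k=5):
    """Rank features by mean absolute SHAP value across molecules."""
    n = len(feature_names)
    mean_abs = [sum(abs(row[j]) for row in shap_rows) / len(shap_rows)
                for j in range(n)]
    order = sorted(range(n), key=lambda j: -mean_abs[j])[:k]
    return [feature_names[j] for j in order]

# With a descriptor matrix X and labels y in hand, the SHAP part is just:
#   import shap, xgboost as xgb
#   model = xgb.XGBClassifier().fit(X, y)          # Optuna would tune this step
#   shap_values = shap.TreeExplainer(model).shap_values(X)
#   print(top_features(shap_values, descriptor_names))
```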

Finally, it is perhaps worth considering the "feature permutation" method provided by AutoGluon. I've used it previously in other contexts and it is quite robust.
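
The idea behind feature permutation (what AutoGluon's `TabularPredictor.feature_importance` implements, if I remember its API correctly) is simple enough to sketch directly: shuffle one column and measure how much the score drops.

```python
import random

def permutation_importance(predict, score, X, y, col, n_repeats=5, seed=0):
    """Mean drop in score when column `col` is shuffled; larger = more important."""
    rng = random.Random(seed)
    base = score(predict(X), y)
    drops = []
    for _ in range(n_repeats):
        perm = [row[:] for row in X]           # copy so X is untouched
        shuffled = [row[col] for row in perm]
        rng.shuffle(shuffled)
        for row, value in zip(perm, shuffled):
            row[col] = value
        drops.append(base - score(predict(perm), y))
    return sum(drops) / n_repeats
```

An uninformative column shuffles to roughly zero importance, while a column the model relies on shows a clear drop.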

Counterfactuals

The only counterfactuals library I know of is ExMol, but I am sure there are more. In any case, ExMol is straightforward to use, both for physchem and fingerprint descriptors. I do like the idea behind ExMol although, in my hands, I've never been able to translate it into actionable feedback for chemists. I haven't tried much, though.

Attention

People are using attention layers on chemical language models or graph networks to assign attention scores to SMILES characters or nodes/edges in the graph. It makes a lot of sense, although I've never done it myself. I would definitely try it, especially if we find a good paper with code. However, I would not do it in the first iteration.

Large language models

Lately, I have been experimenting a lot with LLMs and I am quite impressed with their summarizing abilities. One thing to think about is the following: given a set of "important" features detected for a particular molecule (for example, certain physchem descriptors and certain substructures), can we generate a summary for the chemist that explains "why" the molecule is predicted to be active? I am quite sure that, with good prompt engineering, we should be able to do this. If we combine this with a good RAG system explaining exactly what each descriptor means, then I anticipate that the LLM can be quite powerful. Again, not the first thing to try, but this could be quite cool.
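
A minimal sketch of the prompt-assembly step. Everything here is hypothetical: the feature triples would come from the SHAP module, and the per-descriptor descriptions from the RAG lookup.

```python
def explain_prompt(smiles, features):
    """features: (name, shap_value, description) triples for the top descriptors."""
    lines = [
        f"Molecule: {smiles}",
        "Most influential features (positive values push toward 'active'):",
    ]
    for name, value, description in features:
        lines.append(f"  - {name} (SHAP {value:+.2f}): {description}")
    lines.append(
        "In 2-3 sentences aimed at a medicinal chemist, explain why the "
        "model predicts this molecule to be active."
    )
    return "\n".join(lines)
```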

Visualization

One thing we need to do for sure is effective visualization. For the physicochemical descriptors, I think SHAP has it very well covered and there is no need for new types of plots.

For substructure interpretation, the only good way is to map results onto chemical structures, which many people have done already. ExMol also offers an interesting PCA-based way of interpreting chemical space.

The bottom line is: I would not come up with new types of plots unless we find a strong need. I did try something new in this paper (based on 2D maps) and it did not add much value, so I would keep it simple.

Finally, let's explore the literature. I am not up to speed with it and I am sure there is interesting progress in the last year. What is clear to me is that, following the same philosophy of ZairaChem, there is no single good method and it will be better if we apply a few representative options. Choosing descriptors carefully will be key.

@JHlozek
Collaborator

JHlozek commented Dec 11, 2024

Updates on where this stands:

  • The base version of the XAI4Chem package highlights atoms in the chemical structure if they appear in any of the top 5 most important Morgan fingerprint substructures. This is usually the majority of the compound and not very informative. I've made a fork of the XAI4Chem repo and refactored the code so that the average SHAP value at each atom is calculated instead, with the colour intensity of the highlighting scaled accordingly.
  • I explored a recent method from the literature that Miquel suggested, called MolAnchors. After experimenting with the code, the approach only seemed to find a molecular anchor for 10-20% of the compounds queried, so I don't find this method very practical/insightful.
  • I started looking at PubChem fingerprints and MACCS keys, but neither seems to be readily reversible back to chemical structure. The ExMol package does seem to have this implemented for MACCS keys, though, so we can look there if we want to re-implement it.
  • I'll still be exploring the Shapley analysis module with the short/medium/long interpretable descriptors and will see how informative chemists find each option in the new year.
  • The interpretability module for highlighting substructures is on pause until the new year, when we can brainstorm some more. Miquel will also tinker on the side.
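
For reference, the per-atom averaging described in the first bullet can be isolated as a small pure function. The bit-to-atoms mapping would come from RDKit's `bitInfo` argument to `GetMorganFingerprintAsBitVect` (each entry lists the atom environments that set a bit); the names here are illustrative:

```python
def atom_shap(bit_atoms, bit_shap):
    """Average, at each atom, the SHAP values of the fingerprint bits covering it.

    bit_atoms: {bit: [atom_idx, ...]} derived from the Morgan bitInfo environments.
    bit_shap:  {bit: shap_value} for the fingerprint features.
    """
    sums, counts = {}, {}
    for bit, atoms in bit_atoms.items():
        for atom in atoms:
            sums[atom] = sums.get(atom, 0.0) + bit_shap.get(bit, 0.0)
            counts[atom] = counts.get(atom, 0) + 1
    # Highlight colour intensity when drawing can then be scaled by these averages.
    return {atom: sums[atom] / counts[atom] for atom in sums}
```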

@miquelduranfrigola
Member Author

Hi @JHlozek

Adding to the discussion a few articles/repos that may be useful for the structural interpretation of model outputs:

I hope this helps!

@JHlozek
Collaborator

JHlozek commented Dec 18, 2024

I've explored the Shapley analysis module using the different-length descriptor sets: Datamol (short), RDKit descriptors (medium), Mordred (long). Only the Datamol descriptors are readily interpretable. The plan is to largely stick with this short set, and I'll just check whether we may want to extend it with any individual features from the other packages.

I will also look further into the substructure aspect based on the suggested links in the previous post.
