💡 Ideas for the interpretability module of ZairaChem #49
Updates on where this stands:
Hi @JHlozek Adding to the discussion a few articles/repos that may be useful for the structural interpretation of model outputs:
I hope this helps!
I've explored the Shapley analysis module using descriptor sets of different lengths: datamol (small), RDKit descriptors (medium), and Mordred (large). Only the datamol descriptors are readily interpretable. The plan is to largely stick with this short set of descriptors, and I'll check whether we want to extend it with any individual features from the other packages. I will also look further into the substructure aspect based on the links suggested in the previous post.
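For reference, this is roughly how the datamol descriptor table can be assembled before fitting the explainer (a minimal sketch, assuming datamol's `compute_many_descriptors` helper; exact column names may differ between versions):

```python
import datamol as dm
import pandas as pd

def datamol_descriptor_table(smiles_list):
    """Build the short datamol physchem descriptor table for a list of SMILES."""
    rows = []
    for smi in smiles_list:
        mol = dm.to_mol(smi)
        # Keep row alignment even if a SMILES cannot be parsed
        rows.append(dm.descriptors.compute_many_descriptors(mol) if mol is not None else {})
    return pd.DataFrame(rows, index=smiles_list)

desc_df = datamol_descriptor_table(["CCO", "c1ccccc1O"])
print(desc_df.head())
```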
Hi @JHlozek!
I am sharing below some ideas for the interpretability module of ZairaChem. This is a relatively unstructured issue; let's just use it as a starting point for a more organized sub-project.
Note that we have a preliminary project called XAI4Chem developed by @HellenNamulinda that can be helpful, especially in terms of API design.
General location within ZairaChem
In my opinion, interpretability should not be done on existing ZairaChem models. Instead, we need to train models dedicated to this particular task. In that case, performance is obviously important, but it is OK if it is inferior to the ensemble ZairaChem performance.
Therefore, the interpretability module can go at the end of the pipeline, before or after Olinda. Placing it after would allow us to train XAI models on precalculated data, which can actually be interesting. At the moment, we don't know if it is better to train XAI models on original data or on precalculated data. Both options are appealing.
We'll have to produce dedicated plots for interpretability.
Shapley values
We need to apply Shapley value analysis. For this, I would go straight for XGBoost (with Optuna). The real question is what descriptors to use, and I don't think there is a single good answer. I would recommend that we have different "interpreters": for example, a physicochemical interpreter, a substructure interpreter, etc. Generally, there is no point in having columns that are not human-readable/understandable or that cannot be mapped onto structures visually.
Based on my limited experience, we definitely need the following:
Shapley analysis is pretty straightforward and the SHAP library is awesome.
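To make this concrete, here is a minimal sketch of what the Shapley step could look like (XGBoost tuned with Optuna, then explained with SHAP's TreeExplainer; the hyperparameter ranges, scoring, and function names are just placeholders):

```python
import optuna
import shap
import xgboost as xgb
from sklearn.model_selection import cross_val_score

def fit_interpretable_model(X, y):
    """Tune a small XGBoost classifier with Optuna, then compute SHAP values."""
    def objective(trial):
        params = {
            "max_depth": trial.suggest_int("max_depth", 2, 8),
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        }
        model = xgb.XGBClassifier(**params, eval_metric="logloss")
        return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)

    model = xgb.XGBClassifier(**study.best_params, eval_metric="logloss").fit(X, y)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    return model, shap_values

# shap.summary_plot(shap_values, X) then gives the standard beeswarm overview
```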
Finally, it is perhaps worth considering the "feature permutation" method provided by AutoGluon. I've used it previously in other contexts and it is quite robust.
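For reference, AutoGluon exposes permutation importance directly on a trained predictor; a minimal sketch, assuming pandas DataFrames with a binary "activity" label column:

```python
from autogluon.tabular import TabularPredictor

# train_df / test_df are pandas DataFrames with an "activity" label column
predictor = TabularPredictor(label="activity").fit(train_df)

# Permutation-based feature importance: each feature is shuffled in turn
# and the resulting drop in validation score is reported.
importance = predictor.feature_importance(test_df)
print(importance.head())
```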
Counterfactuals
The only counterfactuals library that I know of is ExMol, but I am sure there are more. In any case, ExMol is straightforward to use, both for physchem and fingerprint descriptors. I do like the idea behind ExMol although, in my hands, I've never been able to translate it into actionable feedback for chemists. I haven't tried much, though.
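For illustration, this is roughly how the ExMol workflow goes (from memory, so the callback signature may differ slightly between versions; `my_model_predict` is a placeholder for the trained model):

```python
import exmol

def predict_activity(smiles, selfies):
    """Wrap the trained model: return one activity score per molecule.
    (Placeholder; plug in the real predictor here.)"""
    return [my_model_predict(s) for s in smiles]  # hypothetical predictor

# Sample the chemical space around a molecule of interest
space = exmol.sample_space("CC(=O)Oc1ccccc1C(=O)O", predict_activity, batched=True)

# Extract counterfactuals: minimal structural changes that flip the prediction
cfs = exmol.cf_explain(space, nmols=4)
exmol.plot_cf(cfs)
```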
Attention
People are using attention layers on chemical language models or graph networks to assign attention scores to SMILES characters or nodes/edges in the graph. It makes a lot of sense, although I've never done it myself. I would definitely try it, especially if we find a good paper with code. However, I would not do it in the first iteration.
Large language models
Lately, I have been experimenting a lot with LLMs and I am quite impressed with their summarization capabilities. One thing to think about is the following: given a set of "important" features detected for a particular molecule (for example, certain physchem descriptors and certain substructures), can we generate a summary for the chemist that explains "why" the molecule is predicted to be active? I am quite sure that, with good prompt engineering, we should be able to do this. If we combine this with a good RAG system explaining exactly what each descriptor means, then I anticipate that the LLM can be quite powerful. Again, not the first thing to try, but this could be quite cool.
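To make the idea concrete, a hypothetical prompt-construction helper (no particular LLM provider assumed; all names here are placeholders):

```python
def build_explanation_prompt(smiles, top_features, descriptor_glossary):
    """Assemble a prompt asking an LLM to summarize why a molecule is predicted active.

    top_features: list of (feature_name, shap_value) pairs from the XAI step.
    descriptor_glossary: dict mapping descriptor names to plain-language definitions
    (this is where a RAG lookup would plug in).
    """
    feature_lines = "\n".join(
        f"- {name} (SHAP contribution {value:+.3f}): "
        f"{descriptor_glossary.get(name, 'no definition available')}"
        for name, value in top_features
    )
    return (
        f"Molecule (SMILES): {smiles}\n"
        f"The model predicts this molecule to be ACTIVE. "
        f"The most influential features were:\n{feature_lines}\n\n"
        "In 3-4 sentences aimed at a medicinal chemist, explain in plain language "
        "why these features plausibly drive the prediction. Do not overstate causality."
    )
```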
Visualization
One thing we need to do for sure is effective visualization. For the physicochemical descriptors, I think SHAP has it very well covered and there is no need for new types of plots.
For the substructure interpretation, the only good way is to map it onto chemical structures, which many people have done already. ExMol also offers an interesting PCA-based way of interpreting results.
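For the atom-level mapping, RDKit's similarity maps are a well-trodden option; a minimal sketch where `atom_weights` are hypothetical per-atom contributions (e.g. aggregated from fingerprint-bit SHAP values via the fingerprint's bitInfo):

```python
from rdkit import Chem
from rdkit.Chem.Draw import SimilarityMaps

def draw_atom_contributions(smiles, atom_weights):
    """Render a molecule with atoms colored by their (signed) contribution.

    atom_weights: one float per heavy atom, e.g. obtained by summing the SHAP
    values of the fingerprint bits each atom participates in (hypothetical
    aggregation scheme).
    """
    mol = Chem.MolFromSmiles(smiles)
    return SimilarityMaps.GetSimilarityMapFromWeights(mol, atom_weights)

# Example usage with dummy weights (one value per heavy atom of phenol)
fig = draw_atom_contributions("c1ccccc1O", [0.1, 0.0, -0.2, 0.0, 0.3, 0.1, 0.5])
```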
The bottom line is: I would not come up with new types of plots unless we find a strong need. I did try something new in this paper (based on 2D maps) and it did not add much value, so I would keep it simple.
Finally, let's explore the literature. I am not up to speed with it and I am sure there has been interesting progress over the last year. What is clear to me is that, following the same philosophy as ZairaChem, there is no single good method, and it will be better if we apply a few representative options. Choosing descriptors carefully will be key.