Compare-xAI
Compare-xAI is a library for benchmarking feature attribution / feature importance explainable AI (xAI) techniques using a battery of functional tests.
See our NeurIPS paper (under review).
You can browse the benchmark results directly at https://karim-53.github.io/cxai/
All data are located in data/:
The list of shortlisted tests is in data/01_raw/test.csv.
Information about the explainers can be found in data/03_experiment_output_aggregated/explainer.csv.
Raw score results are in data/03_experiment_output_aggregated/cross_tab.csv.
The data is also available as a single SQLite database file: data/04_sql/database.
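For a quick look, the CSV files can be loaded with pandas. The following is a minimal sketch that only assumes the file paths listed above; the column layouts are not documented here, so it simply prints the first rows:

```python
# Minimal sketch: inspect the aggregated benchmark data with pandas.
import pandas as pd

# List of shortlisted tests.
tests = pd.read_csv("data/01_raw/test.csv")
# Metadata about the benchmarked explainers.
explainers = pd.read_csv("data/03_experiment_output_aggregated/explainer.csv")
# Raw scores of every explainer on every test.
cross_tab = pd.read_csv("data/03_experiment_output_aggregated/cross_tab.csv")

for name, df in [("tests", tests), ("explainers", explainers), ("scores", cross_tab)]:
    print(name, df.shape)
    print(df.head(), "\n")
```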
Want to reproduce the results shown in our paper? Follow these instructions:
Install the required packages using:
pip install -r requirements.txt
Run the following command to evaluate all currently implemented explainers on all currently implemented tests.
python reset_experiment.py
The results are written to data/02_experiment_output/results.csv.
Now run the following command to aggregate the results into a more human-readable format.
python src/aggregate_data.py
This also generates the SQLite database data/04_sql/database, which is used by https://karim-53.github.io/cxai/ and aggregates all data: information about tests, explainers, papers, and the results of all experiments.
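The database can be explored with Python's built-in sqlite3 module. A minimal sketch is shown below; the table names are not documented here, so it simply lists them and loads the first one into pandas:

```python
# Minimal sketch: explore the aggregated SQLite database.
import sqlite3

import pandas as pd

conn = sqlite3.connect("data/04_sql/database")

# Discover the available tables (their names are not documented here).
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print("Tables:", tables)

# Load the first table into a DataFrame for inspection.
if tables:
    df = pd.read_sql_query(f"SELECT * FROM {tables[0]}", conn)
    print(df.head())

conn.close()
```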
Tip: reduce the list of explainers to benchmark by changing valid_explainers in src/explainer.py (see the sketch below). The same goes for the tests; see valid_tests in src/test.py.
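For illustration, a trimmed valid_explainers could look like the following; the class names and import paths are hypothetical, so check src/explainer.py for the actual list:

```python
# Hypothetical excerpt of src/explainer.py for a reduced benchmark run.
# Class names and import paths are illustrative only.
from explainers.saabas import Saabas
from explainers.tree_shap import TreeShap

valid_explainers = [
    Saabas,
    TreeShap,
    # Comment out the explainers you do not want to re-run, e.g.:
    # Maple,
    # Lime,
]
```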
Experiments were run on a standard computer (see the CPU-Z report) without a GPU.
Total execution time: 4h 18min 18sec.
To add a new explainer algorithm or a new test to the benchmark, please follow the instructions below.
To add a new explainer:
1. Create a Python script explainers/my_explainer.py.
2. Create a MyExplainer class that inherits from the Explainer superclass. Have a look at explainers/saabas.py to better understand how to implement the explainer (a minimal sketch follows this list). Also, do not hesitate to import a library and add it to requirements.txt.
3. In src/explainer.py, add MyExplainer to the list of valid_explainers.
4. Run reset_experiment.py, then run src/aggregate_data.py.
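The sketch below shows what such an explainer could look like. The attribute names, the import path of the Explainer superclass, and the explain() signature are assumptions; use explainers/saabas.py as the authoritative reference.

```python
# explainers/my_explainer.py -- minimal sketch of a new explainer.
# The attribute names and the explain() signature are assumptions;
# follow explainers/saabas.py for the real interface.
import numpy as np

from src.explainer import Explainer  # import path assumed


class MyExplainer(Explainer):
    name = 'my_explainer'
    supported_models = ('model_agnostic',)  # assumed metadata field

    def explain(self, dataset_to_explain, **kwargs):
        """Return one importance score per feature (uniform placeholder)."""
        X = np.asarray(dataset_to_explain)
        n_features = X.shape[1]
        self.attribution = np.full(n_features, 1.0 / n_features)
        self.importance = self.attribution  # global importance, also assumed
        return self.attribution
```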
To add a new test:
1. Create a Python script tests/my_test.py.
2. Create a MyTest class that inherits from the Test superclass. Have a look at tests/cough_and_fever.py to better understand how to implement the test (a minimal sketch follows this list). Also, do not hesitate to import a library and add it to requirements.txt.
3. In src/test.py, add MyTest to the list of valid_tests.
4. Run reset_experiment.py, then run src/aggregate_data.py.
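The sketch below shows what such a test could look like. The constructor arguments, the import path of the Test superclass, and the expected score format are assumptions; use tests/cough_and_fever.py as the authoritative reference.

```python
# tests/my_test.py -- minimal sketch of a new functional test.
# The attribute names and the score format are assumptions;
# follow tests/cough_and_fever.py for the real interface.
import numpy as np

from src.test import Test  # import path assumed


class MyTest(Test):
    name = 'my_test'
    description = 'The dummy (noise) feature should receive near-zero importance.'

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Toy dataset: feature 0 fully determines the target, feature 1 is noise.
        rng = np.random.default_rng(0)
        self.X = rng.normal(size=(200, 2))
        self.y = self.X[:, 0]

    def score(self, attribution=None, **kwargs):
        """Return a sub-score in [0, 1]: 1 means the noise feature was ignored."""
        attribution = np.asarray(attribution, dtype=float)
        noise_share = abs(attribution[1]) / (np.abs(attribution).sum() + 1e-12)
        return {'ignore_noise_feature': 1.0 - noise_share}
```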
Functional testing is a popular testing technique for software engineers. The following definition is adapted from the software engineering field to our intended usage in machine learning.
Functional tests are created by testers with no specific knowledge of the algorithm's internal modules, i.e., not the developers themselves. Therefore, the algorithm is considered a black-box and is executed from end to end.
Each functional test is intended to verify an end-user requirement on the xAI algorithm rather than a specific internal module. Thus, functional tests have the advantage of being applicable to different algorithms, as long as they respect the same input and output format. On the other hand, a failed test does not pinpoint the location of the error but rather attributes it to the algorithm as a whole.
Three metrics are used:
Comprehensibility: a high comprehensibility means that it is easy for a data scientist to interpret the explanation provided by an xAI algorithm without making errors. We compress all results from the shortlisted tests into this metric, as explained in the paper.
Portability: the number of tests the xAI algorithm can execute. xAI algorithms that accept different AI models or data structures have a higher portability.
Average execution time: the average time needed to run the tests.
Let's consider two examples:
maple: portability = 17 (high, i.e., it can explain different models), comprehensibility = 49.15% (bad)
tree_shap: portability = 11 (low, i.e., it can explain only a few models), comprehensibility = 74.15% (good)
It is up to the data scientist to choose between a general-purpose xAI algorithm with medium performance (like maple) and a specialized xAI algorithm whose explanations respect most of the known end-user requirements (like tree_shap).
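As a rough illustration of how these two numbers relate to the raw scores, the following sketch recomputes them per explainer from cross_tab.csv. The column names ('explainer', 'test', 'score') are assumptions about that file's layout, and the simple average only approximates the compression described in the paper:

```python
# Sketch: recompute portability and comprehensibility per explainer.
# Column names are assumptions about cross_tab.csv, not the documented schema.
import pandas as pd

cross_tab = pd.read_csv("data/03_experiment_output_aggregated/cross_tab.csv")

summary = cross_tab.groupby("explainer").agg(
    portability=("test", "nunique"),      # number of tests the explainer could run
    comprehensibility=("score", "mean"),  # average score over those tests
)
summary["comprehensibility"] *= 100  # express as a percentage
print(summary.sort_values("comprehensibility", ascending=False))
```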
The full original code is available in this repo. A summary of the experiment setup can be found at https://github.com/Karim-53/Compare-xAI/blob/main/data/03_experiment_output_aggregated/test.csv
The source code was inspired by https://github.com/abacusai/xai-bench and https://github.com/mtsang/archipelago
Please cite our work if you use code from this repo:
@article{belaid2022we,
title={Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI Evaluation Methods into an Interactive and Multi-dimensional Benchmark},
author={Belaid, Mohamed Karim and H{\"u}llermeier, Eyke and Rabus, Maximilian and Krestel, Ralf},
journal={arXiv preprint arXiv:2207.14160},
year={2022}
}