Here is a summary of the currently suggested statistical tests for comparing classification algorithms. All work is implemented in Python.
A summary of the current literature is given in the slides below.
The suggestion in Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms (1998) is to perform a 5x2cv paired t-test. This was extended in Combined 5 × 2 cv F Test for Comparing Supervised Classification Learning Algorithms to the 5x2cv F-test, which the author of the original paper also recommends as the new standard.
These tests are implemented in mlxtend:
5x2 t-test (note: prefer the F-test extension; a minimal sketch of both is given below)
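As a hedged sketch (assuming two scikit-learn classifiers and a toy dataset in place of your own X, y), the mlxtend functions `paired_ttest_5x2cv` and `combined_ftest_5x2cv` can be used roughly as follows:

```python
# Sketch: comparing two scikit-learn classifiers with the 5x2cv tests from mlxtend.
# X, y stand in for your own feature matrix and label vector; a toy dataset is used here.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from mlxtend.evaluate import paired_ttest_5x2cv, combined_ftest_5x2cv

X, y = load_iris(return_X_y=True)
clf1 = LogisticRegression(max_iter=1000)
clf2 = DecisionTreeClassifier()

# 5x2cv paired t-test (Dietterich, 1998)
t, p_t = paired_ttest_5x2cv(estimator1=clf1, estimator2=clf2, X=X, y=y, random_seed=1)

# Combined 5x2cv F-test (the recommended extension)
f, p_f = combined_ftest_5x2cv(estimator1=clf1, estimator2=clf2, X=X, y=y, random_seed=1)

print(f"t-test: t={t:.3f}, p={p_t:.3f}")
print(f"F-test: F={f:.3f}, p={p_f:.3f}")
```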
Alternatively, if computational resources only allow each method to be trained a single time (rather than the ten fits required by the 5x2cv procedure above), the only acceptable statistical test appears to be McNemar's test, which is also implemented in mlxtend; a minimal sketch is given below.
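As a hedged sketch (assuming two already-fitted models and a held-out test set, with `y_test`, `y_pred1`, `y_pred2` standing in for the true and predicted labels), mlxtend's `mcnemar_table` and `mcnemar` can be used roughly as follows:

```python
# Sketch: McNemar's test on the predictions of two already-fitted classifiers.
# y_test, y_pred1, y_pred2 are illustrative placeholder arrays of true and predicted labels.
import numpy as np
from mlxtend.evaluate import mcnemar, mcnemar_table

y_test = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred1 = np.array([0, 1, 0, 0, 1, 1, 1, 1, 0, 1])
y_pred2 = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# 2x2 contingency table of correct/incorrect predictions for the two models
tb = mcnemar_table(y_target=y_test, y_model1=y_pred1, y_model2=y_pred2)

# Continuity-corrected chi-squared version; use exact=True for small samples
chi2, p = mcnemar(ary=tb, corrected=True)
print(f"chi2={chi2:.3f}, p={p:.3f}")
```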
If we are comparing two classifiers across several datasets, the suggestion in Statistical Comparisons of Classifiers over Multiple Data Sets (2006) is to perform a Wilcoxon signed-rank test on the per-dataset performance differences.
This is shown in:
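Independent of the notebook referenced above, a minimal sketch with SciPy might look as follows (the per-dataset accuracies are made-up illustrative numbers):

```python
# Sketch: Wilcoxon signed-rank test comparing two classifiers across datasets.
# Each entry is one classifier's accuracy on one dataset; values here are made up.
import numpy as np
from scipy.stats import wilcoxon

acc_clf_a = np.array([0.91, 0.85, 0.78, 0.88, 0.93, 0.81, 0.87])
acc_clf_b = np.array([0.89, 0.84, 0.80, 0.85, 0.90, 0.79, 0.86])

# Paired test on the per-dataset differences (one pair per dataset)
stat, p = wilcoxon(acc_clf_a, acc_clf_b)
print(f"W={stat:.3f}, p={p:.3f}")
```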
If we are comparing multiple methods across multiple datasets, the suggestions in Statistical Comparisons of Classifiers over Multiple Data Sets (2006) are to perform a Friedman test, followed either by Nemenyi post hoc analysis (when comparing all methods against each other) or by a family-wise error rate (FWER) correction (when comparing several methods against a single control classifier rather than making all pairwise comparisons).
Both tests are shown in the following Jupyter notebook; a minimal sketch is also given below.
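As a hedged sketch (assuming the scikit-posthocs package is available for the Nemenyi post hoc test; the score matrix is made-up illustrative data), the procedure might look roughly like this:

```python
# Sketch: Friedman test followed by Nemenyi post hoc analysis.
# Rows are datasets, columns are methods; the accuracies below are made-up numbers.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp  # assumes the scikit-posthocs package is installed

scores = np.array([
    [0.90, 0.87, 0.85],
    [0.82, 0.80, 0.79],
    [0.88, 0.85, 0.86],
    [0.93, 0.90, 0.88],
    [0.79, 0.78, 0.75],
    [0.85, 0.83, 0.84],
])

# Friedman test across the three methods (one column per method)
stat, p = friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])
print(f"Friedman: chi2={stat:.3f}, p={p:.3f}")

# Nemenyi post hoc analysis: matrix of pairwise p-values between all methods
print(sp.posthoc_nemenyi_friedman(scores))
```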
All credit to the original authors of the articles above.