*Please set the working directory and install the required packages before running the code.
- Relevant datasets (synthetic data and breast cancer) of the thesis.
- Disclaimer: I do not own any rights to the "Breast Cancer Wisconsin (Diagnostic) Dataset".
- Original download link: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
- Clear instructions about the data preprocessing methodology
- Feature selection based on ANOVA and correlation.
- Data transformations (natural logarithm, square root, cube root)
- Normality tests
- Synthetic data generation
- Visualization of densities pre- and post-data-transformation
- Model selection of the preliminary study
- Evaluation of various configurations for each classifier
- Cross-validation and test performance
- Cross-validation box-and-whisker plots
- Baseline model optimization (of bagged SVMs and XGB) via multi-random-search with a custom function.
- Optimization of sampling algorithms via grid search.
- Tuned parameters: class distribution (alpha), number of neighbours.
- Considered sampling algorithms: Random Oversampling, Synthetic Minority Oversampling Technique, Adaptive Synthetic Sampling, Random Undersampling, Edited Nearest Neighbours, Neighbour Cleaning Rule.
- Optimization of costs for weighted bagged SVMs and weighted XGB via multi-random-search (can also be done via grid search).
- Comparison of results: heuristic weights vs. optimized costs
- Cross-validation results of sampling and cost-sensitive learning (CSL) implementation
- Cross-validation box-and-whisker plots
- Test performance with default parameters
- Test performance with optimized parameters
- Test performance over repeated experiments for non-deterministic sampling algorithms. Results from a single seed are not representative because performance varies across seeds. Hence, 100 experiments with different seeds are run and their mean performance is taken as the final statistic.
- Implementation of hybrid models: Underbagging (with Decision Trees), Underbagging (with SVMs), EasyEnsemble (original specification with AdaBoost), EasyEnsemble (with XGB, also called xEnsemble in some papers).
- Cross-validation results and box-and-whisker plots
- Test performance
- Two scripts (inspired by or adapted from other sources) for additional plots in the theory section (visualizing the implementation of sampling algorithms and CSL).
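
The repeated-seed evaluation listed above (100 runs with different seeds, mean taken as the final statistic) can be illustrated with a minimal sketch. This is not the code from the thesis: the toy one-feature data, the nearest-centroid classifier, and all function names (`random_undersample`, `fit_centroids`, `make_data`, and so on) are hypothetical stand-ins chosen so the example runs with the Python standard library only. It uses Random Undersampling as the non-deterministic sampler and balanced accuracy as the metric, both of which are assumptions made for illustration.

```python
import random
import statistics


def make_data(n_major, n_minor, seed):
    """Toy imbalanced data: one Gaussian feature per class (assumption)."""
    rng = random.Random(seed)
    X = [rng.gauss(0.0, 1.0) for _ in range(n_major)]
    X += [rng.gauss(2.0, 1.0) for _ in range(n_minor)]
    y = [0] * n_major + [1] * n_minor  # label 1 is the minority class
    return X, y


def random_undersample(X, y, seed):
    """Randomly keep as many majority samples as there are minority samples."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    kept = rng.sample(majority, len(minority))
    idx = minority + kept
    return [X[i] for i in idx], [y[i] for i in idx]


def fit_centroids(X, y):
    """Toy classifier: store the mean feature value of each class."""
    return {
        c: statistics.mean(x for x, label in zip(X, y) if label == c)
        for c in set(y)
    }


def predict(centroids, x):
    """Assign the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls for classes 0 and 1."""
    recalls = []
    for c in (0, 1):
        hits = [p == c for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(hits) / len(hits))
    return sum(recalls) / 2


X_train, y_train = make_data(200, 20, seed=0)
X_test, y_test = make_data(100, 10, seed=1)

# One experiment per seed: only the sampling step changes between runs.
scores = []
for seed in range(100):
    X_res, y_res = random_undersample(X_train, y_train, seed)
    model = fit_centroids(X_res, y_res)
    y_pred = [predict(model, x) for x in X_test]
    scores.append(balanced_accuracy(y_test, y_pred))

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)
print(f"balanced accuracy over {len(scores)} seeds: "
      f"mean={mean_score:.3f}, std={std_score:.3f}")
```

Reporting the standard deviation next to the mean shows how much a single-seed result could deviate from the averaged statistic.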