- A structured data science pipeline for classification problems that handles scaling, sampling, and stratified k-fold cross validation (CV) with evaluation metrics.
- It is highly reusable, reducing the need for users to rewrite boilerplate code.
- Refer to sklearn_classification_pipeline.py for the full code
- For a pipeline without the custom k-fold cross validation (which makes testing faster), use sklearn_classifier_pipeline_optionalCV.py. Note that this file still supports sklearn's internal cross validation method (switched on or off by a parameter input). To use the custom stratified k-fold cross validation, use sklearn_classification_pipeline.py instead.
- Dataset can be downloaded at https://www.kaggle.com/mlg-ulb/creditcardfraud
- The customized pipeline works for all kinds of classification problems, including fraud detection problems that require oversampling techniques.
- The pipeline lets users do their own data cleansing and feature engineering beforehand, as long as df is a pandas DataFrame containing both the features and the response.
- As users might have a list of features to select, the pipeline accepts a varlist input, an array of feature names to use from the DataFrame. This complements any feature selection technique applied before the pipeline, such as forward/backward selection.
- The data pipeline also caters for stratified k-fold cross validation for any value k > 1.
- The pipeline supports various evaluation metrics (accuracy, sensitivity/recall, precision, specificity, F1 score, AUC value), and users can add more as required.
- Users can easily add new models to the pipeline as required without rerunning or rewriting much code.
- Pipeline allows for data scaling/standardization
- High customizability and reusability of the code, reducing the need to rewrite large chunks of code (which often leads to spaghetti code).
To run the pipeline, import sklearn_classification_pipeline.py (stratified CV) or sklearn_classifier_pipeline_optionalCV.py (CV can be switched off).
Create a new modelpipeline() class object.
Next, execute modelpipeline.runmodel(...) with the required parameters (see the usage sketch after the parameter list below):
- df - DataFrame that has gone through data cleaning, processing, and feature engineering, containing both the features and the response. No standardization/scaling is required as there is a built-in function for that.
- varlist - List of all variables/features to use in the model, including the response variable. This allows the user to do feature selection first and pass in only the relevant features before running the model.
- response - Name of the response variable, in string format.
- sampletype - For imbalanced data (an under-represented minority class), use 'naive', 'smote' or 'adasyn' oversampling. Any other string input means no oversampling is done.
- modelname - Type of model to run. Users can add more models as required via the if/else clause that checks this string input in the buildmodel function.
- text - Title text for the confusion matrix output in each iteration of the n-fold stratified cross validation.
- n-fold - Number of folds for the stratified cross validation.
- Note that sklearn_classifier_pipeline_optionalCV.py has an additional final parameter called CV. If CV=False, cross validation is switched off.
- Remember to save the returned dictionary object into a variable, e.g. results = modelpipeline.runmodel(...), so that the evaluation results can be kept and reused.
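A minimal usage sketch (hypothetical: the positional argument order follows the parameter list above, and the 'logistic' model name and the Kaggle column names are assumptions, not confirmed by the repo):

```python
import pandas as pd
from sklearn_classification_pipeline import modelpipeline

# creditcard.csv from the Kaggle link above, already cleaned/engineered
df = pd.read_csv('creditcard.csv')
varlist = ['V1', 'V2', 'V3', 'Amount', 'Class']  # features plus the response

pipeline = modelpipeline()
results = pipeline.runmodel(df, varlist, 'Class', 'smote', 'logistic',
                            'Logistic Regression with SMOTE', 5)
# with sklearn_classifier_pipeline_optionalCV.py, append CV=False to skip CV
```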
After the tests have finished, you can read the dictionary object storing the evaluation metric results. Here, results['final'] stores the averaged results of the k-fold cross validation, while the other key-value pairs store each individual iteration's result in a list.
The dictionary object returned will have results for each fold of k-fold cross validation. Evaluation metrics include:
- Accuracy
- Actual Accuracy (Optional - can be a hold out dataset for testing and can be other metrics other than accuracy)
- Sensitivity
- Specificity
- Precision
- f1 score
- (ROC) AUC value
- (PR) AUC value
- Averaged values of the eight metrics above, stored in the dictionary object under the 'final' key
Object template:

```python
{"accuracy": [...], "actual_accuracy": [...], "sensitivity": [...], "specificity": [...],
 "precision": [...], "f1": [...], "auc": [...], "pr_auc": [...], "final": {...}}
```
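For example, after a 5-fold run saved as results:

```python
print(results['accuracy'])  # list with one accuracy value per fold
print(results['final'])     # averaged metrics across all k folds
```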
For sklearn_classifier_pipeline_optionalCV.py, each result is returned as a string instead of a list of strings, as there is only one round of train-test. There is also the option to tweak the code to export the best model and the transformed train-test dataset for later use; by default this is not done, to free up memory.
In the last iteration, the ROC-AUC and PR-AUC curves are plotted for users to analyze. For the individual AUC values, refer to the dictionary object output.
Updated: 19 May 2020
Assuming a large number of training variables, forward selection can be used to prune the number of variables for model training.
Refer to forward_elim_binary.py for a function that does forward selection and (optional) oversampling.
Note that oversampling might cause problems during forward selection, as the training numpy matrix can become singular, making its inverse unsolvable.
```python
def forward_selection(df, sig_level, response, removelist, sampling='nil', testratio=0):
    """
    :param df: dataframe with both the training and response variables
    :param sig_level: significance level to accept/reject a variable during forward selection
    :param response: name of the response variable in the dataframe
    :param removelist: list of training variables to remove from the dataframe
    :param sampling: type of oversampling to use ('smote', 'naive' or 'nil'); default: no sampling
    :param testratio: proportion of the dataset to hold out before oversampling; default: 0
    :return: list of selected training variables
    """
```
Updated: 19 May 2020
Full credit to Vishal R for his Medium post "Feature selection — Correlation and P-value" and his backward elimination code, which I adapted.
Assuming a large number of training variables, backward selection can be used to prune the number of variables for model training.
Refer to backward_elim_binary.py for a function that does backward selection.
```python
def backward_elimination(x, Y, sl, columns):
    """
    :param x: numpy array of training variables
    :param Y: numpy array of the response variable
    :param sl: significance level as a float
    :param columns: list of column names in the same horizontal order as x
    :return: numpy array of the selected x AND list of selected training variables passing the significance level
    """
```