- A structured data science pipeline for classification problems that handles scaling, sampling, and stratified k-fold cross validation (CV) with evaluation metrics.
- It is highly reusable, reducing the need for users to rewrite boilerplate code.
- Refer to sklearn_classification_pipeline.py for the full code
- For a pipeline without the custom k-fold cross validation (which makes testing faster), use sklearn_classifier_pipeline_optionalCV.py. Note that this file still supports sklearn's internal cross validation method (switched on or off by a parameter input). To use the custom stratified k-fold cross validation, use sklearn_classification_pipeline.py instead.
- Dataset can be downloaded at https://www.kaggle.com/mlg-ulb/creditcardfraud
- The customized pipeline works for all kinds of classification problems, including fraud detection problems that require oversampling techniques.
- The pipeline lets users do their own data cleansing and feature engineering beforehand, as long as df is a pandas DataFrame containing both the features and the response.
- As users might have a list of features to select, the pipeline accepts a varlist input, an array of feature names to use from the DataFrame. This complements any feature selection technique applied before the pipeline, such as forward/backward selection.
- The data pipeline also caters for stratified k-fold cross validation for any value k > 1.
- The pipeline supports various evaluation metrics (accuracy, sensitivity/recall, precision, specificity, F1 score, AUC value), and users can add more as required.
- Users can easily add new models to the pipeline as required without rerunning or rewriting much code.
- Pipeline allows for data scaling/standardization
- High customizability and reusability of the code, reducing the need to rewrite large chunks of code (which often leads to spaghetti code).
To run the pipeline, import sklearn_classification_pipeline.py (stratified CV) or sklearn_classifier_pipeline_optionalCV.py (CV can be switched off).
Create a new modelpipeline() class object.
Next, execute modelpipeline.runmodel(...) with the required parameters (see the usage sketch after the parameter list below):
- df - DataFrame that has gone through data cleaning, processing, and feature engineering, containing both the features and the response. No standardization/scaling is required as there is a built-in function for that.
- varlist - List of all variables/features to use in the model, including the response variable. This allows the user to do feature selection first and pass in only the relevant features before running the model.
- response - Name of the response variable, in string format.
- sampletype - For imbalanced data (an under-represented minority class), use 'naive', 'smote' or 'adasyn' oversampling. Any other string input means no oversampling is done.
- modelname - Type of model to run. Users can add more models as required via the if/else clause that checks this string input in the buildmodel function.
- text - Title text for the confusion matrix output in each iteration of the n-fold stratified cross validation.
- n-fold - Number of folds for the stratified cross validation.
- Note that sklearn_classifier_pipeline_optionalCV.py has an additional final parameter called CV. If CV=False, cross validation is switched off.
- Remember to save the returned dictionary object into a variable, e.g. results = modelpipeline.runmodel(...), so that the evaluation results can be kept and reused.
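A minimal usage sketch (hypothetical: the positional argument order follows the parameter list above, and the 'logistic' model name and the Kaggle column names are assumptions, not confirmed by the repo):

```python
import pandas as pd
from sklearn_classification_pipeline import modelpipeline

# creditcard.csv from the Kaggle link above, already cleaned/engineered
df = pd.read_csv('creditcard.csv')
varlist = ['V1', 'V2', 'V3', 'Amount', 'Class']  # features plus the response

pipeline = modelpipeline()
results = pipeline.runmodel(df, varlist, 'Class', 'smote', 'logistic',
                            'Logistic Regression with SMOTE', 5)
# with sklearn_classifier_pipeline_optionalCV.py, append CV=False to skip CV
```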
After the tests have finished, you can read the dictionary object storing the evaluation metric results. Here, results['final'] stores the averaged results of the k-fold cross validation, while the other key-value pairs store each individual iteration's result in a list.
The dictionary object returned will have results for each fold of k-fold cross validation. Evaluation metrics include:
- Accuracy
- Actual Accuracy (Optional - can be a hold out dataset for testing and can be other metrics other than accuracy)
- Sensitivity
- Specificity
- Precision
- f1 score
- (ROC) AUC value
- (PR) AUC value
- Averaged values of the eight metrics above, stored in the dictionary object under the 'final' key
Object template:

```python
{"accuracy": [...], "actual_accuracy": [...], "sensitivity": [...], "specificity": [...],
 "precision": [...], "f1": [...], "auc": [...], "pr_auc": [...], "final": {...}}
```
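For example, after a 5-fold run saved as results:

```python
print(results['accuracy'])  # list with one accuracy value per fold
print(results['final'])     # averaged metrics across all k folds
```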
For sklearn_classifier_pipeline_optionalCV.py, each result is returned as a string instead of a list of strings, as there is only one round of train-test. There is also the option to tweak the code to export the best model and the transformed train-test dataset for later use; by default this is not done, to free up memory.
In the last iteration, the ROC-AUC and PR-AUC curves are plotted for users to analyze. For the individual AUC values, refer to the dictionary object output.
Updated: 19 May 2020
Assuming a large number of training variables, forward selection can be used to prune the number of variables for model training.
Refer to forward_elim_binary.py for a function that does forward selection and (optional) oversampling.
Note that oversampling might cause problems during forward selection, as the training numpy matrix can become singular, making its inverse unsolvable.
```python
def forward_selection(df, sig_level, response, removelist, sampling='nil', testratio=0):
    """
    :param df: dataframe with both the training and response variables
    :param sig_level: significance level to accept/reject a variable during forward selection
    :param response: name of the response variable in the dataframe
    :param removelist: list of training variables to remove from the dataframe
    :param sampling: type of oversampling to use ('smote', 'naive' or 'nil'); default: no sampling
    :param testratio: proportion of the dataset to hold out before oversampling; default: 0
    :return: list of selected training variables
    """
```
Updated: 19 May 2020
Full credit to Vishal R for his Medium post "Feature selection — Correlation and P-value" and his backward elimination code, which I adapted.
Assuming a large number of training variables, backward selection can be used to prune the number of variables for model training.
Refer to backward_elim_binary.py for a function that does backward selection.
```python
def backward_elimination(x, Y, sl, columns):
    """
    :param x: numpy array of training variables
    :param Y: numpy array of the response variable
    :param sl: significance level as a float
    :param columns: list of column names in the same horizontal order as x
    :return: numpy array of the selected x AND list of selected training variables passing the significance level
    """
```