Final project in 'Tabular Data Science' course by Dr. Amit Somech at Bar-Ilan University.

DorinK/Association-Rules-for-Concept-Drifting

Association-Rules-for-Concept-Drifting

Final project in 'Tabular Data Science' course by Dr. Amit Somech at Bar-Ilan University.

Example notebooks

There are two notebooks in the notebooks/ directory.

Notebook #1 - EDA Example

The notebook eda_example.ipynb is an example of how to use our ConceptDriftsFinder tool as part of the EDA process.

Notebook #2 - Automatic feature engineering

The notebook automatic_feature_engineeirng.ipynb applies a simple heuristic that automatically uses our ConceptDriftsFinder tool for feature engineering. We test it on 4 different datasets, described below and in the accompanying PDF.

ConceptDriftsFinder

During Exploratory Data Analysis (EDA), ConceptDriftsFinder can be used to automatically find concept drifts. The find_concept_drifts function receives a list of transactions and returns a list of ConceptDriftResult objects.

Example usage

Suppose we suspect that the column OverallQual causes a concept drift. We can run:

ConceptDriftsFinder().find_concept_drifts(transactions, concept_column="OverallQual", target_column="SalePrice")

Here is a sample of the output in a table format:

| | left_hand_side | right_hand_side | confidence_before | confidence_after | support_before | support_after | lift_before | lift_after | concept_cutoff | concept_column |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'BldgType': '1Fam'} | {'SalePrice': 1} | 1 | 0.185229 | 1 | 0.155359 | 1 | 0.941887 | 2.8 | OverallQual |
| 1 | {'FullBath': 1} | {'SalePrice': 1} | 1 | 0.374718 | 0.6 | 0.163225 | 1 | 1.90544 | 2.8 | OverallQual |
| 2 | {'GrLivArea': 1} | {'SalePrice': 1} | 1 | 0.53 | 1 | 0.104228 | 1 | 2.69505 | 2.8 | OverallQual |
| 3 | {'YearBuilt': 1} | {'SalePrice': 1} | 1 | 0.509615 | 0.8 | 0.104228 | 1 | 2.59139 | 2.8 | OverallQual |

We can see that when OverallQual<2.8, BldgType=1Fam becomes a more important indicator of SalePrice. Findings like this help us better understand our dataset.
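For readers less familiar with the metric columns, the sketch below computes support, confidence and lift for a single rule on each side of a cutoff. It is only an illustration on toy data: representing a transaction as a dict of column to value, and the helper names rule_metrics and metrics_around_cutoff, are assumptions and not the library's internals.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) of the rule lhs -> rhs.

    Each transaction is a dict mapping column name to value; lhs and rhs
    are dicts of the same form. Returns None when a metric is undefined.
    """
    def matches(row, items):
        return all(row.get(col) == val for col, val in items.items())

    n = len(transactions)
    n_lhs = sum(matches(t, lhs) for t in transactions)
    n_rhs = sum(matches(t, rhs) for t in transactions)
    n_both = sum(matches(t, lhs) and matches(t, rhs) for t in transactions)
    if n == 0 or n_lhs == 0 or n_rhs == 0:
        return None
    support = n_both / n                # P(lhs and rhs)
    confidence = n_both / n_lhs         # P(rhs | lhs)
    lift = confidence / (n_rhs / n)     # P(rhs | lhs) / P(rhs)
    return support, confidence, lift


def metrics_around_cutoff(transactions, lhs, rhs, concept_column, cutoff):
    """Compute the rule metrics separately below and at/above the cutoff."""
    below = [t for t in transactions if t[concept_column] < cutoff]
    above = [t for t in transactions if t[concept_column] >= cutoff]
    return rule_metrics(below, lhs, rhs), rule_metrics(above, lhs, rhs)


# Toy data, invented for this example only.
toy = [
    {"OverallQual": 2, "BldgType": "1Fam", "SalePrice": 1},
    {"OverallQual": 2, "BldgType": "1Fam", "SalePrice": 1},
    {"OverallQual": 2, "BldgType": "Duplex", "SalePrice": 0},
    {"OverallQual": 4, "BldgType": "1Fam", "SalePrice": 0},
    {"OverallQual": 4, "BldgType": "1Fam", "SalePrice": 1},
    {"OverallQual": 4, "BldgType": "Duplex", "SalePrice": 1},
]
below, above = metrics_around_cutoff(
    toy, {"BldgType": "1Fam"}, {"SalePrice": 1}, "OverallQual", 2.8
)
```

On this toy data the lift of the rule is 1.5 below the cutoff and 0.75 at or above it, which is the kind of gap a drift search would flag.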

Notes

Your data must be preprocessed before you can use ConceptDriftsFinder; see the Preprocessing section below.

The number of drifts found can be controlled with the following parameters: min_confidence: float, min_support: float, diff_threshold: float (a threshold on the difference between lift values).

A pandas.DataFrame object can be converted to transactions using the helper function convert_df_to_transactions.

Preprocessing

Association rules can't work with numerical values directly, only with categorical or ordinal ones. preprocessing.py contains code to convert numerical values to ordinals, as well as data-cleaning helpers such as filling N/A values with the column average or dropping rows with N/A values entirely.
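As a rough illustration of these two steps, the sketch below fills missing values with the mean and then maps numbers to ordinal bins using quantile edges. It uses only the Python standard library; the function names, the choice of quantile binning and the number of bins are assumptions for this example and may differ from what preprocessing.py actually does.

```python
import bisect
import statistics


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def to_ordinal(values, n_bins=4):
    """Map numeric values to ordinal bin indices 0..n_bins-1 using quantile edges."""
    edges = statistics.quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return [bisect.bisect_right(edges, v) for v in values]


# Toy column with one missing entry, invented for this example.
gr_liv_area = [850, 1200, None, 1710, 2100, 960, 1500, 3000]
filled = fill_missing_with_mean(gr_liv_area)
ordinal = to_ordinal(filled, n_bins=4)
```

After this step every value is a small integer bin index, so it can be treated as an item in a transaction.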

CutoffValuesFinder

This is an internal class and shouldn't be used directly.

The CutoffValuesFinder classifies each concept column as discrete or continuous and, based on that, decides which cutoff values ConceptDriftsFinder should try.
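One way such a decision could work is sketched below. This is a guess at the general idea, not the actual CutoffValuesFinder logic: the function name candidate_cutoffs, the max_discrete_values threshold, and the use of midpoints and quantiles are all assumptions made for illustration.

```python
import statistics


def candidate_cutoffs(values, max_discrete_values=10, n_quantiles=5):
    """Classify a concept column and propose cutoff values to try.

    Returns a pair (kind, cutoffs) where kind is "discrete" or "continuous".
    """
    distinct = sorted(set(values))
    if len(distinct) <= max_discrete_values:
        # Discrete: try a cutoff between each pair of adjacent distinct values.
        cutoffs = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
        return "discrete", cutoffs
    # Continuous: try a handful of quantiles instead of every observed value.
    return "continuous", statistics.quantiles(values, n=n_quantiles)


kind_small, cutoffs_small = candidate_cutoffs([1, 2, 3, 2, 1])
kind_large, cutoffs_large = candidate_cutoffs(list(range(100)))
```

Limiting continuous columns to a few quantiles keeps the number of before/after comparisons small, at the cost of possibly missing a drift that starts between two candidate cutoffs.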

ConceptEngineering

To easily start using this library for feature engineering for machine learning models, we created ConceptEngineering. The idea is to automatically take the found concept drifts into account by changing the weights of the features in the dataset.

Example usage

Let's look at the following row from our dataset (shortened for readability):

Id 1101
YearBuilt 1
FullBath 1
OverallQual 2
BldgType_1Fam 1
BldgType_2fmCon 0
BldgType_Duplex 0

Continuing our example above, we know that when OverallQual<2.8, BldgType=1Fam becomes a more important indicator of SalePrice=1.
We can help the model use this information by changing the weight of BldgType=1Fam whenever OverallQual<2.8. This can be especially helpful for models such as LogisticRegression, which learn a single weight per feature.

We can run:

# Find association rules
df_prep, train_params = preprocess_dataset(df)
X, y = split_X_y(df_prep, columns, train_params, one_hot_columns, target_column)
concept_engineering = ConceptEngineering()
X = concept_engineering.fit_transform(X, df, target_column, one_hot_columns)

Now our new X dataframe will have adjusted weights based on all of the found rules, for example:

Id 1101
YearBuilt 1.37681
FullBath 1.2922
OverallQual 2
BldgType_1Fam 0.94262
BldgType_2fmCon 0
BldgType_Duplex 0

To calculate the change to the value, we use the difference between the lift values.
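The exact formula is not spelled out here, so the following is only a hypothetical sketch of how a lift difference could be turned into a feature weight. The function name reweight_feature, the scale parameter, the multiplicative factor 1 + scale * (lift_after - lift_before), and the choice to apply it when the concept value is below the cutoff are all assumptions; the sketch is not expected to reproduce the numbers shown above.

```python
def reweight_feature(row, feature, concept_column, cutoff,
                     lift_before, lift_after, scale=1.0):
    """Return a copy of row with `feature` rescaled when the drift condition holds.

    Assumption: the condition is row[concept_column] < cutoff and the factor is
    1 + scale * (lift_after - lift_before).
    """
    new_row = dict(row)
    if row[concept_column] < cutoff:
        factor = 1.0 + scale * (lift_after - lift_before)
        new_row[feature] = row[feature] * factor
    return new_row


row = {"Id": 1101, "OverallQual": 2, "BldgType_1Fam": 1}
reweighted = reweight_feature(
    row, "BldgType_1Fam", "OverallQual", cutoff=2.8,
    lift_before=1.0, lift_after=0.941887,
)
```

Rows that do not satisfy the concept condition are returned unchanged, so the adjustment only affects the part of the data where the drift was found.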

OneVsRestClassifier

This is an internal class and shouldn't be used directly.

This is a simple one-vs-rest (ovr) wrapper around scikit-learn's LogisticRegression, i.e., one model per label. The only difference from using multi_class="ovr" is that we can inject a different dataset transformation for each label's classifier (called label_to_transformation).

This is necessary, because concept rules are per label, and we want to activate the label rules only when classifying the relevant label.
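The idea can be sketched without any dependencies as follows. Everything here is illustrative: PerLabelOneVsRest, make_model, score_rows and the toy CentroidScorer are hypothetical names, and the toy scorer merely stands in for a LogisticRegression model, where the per-label score would instead be the predicted probability of the positive class.

```python
class CentroidScorer:
    """Toy binary model: scores rows by closeness to the positive-class centroid."""

    def fit(self, X, y):
        positives = [row for row, target in zip(X, y) if target == 1]
        self.centroid_ = [sum(col) / len(positives) for col in zip(*positives)]
        return self

    def score_rows(self, X):
        # Higher (less negative) means closer to the positive centroid.
        return [-sum((a - b) ** 2 for a, b in zip(row, self.centroid_)) for row in X]


class PerLabelOneVsRest:
    """One binary model per label, each trained on its own view of the data."""

    def __init__(self, make_model, label_to_transformation=None):
        self.make_model = make_model  # callable returning an unfitted binary model
        self.label_to_transformation = label_to_transformation or {}
        self.models_ = {}

    def _view(self, label, X):
        transform = self.label_to_transformation.get(label)
        return transform(X) if transform else X

    def fit(self, X, y):
        for label in sorted(set(y)):
            binary_y = [1 if value == label else 0 for value in y]
            self.models_[label] = self.make_model().fit(self._view(label, X), binary_y)
        return self

    def predict(self, X):
        labels = list(self.models_)
        # Score every row with every label's model, using that label's view.
        scores = [self.models_[lab].score_rows(self._view(lab, X)) for lab in labels]
        return [labels[max(range(len(labels)), key=lambda i: scores[i][r])]
                for r in range(len(X))]


X_train = [[0, 0], [0, 1], [5, 5], [5, 6]]
y_train = ["a", "a", "b", "b"]
# Only the classifier for label "b" sees the second feature doubled.
double_second = lambda X: [[row[0], row[1] * 2] for row in X]
clf = PerLabelOneVsRest(CentroidScorer, {"b": double_second}).fit(X_train, y_train)
predictions = clf.predict([[0, 0.5], [5, 5.5]])
```

Here predictions is ["a", "b"]. The same transformation is applied at fit and predict time for a given label, so each label's model always sees data shaped by its own rules.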

Datasets

To test our library, we used 4 datasets:
