Does this method actually remove redundancy? #48

Open
AllardJM opened this issue May 31, 2024 · 1 comment

Comments


AllardJM commented May 31, 2024

First, great library and related blog posts. I was beginning to code this procedure myself and then stumbled upon your work. Here is my question / concern. I am working with data likely akin to Uber's marketing data (a mix of continuous and dummy-coded variables, some highly predictive features, some irrelevant ones, and correlation between engineered features). If I look at the complete list of features and count how many have an absolute correlation above 0.6 with at least one other feature, there are many. After feature selection I see proportionally more correlation, not less. The issue seems to be that the F-statistic can be very large for some correlated features, and the redundancy denominator can't dampen it enough.
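For context, if I understand the library's default scoring correctly (the FCQ scheme described in the blog posts), each remaining candidate feature $f$ is ranked at every step by roughly

$$\text{score}(f) = \frac{F(f, y)}{\frac{1}{|S|} \sum_{s \in S} \lvert \rho(f, s) \rvert}$$

where $F(f, y)$ is the ANOVA F-statistic of $f$ against the target and $S$ is the set of already selected features. Since the denominator is bounded above by 1, an F-statistic in the hundreds or thousands can dominate the ratio even when $f$ is almost perfectly correlated with something already selected.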

Here is an example adapted from your quick start (with a few changes):

import pandas as pd
import matplotlib.pyplot as plt
from mrmr import mrmr_classif
from sklearn.datasets import make_classification

# create some data
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=40)
X = pd.DataFrame(X)
y = pd.Series(y)

# absolute pairwise correlation matrix, floored at a small value to avoid zeros
corr_X = X.corr().abs().clip(0.00001)

# for each feature, count how many OTHER features it is correlated with above the
# threshold (the -1 drops the feature's self-correlation of 1.0)
threshold_corr = 0.6
pdf_feature_cnt_corr = corr_X.apply(lambda x: sum(x > threshold_corr) - 1, axis=1)
ax = pdf_feature_cnt_corr.value_counts().sort_index().plot(kind='bar')
ax.set_xlabel(f'Number of Features Correlated With Feature\n Above {threshold_corr}')
ax.set_ylabel('Number of Features')
ax.bar_label(ax.containers[0])
plt.show()

[bar chart: distribution over all 100 features of how many other features each is correlated with above 0.6]

# run mRMR feature selection, then repeat the same count on just the selected features
selected_features = mrmr_classif(X, y, K=10)
threshold_corr = 0.6
pdf_feature_cnt_corr = corr_X.loc[selected_features, selected_features].apply(lambda x: sum(x > threshold_corr) - 1, axis=1)
ax = pdf_feature_cnt_corr.value_counts().sort_index().plot(kind='bar')
ax.set_xlabel(f'Number of Features Correlated With Feature\n Above {threshold_corr}')
ax.set_ylabel('Number of Features')
ax.bar_label(ax.containers[0])
plt.show()

[bar chart: the same distribution over just the 10 selected features]

It seems to me we end up with far fewer features, but the ones that remain are strongly correlated with one another: as a proportion of the candidate features for the model, more of them are correlated than before selection.

@erinMahoney

Hello, we ran into this issue as well. Our solution was to transform the denominator (leveraging the redundancy parameter) using something like $\frac{1}{[1 - \mathrm{abs}(\text{correlation})]^4}$, so that highly correlated values (say, correlation > 0.95) are severely penalized. We also took the square root of the F-statistic.
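In case it helps anyone, here is a minimal standalone sketch of a greedy mRMR loop with that modified criterion. It does not use the library's own relevance/redundancy hooks; mrmr_custom is a made-up name for this example, and the exponent of 4 is just the illustrative choice from above.

import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

def mrmr_custom(X: pd.DataFrame, y: pd.Series, K: int) -> list:
    # sqrt of the F-statistic dampens extremely relevant (and often correlated) features
    relevance = np.sqrt(pd.Series(f_classif(X, y)[0], index=X.columns))
    # keep correlations strictly below 1 so the penalty below never divides by zero
    corr = X.corr().abs().clip(lower=1e-5, upper=1 - 1e-5)
    selected, candidates = [], list(X.columns)
    for _ in range(K):
        if selected:
            # penalty explodes as a candidate's correlation with any selected feature nears 1
            redundancy = (1.0 / (1.0 - corr.loc[candidates, selected]) ** 4).mean(axis=1)
        else:
            redundancy = pd.Series(1.0, index=candidates)  # first pick: relevance only
        best = (relevance[candidates] / redundancy).idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected

With this penalty, a pairwise correlation of 0.95 already inflates the denominator by a factor of $1/0.05^4 = 160{,}000$, so near-duplicates of an already selected feature are effectively ruled out.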
