feature_importance error #29

paulperry · 2019-11-28T13:28:02Z

I'm running into an error and summarized it in this toy example:

X = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
y = pd.DataFrame([0,0,1])
g = pd.Series([1,1,2])
dataset = Dataset(X, y, g, name='dataset')
mse = MSE()

feature_analysis = feature_importance(model=X, dataset=dataset, metric=mse)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-be667221664d> in <module>
      5 mse = MSE()
      6 
----> 7 feature_analysis = feature_importance(model=X, dataset=dataset, metric=mse)

~/anaconda3/lib/python3.7/site-packages/rankeval/analysis/feature.py in feature_importance(model, dataset, metric, normalize)
     63 
     64     if isinstance(metric, RMSE) or isinstance(metric, MSE):
---> 65         feature_imp, feature_count = eff_feature_importance(model, dataset)
     66         if isinstance(metric, RMSE):
     67             feature_imp[0] = np.sqrt(feature_imp[0])

~/anaconda3/lib/python3.7/site-packages/rankeval/analysis/_efficient_feature.pyx in rankeval.analysis._efficient_feature.eff_feature_importance()

TypeError: Cannot convert DataFrame to numpy.ndarray


strani · 2019-12-02T16:25:30Z

The feature_importance analysis takes a model and a dataset and computes the feature importance relative to the given inputs. The model should be an object of the class RTEnsemble, while the dataset should be an object of the class Dataset. Both classes are defined in the rankeval package.

Take a look at the following notebook for a clear picture of how to use this feature analysis tool.

paulperry · 2019-12-05T22:41:09Z

Sorry, my example was incomplete: in my real code I was passing a properly trained model instead of 'X'. I was trying to build a simple example to narrow down where the problem is. Trying again:

from xgboost import XGBClassifier
model = XGBClassifier()

# my original model didn't work, so commenting this out for now
# rankeval_xgb_model = RTEnsemble('xgb.model', name="XGB model", format="XGBoost")

X = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
y = pd.DataFrame([0,0,1])
g = pd.Series([1,1,2])

model.fit(X, y)
model.get_booster().dump_model('dumb.model')

RTmodel = RTEnsemble('dumb.model', name="XGB model", format="XGBoost")

dataset = Dataset(X, y, g, name='dataset')
mse = MSE()

feature_analysis = feature_importance(model=RTmodel, dataset=dataset, metric=mse)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-cb10eb786f27> in <module>
     16 mse = MSE()
     17 
---> 18 feature_analysis = feature_importance(model=RTmodel, dataset=dataset, metric=mse)

~/anaconda3/lib/python3.7/site-packages/rankeval/analysis/feature.py in feature_importance(model, dataset, metric, normalize)
     63 
     64     if isinstance(metric, RMSE) or isinstance(metric, MSE):
---> 65         feature_imp, feature_count = eff_feature_importance(model, dataset)
     66         if isinstance(metric, RMSE):
     67             feature_imp[0] = np.sqrt(feature_imp[0])

~/anaconda3/lib/python3.7/site-packages/rankeval/analysis/_efficient_feature.pyx in rankeval.analysis._efficient_feature.eff_feature_importance()

TypeError: Cannot convert DataFrame to numpy.ndarray

I am able to run the Feature Analysis notebook, so I know my installation is good. I just don't have a good error that I can use to look into what the problem is with my model.

Thank you.

strani · 2019-12-10T14:29:17Z

OK, you just found a bug in the feature analysis implementation. The bug is very subtle and arises because the model you are training is empty. Indeed, a portion of the dump of the XGB model is the following:

booster[0]:
0:leaf=-0.0285714306
booster[1]:
0:leaf=-0.0273494069
booster[2]:
0:leaf=-0.0261842143
booster[3]:
0:leaf=-0.0250727069
booster[4]:
0:leaf=-0.0240119882
booster[5]:
0:leaf=-0.02299935

Each booster (tree) consists of a single leaf, without any split node (i.e., the root is also a leaf). Computing the feature importance over such a model yields no feature contribution, because no split node uses any of the features of the dataset. I should improve the input checks so that this subtle case does not end up crashing the Python interpreter (or the Jupyter notebook). The crash is due to the Cython implementation of the analysis.
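For anyone who wants to check for this case before calling feature_importance, here is a minimal sketch that scans an XGBoost text dump and counts the split nodes per tree. It is not part of rankeval: the helper names `count_split_nodes` and `is_empty_ensemble` are hypothetical, and the sketch assumes the plain-text dump format shown above, where leaves look like `0:leaf=...` and split nodes look like `0:[f2<2.5] yes=1,no=2,missing=1`.

```python
import re

# Header of each tree in an XGBoost text dump, e.g. "booster[0]:".
TREE_HEADER = re.compile(r"^booster\[(\d+)\]:\s*$")
# Split nodes start with "<node id>:[" (e.g. "0:[f2<2.5] yes=1,no=2,missing=1");
# leaves start with "<node id>:leaf=" and therefore do not match.
SPLIT_NODE = re.compile(r"^\s*\d+:\[")


def count_split_nodes(dump_text):
    """Return {tree_index: number_of_split_nodes} for a text dump (hypothetical helper)."""
    counts = {}
    current = None
    for line in dump_text.splitlines():
        header = TREE_HEADER.match(line.strip())
        if header:
            current = int(header.group(1))
            counts[current] = 0
        elif current is not None and SPLIT_NODE.match(line):
            counts[current] += 1
    return counts


def is_empty_ensemble(dump_text):
    """True when no tree in the dump has a split node (also True for an empty dump)."""
    return all(n == 0 for n in count_split_nodes(dump_text).values())


# The first two trees from the dump quoted above: each is a single leaf.
dump = "booster[0]:\n0:leaf=-0.0285714306\nbooster[1]:\n0:leaf=-0.0273494069\n"
print(count_split_nodes(dump))
print(is_empty_ensemble(dump))
```

If `is_empty_ensemble` returns True for the dumped file, the model has no splits and there is nothing for the feature importance analysis to attribute.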

P.S. I should also improve the documentation. The Dataset class (like other classes in rankeval) currently supports only NumPy arrays, not pandas DataFrames or Series. It would probably be better to extend support to pandas data types as well.
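Until pandas is supported, one possible workaround is to convert the inputs to NumPy arrays before building the Dataset. The sketch below is only an illustration: `to_numpy_inputs` is a hypothetical helper, and the dtypes (float32 for features and labels, int32 for query ids) are assumptions, not something rankeval documents in this thread. `np.asarray` accepts pandas DataFrames and Series as well as plain lists.

```python
import numpy as np


def to_numpy_inputs(X, y, query_ids):
    """Convert DataFrame/Series/list inputs to plain NumPy arrays (hypothetical helper).

    The dtypes below are assumptions chosen for illustration.
    """
    X_arr = np.asarray(X, dtype=np.float32)
    # A single-column DataFrame of labels has shape (n, 1); flatten it to (n,).
    y_arr = np.asarray(y, dtype=np.float32).ravel()
    q_arr = np.asarray(query_ids, dtype=np.int32).ravel()

    if X_arr.ndim != 2:
        raise ValueError("X must be 2-dimensional (samples x features)")
    if not (len(X_arr) == len(y_arr) == len(q_arr)):
        raise ValueError("X, y and query_ids must have the same number of rows")
    return X_arr, y_arr, q_arr


# Same toy data as in the example above, given here as plain lists.
X_arr, y_arr, q_arr = to_numpy_inputs(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[0], [0], [1]], [1, 1, 2]
)
print(type(X_arr).__name__, X_arr.shape, y_arr.shape, q_arr.shape)
```

The resulting arrays can then be passed to Dataset(X_arr, y_arr, q_arr, name='dataset') in place of the pandas objects. Whether this alone avoids the TypeError when the model itself is empty is not confirmed in this thread.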

Thank you for discovering the problem. BTW, I assure you that in normal situations (i.e., properly trained, non-empty models) the feature importance analysis works correctly.

I'll re-open the issue and keep it here until documentation and code have been fixed.

paulperry · 2019-12-10T16:04:25Z

I have a properly trained model that is able to output the regular XGB feature importance, but I get the same error reported here with rankeval importance. Let me know if you want me to send you the .model file and dataset. Thank you!

strani · 2019-12-10T17:39:18Z

Sure, send the model to the rankeval mail.
I'll try to take a look at it, even without the dataset used for training the model. It would be perfect if you could send a snapshot of the dataset as well.
Thanks!

strani closed this as completed Dec 2, 2019

strani reopened this Dec 10, 2019

strani added bug enhancement help wanted labels Dec 10, 2019
