What does max_cat_threshold actually control? #10844
Comments
Hi, apologies for the ambiguity there. I wrote the description in https://xgboost.readthedocs.io/en/latest/parameter.html#cat-param
It means the split enumeration would stop once it hit the
No, the split enumeration would consider 1 partition for backward and forward enumeration. So:
then stop. So, only one category is considered for each scan direction. As shown in the definition of iteration end: xgboost/src/tree/hist/evaluate_splits.h Line 213 in f3df0d0
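To make the "one category per scan direction" case concrete, here is a minimal sketch of which positions of the sorted histogram each direction would visit. It is not the XGBoost implementation: the helper name `scan_positions` is hypothetical, and the assumption that each direction visits at most `max_cat_threshold` positions is an extrapolation from the `max_cat_threshold = 1` case described above. The exact loop bounds are the ones in the referenced evaluate_splits.h and may differ.

```python
def scan_positions(n_categories, max_cat_threshold):
    """Return the sorted-histogram positions visited by each scan direction.

    Assumption (not confirmed against the C++ code): each direction visits
    at most `max_cat_threshold` positions, capped by the number of categories.
    """
    n_visit = min(max_cat_threshold, n_categories)
    # Forward scan starts at the first sorted category.
    forward = list(range(0, n_visit))
    # Backward scan starts at the last sorted category.
    backward = list(range(n_categories - 1, n_categories - 1 - n_visit, -1))
    return forward, backward


if __name__ == "__main__":
    # With 5 sorted categories and a threshold of 1, each direction
    # looks at a single position.
    print(scan_positions(5, 1))
    print(scan_positions(5, 2))
```

Under this reading, a larger threshold only lets each direction walk further into the sorted categories; it never changes where a scan starts.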
@trivialfis Thanks a lot for your detailed answer. I now understand better what this hyperparameter controls. This leads me to another question: I do not understand how this hyperparameter could help prevent overfitting. I will first explain my reasoning, then make a suggestion. Does
To answer the first question, it's sorted by the output leaf value as suggested in the document. We want to group categories that output similar leaf values. Also see the code here: xgboost/src/tree/hist/evaluate_splits.h Line 380 in f3df0d0
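As a rough illustration of sorting categories by output leaf value, the sketch below orders category indices by the leaf value each category would get on its own. It assumes the usual second-order leaf weight w = -G / (H + lambda); the function name `sort_categories_by_leaf_value` and the sample gradient sums are hypothetical, and the exact quantity XGBoost sorts on is the one in the referenced file.

```python
def sort_categories_by_leaf_value(grad_sums, hess_sums, reg_lambda=1.0):
    """Order category indices by the leaf value each would get on its own.

    grad_sums[c] and hess_sums[c] are the summed gradients and hessians of
    the rows that fall into category c (one histogram bin per category).
    """
    def leaf_value(cat):
        # Assumed leaf weight: w = -G / (H + lambda).
        return -grad_sums[cat] / (hess_sums[cat] + reg_lambda)

    return sorted(range(len(grad_sums)), key=leaf_value)


if __name__ == "__main__":
    # Hypothetical per-category sums for a feature with four categories.
    grad_sums = [4.0, -2.0, 1.0, -6.0]
    hess_sums = [3.0, 1.0, 1.0, 2.0]
    print(sort_categories_by_leaf_value(grad_sums, hess_sums))
```

Once the categories are in this order, neighbouring positions hold categories with similar leaf values, so a scan from either end groups similar categories on the same side of the split.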
That's an interesting suggestion. There are other choices as well, like smoothing. We haven't been able to look into them due to other work at hand and a reluctance to burden users with more difficult-to-tune parameters. If you are interested in these parameters, please feel free to experiment; I will answer questions and provide assistance as much as possible.
TL;DR: I'd like to know what exactly max_cat_threshold controls, and I may suggest marginal improvements to the documentation.
I'm quite interested in XGBoost's support for categorical features. I dived into the documentation, but can't understand the exact effect of max_cat_threshold. By reading the C++ code (here), I understand that it is used to determine the begin and end points of the double scan of the sorted histogram. Here is an example:
Case with max_cat_threshold = 1
In this case all partitions are considered.
Case with max_cat_threshold = 2
In this case only partitions with 2+ categories are considered.
Is this the way max_cat_threshold works? If yes, I might open a PR to add a paragraph here. Does it sound like a good idea?