Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What does max_cat_threshold actually control? #10844

Open
oliviermeslin opened this issue Sep 24, 2024 · 3 comments
Open

What does max_cat_threshold actually control? #10844

oliviermeslin opened this issue Sep 24, 2024 · 3 comments

Comments

@oliviermeslin
Copy link

TL;DR: I'd like to know what exactly max_cat_threshold controls and I may suggest marginal improvements of the documentation.


I'm quite interested in XGBoost's support for categorical features. I dived into the documentation, but can't understand the exact effect of max_cat_threshold. By reading the C++ code (here), I understand that it is used to determine the begin oand end points of the double scan of the sorted histogram. Here is an example:

Case with max_cat_threshold = 1

In this case all partitions are considered.

     | Forward  | Backward |
     |----------+----------|
     | [BCDEF] A | F [ABCDE] |
     | [CDEF] AB | EF [ABCD] |
     | [DEF] ABC | DEF [ABC] |
     | [EF] ABCD | CDEF [AB] |
     | [F] ABCDE | BCDEF [A] |

Case with max_cat_threshold = 2

In this case only partitions with 2+ categories are considered.

      | Forward  | Backward |
      |----------+----------|
      | [CDEF] AB | EF [ABCD] |
      | [DEF] ABC | DEF [ABC] |
      | [EF] ABCD | CDEF [AB] |

Is this the way max_cat_threshold? If yes, I might open a PR to add a paragraph here. Does it sound like a good idea?

@trivialfis
Copy link
Member

trivialfis commented Sep 25, 2024

Hi, apologies for the ambiguity there. I wrote the description in https://xgboost.readthedocs.io/en/latest/parameter.html#cat-param

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting.

It means the split enumeration would stop once it hit the max_cat_threshold, MAXimum number of CATegories considered for each split.

Case with max_cat_threshold = 1
In this case all partitions are considered.

No, the split enumeration would consider 1 partition for backward and forward enumeration. So:

| [BCDEF] A | F [ABCDE] |

then stop. So, only one category is considered for each scan direction. As shown in the definition of iteration end:

it_end = it_begin + n_bins - 1;

@oliviermeslin
Copy link
Author

oliviermeslin commented Oct 4, 2024

@trivialfis Thanks a lot for your detailed answer. I understand better what this hyperparameter controls. This leads me to another question: I do not understand how this hyperparameter could help prevent overfitting. I first explain my reasoning, then I make a suggestion.

Does max_cat_threshold really prevent overfitting?

I have two questions regarding overfitting when Fisher 1958 method is used.

First question: are groups with outliers concentrated at both extremities of the sorted ABCDEF list?

In the documentation of categorical feature support, I read the following sentence: "During split finding, we first sort the gradient histogram to prepare the contiguous partitions then enumerate the splits according to these sorted values." If I understand correctly, this means that groups are first sorted from the lowest gradient (or lowest leaf value $w_i$) to the highest gradient (or highest leaf value).

If this is the case, then groups containing outliers and wrong instances are likely to be located among the very first or very last of this sorted list. Imagine that in the training dataset group A is not different from the average, but contains by chance a few massively wrong instances (for example: the target of a regression problem has an additional 0 for a few instances, so the gradient for these instances is enormous). Then group A is very likely to be the first or last group in the ABCDEF sorted list. Am I wrong?

Second question: why would max_cat_threshold prevent overfitting?

  • if the value of max_cat_threshold is high enough, all partitions are considered => this is a risk of overfitting, right?
  • if the value of max_cat_threshold is low (say: 1), then only highly unbalanced, one-versus-all partitions are considered, like this: | [BCDEF] A | F [ABCDE] |. So there is a high probability that a pointless [BCDEF] A split is selected, leading to overfitting (because the high gradients in group A are noise, not signal).

If I'm right, then no value of max_cat_threshold would prevent overfitting, because max_cat_threshold is either non-binding or prone to isolate noisy groups. Am I wrong?

Would a min_cat_threshold be a better choice?

If I'm not mistaken, then we could replace max_cat_threshold by a min_cat_threshold, the MINimum number of CATegories necessary to create a partition. For instance, with min_cat_threshold = 2, we would rule out one-versus-all splits ([BCDEF] A) and force potential splits to be more balanced:

      | Forward  | Backward |
      |----------+----------|
      | [CDEF] AB | EF [ABCD] |
      | [DEF] ABC | DEF [ABC] |
      | [EF] ABCD | CDEF [AB] |

We could also think of defining min_share_cat, the MINimum share of CATegories necessary to create a partition. For instance, with min_share_cat = 0.2, only partitions more balanced than 80%/20% would be considered, avoiding highly unbalanced splits.

Would that make sense?

@trivialfis
Copy link
Member

trivialfis commented Oct 4, 2024

To answer the first question, it's sorted by the output leaf value as suggested in the document. We want to group categories that outputs similar leaf values. also see code here

auto ret = evaluator.CalcWeightCat(*param_, feat_hist[l]) <

That's an interesting suggestion. There are other choices as well like smoothing. We haven't been able to look into them due to other work at hand and the reluctance to burden users with more difficult to tune parameters. If you are interested in these parameters, please feel free to experiment, I will answer questions and provide assistance as much as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants