What does max_cat_threshold actually control? #10844

oliviermeslin · 2024-09-24T11:43:37Z

TL;DR: I'd like to know what exactly max_cat_threshold controls and I may suggest marginal improvements of the documentation.

I'm quite interested in XGBoost's support for categorical features. I dived into the documentation, but can't understand the exact effect of max_cat_threshold. By reading the C++ code (here), I understand that it is used to determine the begin oand end points of the double scan of the sorted histogram. Here is an example:

Case with max_cat_threshold = 1

In this case all partitions are considered.

     | Forward  | Backward |
     |----------+----------|
     | [BCDEF] A | F [ABCDE] |
     | [CDEF] AB | EF [ABCD] |
     | [DEF] ABC | DEF [ABC] |
     | [EF] ABCD | CDEF [AB] |
     | [F] ABCDE | BCDEF [A] |

Case with max_cat_threshold = 2

In this case only partitions with 2+ categories are considered.

      | Forward  | Backward |
      |----------+----------|
      | [CDEF] AB | EF [ABCD] |
      | [DEF] ABC | DEF [ABC] |
      | [EF] ABCD | CDEF [AB] |

Is this the way max_cat_threshold? If yes, I might open a PR to add a paragraph here. Does it sound like a good idea?

The text was updated successfully, but these errors were encountered:

trivialfis · 2024-09-25T09:57:59Z

Hi, apologies for the ambiguity there. I wrote the description in https://xgboost.readthedocs.io/en/latest/parameter.html#cat-param

Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting.

It means the split enumeration would stop once it hit the max_cat_threshold, MAXimum number of CATegories considered for each split.

Case with max_cat_threshold = 1
In this case all partitions are considered.

No, the split enumeration would consider 1 partition for backward and forward enumeration. So:

| [BCDEF] A | F [ABCDE] |

then stop. So, only one category is considered for each scan direction. As shown in the definition of iteration end:

xgboost/src/tree/hist/evaluate_splits.h

Line 213 in f3df0d0

it_end = it_begin + n_bins - 1;

oliviermeslin · 2024-10-04T10:26:47Z

@trivialfis Thanks a lot for your detailed answer. I understand better what this hyperparameter controls. This leads me to another question: I do not understand how this hyperparameter could help prevent overfitting. I first explain my reasoning, then I make a suggestion.

Does `max_cat_threshold` really prevent overfitting?

I have two questions regarding overfitting when Fisher 1958 method is used.

First question: are groups with outliers concentrated at both extremities of the sorted `ABCDEF` list?

In the documentation of categorical feature support, I read the following sentence: "During split finding, we first sort the gradient histogram to prepare the contiguous partitions then enumerate the splits according to these sorted values." If I understand correctly, this means that groups are first sorted from the lowest gradient (or lowest leaf value $w_i$) to the highest gradient (or highest leaf value).

If this is the case, then groups containing outliers and wrong instances are likely to be located among the very first or very last of this sorted list. Imagine that in the training dataset group A is not different from the average, but contains by chance a few massively wrong instances (for example: the target of a regression problem has an additional 0 for a few instances, so the gradient for these instances is enormous). Then group A is very likely to be the first or last group in the ABCDEF sorted list. Am I wrong?

Second question: why would `max_cat_threshold` prevent overfitting?

if the value of max_cat_threshold is high enough, all partitions are considered => this is a risk of overfitting, right?
if the value of max_cat_threshold is low (say: 1), then only highly unbalanced, one-versus-all partitions are considered, like this: | [BCDEF] A | F [ABCDE] |. So there is a high probability that a pointless [BCDEF] A split is selected, leading to overfitting (because the high gradients in group A are noise, not signal).

If I'm right, then no value of max_cat_threshold would prevent overfitting, because max_cat_threshold is either non-binding or prone to isolate noisy groups. Am I wrong?

Would a `min_cat_threshold` be a better choice?

If I'm not mistaken, then we could replace max_cat_threshold by a min_cat_threshold, the MINimum number of CATegories necessary to create a partition. For instance, with min_cat_threshold = 2, we would rule out one-versus-all splits ([BCDEF] A) and force potential splits to be more balanced:

      | Forward  | Backward |
      |----------+----------|
      | [CDEF] AB | EF [ABCD] |
      | [DEF] ABC | DEF [ABC] |
      | [EF] ABCD | CDEF [AB] |

We could also think of defining min_share_cat, the MINimum share of CATegories necessary to create a partition. For instance, with min_share_cat = 0.2, only partitions more balanced than 80%/20% would be considered, avoiding highly unbalanced splits.

Would that make sense?

trivialfis · 2024-10-04T10:35:43Z

To answer the first question, it's sorted by the output leaf value as suggested in the document. We want to group categories that outputs similar leaf values. also see code here

xgboost/src/tree/hist/evaluate_splits.h

Line 380 in f3df0d0

auto ret = evaluator.CalcWeightCat(*param_, feat_hist[l]) <

That's an interesting suggestion. There are other choices as well like smoothing. We haven't been able to look into them due to other work at hand and the reluctance to burden users with more difficult to tune parameters. If you are interested in these parameters, please feel free to experiment, I will answer questions and provide assistance as much as possible.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What does max_cat_threshold actually control? #10844

What does max_cat_threshold actually control? #10844

oliviermeslin commented Sep 24, 2024

trivialfis commented Sep 25, 2024 •

edited

Loading

oliviermeslin commented Oct 4, 2024 •

edited

Loading

trivialfis commented Oct 4, 2024 •

edited

Loading

What does max_cat_threshold actually control? #10844

What does max_cat_threshold actually control? #10844

Comments

oliviermeslin commented Sep 24, 2024

trivialfis commented Sep 25, 2024 • edited Loading

oliviermeslin commented Oct 4, 2024 • edited Loading

Does max_cat_threshold really prevent overfitting?

First question: are groups with outliers concentrated at both extremities of the sorted ABCDEF list?

Second question: why would max_cat_threshold prevent overfitting?

Would a min_cat_threshold be a better choice?

trivialfis commented Oct 4, 2024 • edited Loading

trivialfis commented Sep 25, 2024 •

edited

Loading

oliviermeslin commented Oct 4, 2024 •

edited

Loading

Does `max_cat_threshold` really prevent overfitting?

First question: are groups with outliers concentrated at both extremities of the sorted `ABCDEF` list?

Second question: why would `max_cat_threshold` prevent overfitting?

Would a `min_cat_threshold` be a better choice?

trivialfis commented Oct 4, 2024 •

edited

Loading