
MLForecast and negative boosted tree predictions #457

Closed
koaning opened this issue Nov 25, 2024 · 5 comments
@koaning

koaning commented Nov 25, 2024

What happened + What you expected to happen

During the probabl livestream last week (YT link here, notebook here), I may have stumbled on a bug, so I figured I should report it.

The short story is that while the input dataset has no negative values, some of the predicted values are negative. For a linear model this could make sense, but for a boosted tree model it does not: tree models, after all, can only interpolate the training data. This became a talking point during this segment of the livestream.
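As an aside, the interpolation property of a single tree is easy to check with a quick sketch (my own illustration using scikit-learn's DecisionTreeRegressor, not code from the thread; as the discussion further down shows, this property does not carry over to boosting):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(seed=0)
X = rng.random((1_000, 4))
y = rng.random(1_000) * 100  # strictly non-negative target

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
preds = tree.predict(X)

# Each leaf predicts the mean of the training targets that fall into it,
# so a single tree's predictions are bounded by the training range.
assert preds.min() >= y.min()
assert preds.max() <= y.max()
```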

Possible cause

After diving a bit deeper, I may have found a good lead on the cause. My dataset has hourly data, but a few timeslots are missing: I am predicting the number of people leaving a subway station, and stations can be closed for a few hours during the day. Those rows simply do not appear in my original dataset. mlforecast did not give me any warnings about this, but when I passed the same dataset to TimeGPT I was prompted to use fill_gaps to make sure there are no missing rows.

When I apply fill_gaps to my data before passing it to MLForecast, the boosted tree model no longer produces negative predictions. This suggests to me that it might be good to throw a similar warning here. I am not completely aware of the Nixtla internals, so I might be missing an important detail, but since silent failures can be painful I figured I should at least write up this report.
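For readers unfamiliar with gap filling: the behaviour can be sketched in plain pandas for a single series (a minimal illustration of what a utility like fill_gaps does, not Nixtla's actual implementation; the ds/y column names follow the Nixtla convention):

```python
import pandas as pd

# Toy hourly series with a missing 03:00 slot (e.g. the station was closed).
df = pd.DataFrame({
    "ds": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 01:00",
        "2024-01-01 02:00", "2024-01-01 04:00",
    ]),
    "y": [10.0, 12.0, 9.0, 11.0],
})

# Reindex onto a complete hourly grid; the missing slot becomes an explicit NaN row
# instead of silently being absent.
full_grid = pd.date_range(df["ds"].min(), df["ds"].max(), freq="h")
filled = df.set_index("ds").reindex(full_grid).rename_axis("ds").reset_index()

assert len(filled) == 5               # the 03:00 row now exists
assert filled["y"].isna().sum() == 1  # its value is an explicit NaN
```

For multiple series in long format you would apply the same reindexing per unique_id; this is essentially what the fill_gaps utility automates.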

Versions / Dependencies

mlforecast version 0.15.0

Reproduction script

I added a notebook link in the above description, as well as a YT link that shows the error. While reproduction could be useful, my current impression is that the main issue here is that a warning message is missing.

I figured I'd file this as a medium severity issue. Silent failures can make the whole stack crumble, but I have technically found a workaround.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@koaning koaning added the bug label Nov 25, 2024
@jmoralez
Member

Hey @koaning, thanks for raising this. Is there a place where I can download the data?

@koaning
Author

koaning commented Nov 26, 2024

The notebook links to this repository. It was originally found on Kaggle.

@jmoralez
Member

Thanks, sorry I missed that. I re-read the issue, and the claim that boosting cannot produce predictions outside the original target range isn't true. It holds for regular decision trees and random forests, whose predictions are averages of training targets, but boosting is an additive algorithm, so it can definitely produce values outside the original range. Here's an example:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(seed=0)
X = rng.random((10_000, 4))  # features unrelated to the target
y = rng.choice([0, 1, 2], size=10_000, replace=True, p=[0.8, 0.1, 0.1])  # non-negative target
model = HistGradientBoostingRegressor().fit(X, y)
preds = model.predict(X)
assert y.min() == 0
assert preds.min() < 0  # additive boosting updates can overshoot below the training minimum

@koaning
Author

koaning commented Nov 26, 2024

d0h! @jmoralez yeah, you're right. Thanks for the example!

It might still be a good idea to warn folks about the fill_gaps utility. But I will leave it up to you to make a new issue for that or to rename this one.

@jmoralez
Member

I'll open a new issue for that. Thanks!
