MAINT add parameter validation using BaseEstimator #958

glemaitre · 2024-06-17T21:54:53Z

This bring parameter validation that is included in the supported scikit-learn version.

TheooJ · 2024-06-17T22:53:44Z

Should we add this for all BaseEstimators ?

glemaitre · 2024-06-18T08:58:37Z

Yep, we could basically have this everywhere. I just wanted to make a test.
But apparently, we added the parameter validation in scikit-learn 1.3 and thus the tests are failing (the minimum version is 1.2.1).

So if we want to support, we would need to vendor a file like in imbalanced-learn: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/utils/_param_validation.py
Additionally, we would need to create a mixin.

I don't think this is worth the development time right now: we might bump the minimal version of scikit-learn and then it would be usable. So we can let this PR on the side for the moment.

Here are two future advantages that I foresee:

Having informative and consistent error that are complementary to the documentation
Making sure that the documentation is aligned with the code. Then, it would ensure effort that create automatic stubs (https://github.com/scientific-python/docstub) will provide the proper types. On the UI/UX side, I think this would be an interesting thing to have.

But let's postpone.

jeromedockes · 2024-06-18T10:08:39Z

relevant discussion in scikit-learn: scikit-learn/scikit-learn#22722

I would say for now I am +0 on this, I suggest we discuss it in one of the skrub meetings.

some potential drawbacks IMHO:

this looks a lot like type annotations but that are not recognized by mypy or other tools (I realize this pr is about runtime checks but those could inspect annotations instead of custom attributes). this means we do the work of adding annotations but don't get the benefit of static checking, improved autocompletion in editors etc. also, contributors have to learn the scikit-learn (private) constraints instead of standard typing utilities which they may already know, eg StrOptions instead of Literal
I'm not sure it is really necessary to systematically check the types of all parameters at runtime and I don't think it is super common to do this in Python code. maybe we could allow specifying constraints only for some parameters and the others would be understood to have no constraints/no runtime checks
quite a few of the user errors we check (and I would say they are the most likely ones) cannot be covered by the constraints with the error messages we want AFAICT, eg checking that joining columns are found in the dataframe (which involves 2 parameters not just 1), or that the same kind of dataframe with the same columns is used at fit and test time. Also some checks would require using a callable as the constraint eg checking the contents of a dataframe column, in which case the benefit of the declarative way of specifying constraints is much reduced
adding the constraints means wrapping the methods that use them which causes slightly more complex and less helpful tracebacks
it would require using some private scikit-learn objects such as _fit_context and _param_validation
as you say there is the question of development time / priority choices

None of these are major issues and they may well be small drawbacks compared to the advantages listed in the scikit-learn issue 22722

jeromedockes · 2024-06-18T10:27:58Z

skrub/_agg_joiner.py

@@ -67,7 +67,7 @@ class AggJoiner(TransformerMixin, BaseEstimator):
        The placeholder string "X" can be provided to perform
        self-aggregation on the input data.

-    key : str, default=None
+    key : str or iterable of str, default=None


as a side note, we use "iterable" everywhere but I wonder if we should say "sequence" (or "list"?) because it is more understandable for users who are less familiar with the python/computer programming jargon. also it is arguably a bit more accurate because we iterate over these parameters several times and sometimes index them so some iterables would not be appropriate

+1 to discuss this
I had the same question on the Joiner PR

For me, this should be a list (and you can accept loosely tuple) but this is more friendly than stating sequence that is only meaningful for Python developer.

I'm fine with "list" -- and we already use that term in a bunch of places

glemaitre · 2024-06-18T10:55:44Z

None of these are major issues and they may well be small drawbacks compared to the advantages listed in the scikit-learn issue 22722

They are valid point. I would say that this feature is supposed to become part of the developer API at some point.

Regarding the runtime checks, I think this is just a way to have consistent checks regarding the parameter instead of delegating to the dev that create the class. For the user, you get the benefit of a nice error message. I know that SciPy was looking at a similar checking system.

To me the real benefit is the next step regarding consistent documentation and stubs.

But as I said, this is really not a priority right now considering the effort of development. At least now, @TheooJ and @jeromedockes are aware that this is existing and it would cost defining a dictionary ;)

jeromedockes · 2024-06-18T11:12:52Z

At least now, @TheooJ and @jeromedockes are aware that this is existing and it would cost defining a dictionary ;)

indeed, thanks :)
also, this doesn't have to be all-or-nothing: we could start using it for estimators for which it would replace or simplify already existing checks (as the datetimeencoder example in this PR)

(once we drop support for scikit-learn 1.2)

MAINT add parameter validation using BaseEstimator

668c0f8

glemaitre marked this pull request as draft June 18, 2024 08:46

jeromedockes reviewed Jun 18, 2024

View reviewed changes

TheooJ mentioned this pull request Jun 18, 2024

ENH Polars support in Joiner #945

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAINT add parameter validation using BaseEstimator #958

MAINT add parameter validation using BaseEstimator #958

glemaitre commented Jun 17, 2024

TheooJ commented Jun 17, 2024

glemaitre commented Jun 18, 2024

jeromedockes commented Jun 18, 2024

jeromedockes Jun 18, 2024

TheooJ Jun 18, 2024

glemaitre Jun 18, 2024 •

edited

Loading

jeromedockes Jun 18, 2024

glemaitre commented Jun 18, 2024

jeromedockes commented Jun 18, 2024 •

edited

Loading

MAINT add parameter validation using BaseEstimator #958

Are you sure you want to change the base?

MAINT add parameter validation using BaseEstimator #958

Conversation

glemaitre commented Jun 17, 2024

TheooJ commented Jun 17, 2024

glemaitre commented Jun 18, 2024

jeromedockes commented Jun 18, 2024

jeromedockes Jun 18, 2024

Choose a reason for hiding this comment

TheooJ Jun 18, 2024

Choose a reason for hiding this comment

glemaitre Jun 18, 2024 • edited Loading

Choose a reason for hiding this comment

jeromedockes Jun 18, 2024

Choose a reason for hiding this comment

glemaitre commented Jun 18, 2024

jeromedockes commented Jun 18, 2024 • edited Loading

glemaitre Jun 18, 2024 •

edited

Loading

jeromedockes commented Jun 18, 2024 •

edited

Loading