Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nit: __post_init__ vs pydantic @model_validator(mode="after") #204

Closed
TomNicholas opened this issue Jul 30, 2024 · 4 comments · Fixed by #210
Closed

Nit: __post_init__ vs pydantic @model_validator(mode="after") #204

TomNicholas opened this issue Jul 30, 2024 · 4 comments · Fixed by #210

Comments

@TomNicholas
Copy link
Collaborator

In the ZArray class then after #193 we now have multiple types of validation checks, which together look like this:

from pydantic import (
    BaseModel,
    field_validator,
    model_validator,
)

class ZArray(BaseModel):
    ...

    @field_validator("dtype")
    @classmethod
    def validate_dtype(cls, dtype) -> np.dtype:
        # Your custom validation logic here
        # Convert numpy.dtype to a format suitable for Pydantic
        return np.dtype(dtype)

    def __post_init__(self) -> None:
        if len(self.shape) != len(self.chunks):
            raise ValueError(
                "Dimension mismatch between array shape and chunk shape. "
                f"Array shape {self.shape} has ndim={self.shape} but chunk shape {self.chunks} has ndim={len(self.chunks)}"
            )

    @model_validator(mode="after")
    def _check_fill_value(self) -> Self:
        if self.fill_value is None:
            self.fill_value = ZARR_DEFAULT_FILL_VALUE.get(self.dtype, 0.0)
        return self

It seems to me like having both __post_init__ and @model_validator(mode="after") is unnecessarily complicated. I think _check_fill_value should just be rewritten using @field_validator like validate_dtype is.

The check inside __post_init__ can't be done using field_validator as it involves comparing two correctly-instatiated fields against one another. It is not clear to me whether best practice with pydantic is to use __post_init__ or @model_validator(mode="after") - both seem supported, and to do similar things.

@ghidalgo3 is there some subtlety I've missed here or can I clean this up a bit? (I know eventually this should all be replaced by ZarrV3Metadata in #175.)

@ghidalgo3
Copy link
Contributor

_check_fill_value depends on both self.fill_value and self.dtype so it needs to be a @model_validator and not just a @field_validator. As for using __post_init__, I think it can be replaced by another @model_validator function with a better name like _chunk_dimensions_match_data_check.

It seems to me like having both post_init and @model_validator(mode="after") is unnecessarily complicated.

I agree. My interpretation (and I am far from a Python expert) is that __post_init__ is a stepping stone from stdlib.dataclass to a pydantic model. If you're already starting with pydantic, there's no need to have a __post_init__ and instead you should write @model_validators.

However, one very valid argument for __post_init__ + dataclasses is that @model_validator is a pydantic>=2.0 feature and __post_init__ still works in pydantic>=1.10 if you use it dataclasses. A surprising number of libraries are still incompatible with pydantic>=2, and I even bit my own tongue after adding the call to @model_validator because it then forced a dependency on pydantic2 that I wasn't expecting!

If you do consolidate everything around model_validator, it might be a good idea to also put a version constraint in the pyproject.toml for pydantic to let users know they need pydantic>=2.0 (or whenever @model_validator was introduced).

@TomNicholas
Copy link
Collaborator Author

_check_fill_value depends on both self.fill_value and self.dtype so it needs to be a @model_validator and not just a @field_validator.

👍 fair point.

Interesting point about dependencies. I would have thought though, that we actually should be trying to move away from having any pydantic dependency at all... Zarr upstream specifically chose not to have that dependency, and the main place we use it is ZArray, which we're hoping to replace with zarr.V3MetaData. (The other places we used it were in the ChunkManifest, which we stopped needing after switching to numpy arrays as an intermediate format, and in the ChunkEntry, which honestly doesn't do that much as a class.)

As for pydantic v1 vs v2 I think the fact we're planning to get rid of it makes it less of an issue. We also do require relatively recently versions of packages anyway due to our numpy>=2.0 dependency.

Overall 🤷‍♂️ Maybe let's just use __post_init__ on the basis that it's closer to ArrayV3Metadata, which is a dataclass?

@ghidalgo3
Copy link
Contributor

Sounds good to me! Then you'll need to make ZArray a @dataclass and move all the field validation logic into __post_init__.

@TomAugspurger
Copy link
Contributor

+1 to moving away from pydantic and just using dataclasses. I can take that on at some point this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants