Cannot create a pydantic model with a pandera.typing.pyspark.DataFrame type. #1446

Open
3 tasks done
brayan07 opened this issue Dec 12, 2023 · 5 comments
Labels
bug Something isn't working

Comments

brayan07 commented Dec 12, 2023

Describe the bug

Pydantic models always raise an is_instance_of validation error when a field is typed as pandera.typing.pyspark.DataFrame, so the pydantic integration with pyspark dataframes is broken.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

import pyspark.sql.types as T

from pandera.pyspark import DataFrameModel, Field
from pandera.typing.pyspark import DataFrame
from pydantic import BaseModel
from pyspark.sql import SparkSession


class SampleSchema(DataFrameModel):
    """
    Sample schema model with data checks.
    """

    product: T.StringType() = Field()
    price: T.IntegerType() = Field()


class PydanticContainer(BaseModel):
    """
    Pydantic container with a DataFrameModel as a field.
    """

    data: DataFrame[SampleSchema]

    class Config:
        arbitrary_types_allowed = True


data = [("Bread", 9), ("Butter", 15)]
schema = (
    T.StructType(
        [
            T.StructField("product", T.StringType()),
            T.StructField("price", T.IntegerType()),
        ],
    )
)

spark = SparkSession.builder.appName("Pandera Pyspark Testing").getOrCreate()
data_df = spark.createDataFrame(data, schema=schema)

# Instantiating the PydanticContainer leads to a ValidationError
my_container = PydanticContainer(data=data_df)

The above leads to the following error:

tests/pyspark/test_scratch.py:38 (test_run)
def test_run():
        spark = SparkSession.builder.appName("Pandera Pyspark Testing").getOrCreate()
        data_df = spark.createDataFrame(data, schema=schema)
>       my_container = PydanticContainer(data=data_df)
E       pydantic_core._pydantic_core.ValidationError: 1 validation error for PydanticContainer
E       data
E         Input should be an instance of DataFrame [type=is_instance_of, input_value=DataFrame[product: string, price: int], input_type=DataFrame]
E           For further information visit https://errors.pydantic.dev/2.5/v/is_instance_of

test_scratch.py:42: ValidationError

Expected behavior

We would expect the PydanticContainer to instantiate successfully. Instead, the error claims that the DataFrame we pass in is not a DataFrame.
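
The wording of the error ("Input should be an instance of DataFrame ... input_type=DataFrame") is at least consistent with two unrelated classes that share the name DataFrame. The sketch below shows only that effect, using plain Python and no pyspark, pandera, or pydantic. SqlDataFrame, TypingDataFrame, and check_instance are hypothetical stand-ins invented for illustration, and whether this is the actual cause of the bug is an assumption that this thread does not confirm.

```python
# Hypothetical stand-ins: neither class is the real pyspark or pandera type.
class SqlDataFrame:
    """Stands in for the object returned by spark.createDataFrame."""


class TypingDataFrame:
    """Stands in for the annotation type used on the model field."""


# Give both classes the same display name, as two different "DataFrame"
# classes living in different modules would have.
SqlDataFrame.__name__ = "DataFrame"
TypingDataFrame.__name__ = "DataFrame"


def check_instance(value, expected):
    """Mimic an isinstance check whose message reports only class names."""
    if isinstance(value, expected):
        return "ok"
    return (
        f"Input should be an instance of {expected.__name__} "
        f"[input_type={type(value).__name__}]"
    )


message = check_instance(SqlDataFrame(), TypingDataFrame)
print(message)
```

Because the message reports only the bare class name, it reads as self-contradictory even though the isinstance check itself is behaving consistently.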

Desktop

  • OS: macOS Ventura 13.5
  • Browser: Chrome
  • Version: 119.0.6045.199

@brayan07 brayan07 added the bug Something isn't working label Dec 12, 2023
@cosmicBboy
Collaborator

The pyspark.sql pandera backend does not currently support pydantic types. The current behavior is designed to only work with pyspark types.

Going to change this to an enhancement ticket; it will need discussion with the de facto code owners for the pyspark.sql integration: @NeerajMalhotra-QB @jaskaransinghsidana.

@cosmicBboy cosmicBboy added enhancement New feature or request and removed bug Something isn't working labels Dec 12, 2023
@cosmicBboy
Collaborator

Ah, okay I misread this issue! You want to use a pandera pyspark.sql schema in your pydantic models, correct? This should actually work, reverting this to a bug.

Open to contributions for this.

@cosmicBboy cosmicBboy added bug Something isn't working and removed enhancement New feature or request labels Dec 12, 2023
@NeerajMalhotra-QB
Collaborator

Just looking at the code above, I suspect the issue is your import, from pandera.typing.pyspark import DataFrame, which might be pointing to pyspark.pandas.DataFrame and not the PySpark SQL DataFrame. I haven't dug into this, but that appears to be the issue to me.
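
One way to check a suspicion like this is to print the class lineage of whatever the import resolves to. The helper below is a minimal sketch: describe_type is a hypothetical name, not part of pandera or pyspark, and it is demonstrated on a standard-library class so that it runs without either package installed.

```python
from collections import OrderedDict


def describe_type(cls):
    """Return the fully qualified name of every class in cls's MRO."""
    return [f"{base.__module__}.{base.__qualname__}" for base in cls.__mro__]


# Demonstrated on a standard-library class so the snippet runs anywhere.
# To test the suspicion above, pass the imported (unsubscripted) DataFrame
# class from the issue's code sample instead of OrderedDict.
lineage = describe_type(OrderedDict)
print(lineage)
```

If the lineage of the imported class turned out to include a pyspark.pandas class and no pyspark.sql class, that would support the theory; the thread itself does not report the result of such a check.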

@brayan07
Author

brayan07 commented Dec 13, 2023

I get the same error with both:
from pandera.typing.pyspark import DataFrame
and
from pandera.typing.pyspark_sql import DataFrame

I have a fix working locally and will submit a PR for this in the next couple of days.

@brayan07
Author

Submitted a bugfix in #1447 for review.
