The dbldatagen package's DataAnalyzer won't accept a Spark Connect DataFrame (pyspark.sql.connect.dataframe.DataFrame) as input.
Current Behavior
The dbldatagen package's DataAnalyzer does not accept a Spark Connect DataFrame as input and raises an AssertionError with the message "df must be a valid Pyspark dataframe".
The check isinstance(df, DataFrame) returns False for a Spark Connect DataFrame. It still returns False after converting the DataFrame to Pandas and creating a new one with spark.createDataFrame(pdf), because the result is again a Spark Connect DataFrame.
Steps to Reproduce (for bugs)
import dbldatagen as dg
import pandas as pd
from pyspark.sql import DataFrame
dfSource = spark.sql("SELECT * FROM db.schema.table LIMIT 10000")
df = dfSource
analyzer = dg.DataAnalyzer(sparkSession=spark, df=dfSource)
generatedCode = analyzer.scriptDataGeneratorFromData()
print(generatedCode)  # not reached: the calls above raise AssertionError: sourceDf must be a valid Pyspark dataframe
print(isinstance(df, DataFrame)) # Output: False
print(type(df)) # Output: <class 'pyspark.sql.connect.dataframe.DataFrame'> not pyspark.sql.dataframe.DataFrame
pdf = df.toPandas()
df_traditional = spark.createDataFrame(pdf)
print(isinstance(df_traditional, DataFrame)) # Output: False
print(type(df_traditional)) # Output: <class 'pyspark.sql.connect.dataframe.DataFrame'> not pyspark.sql.dataframe.DataFrame
Context
The code uses the dbldatagen package to analyze an existing Spark DataFrame and generate data based on it.
The DataFrame is created through Spark Connect, so it is a pyspark.sql.connect.dataframe.DataFrame rather than a traditional pyspark.sql.dataframe.DataFrame.
The dbldatagen package's DataAnalyzer expects a traditional pyspark.sql.dataframe.DataFrame and raises an AssertionError when given a Spark Connect DataFrame.
Thanks for raising this. We are working on preparing a new release with a number of feature updates and will look to incorporate a fix for this into the new release.
As a short-term workaround, we'll relax this check to a warning.
Your Environment
dbldatagen
version used: 0.3.6