[BUG] duckdb read_csv_auto kwarg columns clashes with load_df #376
**Minimal Code To Reproduce**

(See the reconstructed snippet after this section.)

**Describe the bug**

It seems that the `fugue` execution engine kwarg `columns` clashes with the `duckdb` `columns` kwarg when parsing the SQL: `columns` is passed within `params` and not as `columns`.

**Expected behavior**

`fsql` should parse `LOAD ...` so that `engine.load_df` receives the same arguments as when using it directly.

**Environment (please complete the following information):**

Mentioned in fugue-project/tutorials#170
Example adapted from fugue-project/tutorials#178 & ...
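A hedged reconstruction of the missing repro snippet, pieced together from the code quoted in the comments below; the helpers `read_header`, `create_temporary_file`, and the `content` variable come from those quotes and are not defined here:

```python
from fugue import FugueWorkflow, DataFrame
from fugue_sql import fsql

def read_text_file(filepath: str) -> DataFrame:
    headers = read_header(filepath)
    # Per the bug description: Fugue's LOAD parser puts `columns` into
    # params instead of passing it as load_df's `columns` argument.
    return fsql(f"LOAD '{filepath}' (skip=2, columns={headers})")

csv_filepath = create_temporary_file(content, suffix=".csv")
dag = FugueWorkflow()
df = dag.create(read_text_file, params={"filepath": csv_filepath})
df.show()
dag.run(engine="duck")
```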
---

The easiest thing might be to do something like:

```python
def visitFugueLoadTask(self, ctx: fp.FugueLoadTaskContext) -> WorkflowDataFrame:
    data = self.get_dict(ctx, "fmt", "path", "params", "columns")
    __modified_exception__ = self.to_runtime_error(ctx)  # noqa
    params = data.get("params", {})
    try:
        columns = data["columns"]
    except KeyError:
        columns = params.pop("columns", "")
    return self.workflow.load(
        path=data["path"],
        fmt=data.get("fmt", ""),
        columns=columns,
        **params,
    )
```

...though I'm not too familiar with testing this under ...
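To make the fallback logic above concrete, here is a minimal standalone sketch; the `split_columns` helper name is hypothetical and not part of Fugue:

```python
# Hypothetical helper isolating the fallback above: prefer an explicit
# top-level "columns" value, otherwise pop it out of params so the same
# keyword is not forwarded to load() twice.
def split_columns(data: dict) -> tuple:
    params = dict(data.get("params", {}))
    columns = data["columns"] if "columns" in data else params.pop("columns", "")
    return columns, params

# The clashing case from this issue: "columns" arrives inside params.
print(split_columns({"path": "x.csv", "params": {"columns": "a:int,b:str"}}))
# -> ('a:int,b:str', {})
```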
---

Ok, I see there are a couple of issues. Look at this code:

```python
def read_text_file(filepath: str) -> DataFrame:
    headers = read_header(filepath)
    engine = DuckExecutionEngine()
    return engine.load_df(filepath, skip=2, columns=headers)

csv_filepath = create_temporary_file(content, suffix=".csv")
dag = FugueWorkflow()
df = dag.create(read_text_file, params={"filepath": csv_filepath})
df.show()
dag.run(engine="duck")
```

It may work, but it works by luck. The best way is:

```python
def read_text_file(engine: ExecutionEngine, filepath: str) -> DataFrame:
    headers = read_header(filepath)
    return engine.load_df(filepath, skip=2, columns=headers)
```

So you don't instantiate an engine by yourself, and this could also work with different engines. The workflow part stays the same.
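For completeness, `read_header` is never shown in this thread. A minimal sketch of what it might look like, assuming the column names sit on a known comma-separated line of the file (the real helper in the tutorials issue may differ):

```python
# Hypothetical implementation of the read_header helper used in these
# examples: return the column names found on the given line of the file.
def read_header(filepath: str, line_no: int = 0) -> list:
    with open(filepath) as f:
        for i, line in enumerate(f):
            if i == line_no:
                return line.strip().split(",")
    raise ValueError(f"{filepath} has fewer than {line_no + 1} lines")
```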
---

I think he had to instantiate the engine because there was a bit of inconsistent behavior between the Pandas and DuckDB engines when reading multi-header CSVs. In this issue in the tutorials repo, he wrote the following code to make things consistent:

```python
def read_text_file(engine: ExecutionEngine, filepath: str) -> DataFrame:
    headers = read_header(filepath)
    if isinstance(engine, NativeExecutionEngine):
        # load_df uses pandas.read_csv
        df = engine.load_df(filepath, infer_schema=True, header=True, skiprows=3, names=headers)
    elif isinstance(engine, DuckExecutionEngine):
        # load_df uses duckdb read_csv_auto
        df = engine.load_df(filepath, infer_schema=True, skip=4, columns=headers)
    elif isinstance(engine, DaskExecutionEngine):
        # load_df uses dask.dataframe.read_csv
        df = engine.load_df(filepath, infer_schema=True, header=True, skiprows=3, names=headers)
    else:
        supported_engines = {NativeExecutionEngine, DuckExecutionEngine, DaskExecutionEngine}
        raise ValueError(f"Engine {engine} is not supported, must be one of {supported_engines}")
    return df
```

The Native engine and the Duck engine have different values for the skip argument (`skiprows=3` versus `skip=4`), and different keyword names as well. I think the takeaway here is that using `load_df` this way ties the code to engine-specific parameters.
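To see why the skip counts might differ, consider a hypothetical multi-header file; this illustration is an assumption, not taken from the original issue:

```python
# Hypothetical multi-header CSV: three metadata lines, then the header row.
content = """source: sensor-a
units: celsius
notes: raw export
temp,humidity
21.5,40
"""
# pandas-style engines: skiprows=3 drops the three metadata lines,
# header=True consumes the "temp,humidity" row, and names=headers
# supplies the column names.
# duckdb-style engine: skip=4 drops the metadata *and* the header row,
# and columns=headers supplies the column names directly.
```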
---

This code has multiple issues:

```python
# ... I can't easily use `fsql` instead as `columns` clashes ...
def read_text_file(filepath: str) -> DataFrame:
    headers = read_header(filepath)
    return fsql(f"LOAD '{filepath}' (skip=2, columns={headers})")

csv_filepath = create_temporary_file(content, suffix=".csv")
dag = FugueWorkflow()
df = dag.create(read_text_file, params={"filepath": csv_filepath})
df.show()
```

First, I think your first solution is better: CSV needs heavy customization, and using a creator to wrap the complexity is the better way.

Second, you could do something like:

```python
schema = ",".join([x + ":str" for x in headers])
fsql(f"LOAD '{filepath}' (skip=2) COLUMNS {schema}")
```

But you can already see it's tedious and not so intuitive, and actually it still can't work, because ...

Third, so what you could do, if you really want a programmatical solution, may be like this:

```python
from fugue_sql import FugueSQLWorkflow

schema = ",".join([x + ":str" for x in headers])
dag = FugueSQLWorkflow()
df = dag(f"LOAD '{filepath}' (skip=2) COLUMNS {schema}")
df.show()
```

But this interface is not supposed to be used by end users, and actually we are going to merge `fugue_sql` into `fugue`.
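To make the generated `COLUMNS` clause concrete, a quick example of the string the snippet above builds (the header names here are made up):

```python
# With hypothetical headers, the generated Fugue SQL looks like:
headers = ["temp", "humidity"]
schema = ",".join([x + ":str" for x in headers])
print(f"LOAD '/tmp/data.csv' (skip=2) COLUMNS {schema}")
# LOAD '/tmp/data.csv' (skip=2) COLUMNS temp:str,humidity:str
```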
---

Reading CSV is extremely hard to unify; by unifying it, we actually just create a new rule that may or may not be reasonable. As you can see, none of the CSV reading functions from different backends are perfect, and they are all drastically different. On the other hand, unifying CSV reading makes people want to stay with CSV, which is what we don't want to see. We want people to move away from CSV as early as possible. So it is hard to justify the effort to further improve the CSV features; at least for now, we can't prioritize it. Users can create Creators to read their special CSVs.

By the way, I think having if-else on engines inside a custom function is not a good practice. Remember, Fugue should make the coupling very loose, but this code is doing the opposite. Instead, if you know you will only use DuckDB, you can do this:

```python
from duckdb import DuckDBPyRelation, DuckDBPyConnection

def read_text_file(engine: DuckDBPyConnection, filepath: str) -> DuckDBPyRelation:
    headers = read_header(filepath)
    return engine.from_csv_auto(...)
```

This way, your creator is totally independent from Fugue, and will work only with the DuckDB backend. The DuckDB backend can recognize (and convert) `DuckDBPyRelation` (see fugue/fugue_duckdb/registry.py, line 76 at 68975b4).
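One hedged way to fill in the `...` above is to build the DuckDB `read_csv_auto` call in SQL, so the native `skip`/`columns` arguments pass straight through; the struct-literal `columns` syntax is DuckDB's own, and `read_header` is assumed as before:

```python
from duckdb import DuckDBPyRelation, DuckDBPyConnection

def read_text_file(engine: DuckDBPyConnection, filepath: str) -> DuckDBPyRelation:
    headers = read_header(filepath)
    # Map every column to VARCHAR; DuckDB's columns argument takes a
    # name -> type struct literal, e.g. {'temp': 'VARCHAR'}.
    cols = ", ".join([f"'{h}': 'VARCHAR'" for h in headers])
    return engine.query(
        f"SELECT * FROM read_csv_auto('{filepath}', skip=2, columns={{{cols}}})"
    )
```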
---

Thanks @goodwanghan & @kvnkho. I'm happy for my particular job to be DuckDB-specific, so:

```python
from duckdb import DuckDBPyRelation, DuckDBPyConnection

def read_text_file(engine: DuckDBPyConnection, filepath: str) -> DuckDBPyRelation:
    headers = read_header(filepath)
    return engine.from_csv_auto(...)
```

is a good fit :)