[BUG] Save partitioned parquet dataset using NativeExecutionEngine #285
Comments
@LaurentErreca mentioned on Slack that the above modification fails on the following test:

    def test_io(self):
        path = os.path.join(self.tmpdir, "a")
        path2 = os.path.join(self.tmpdir, "b.test.csv")
        with self.dag() as dag:
            b = dag.df([[6, 1], [2, 7]], "c:int,a:long")
            b.partition(num=3).save(path, fmt="parquet", single=True)
            b.save(path2, header=True)
        assert FileSystem().isfile(path)

And it fails in the

What does the assert do?
In this line:

Why does it fail?
Tracing the code a bit, the code path goes like this (but you can ignore the first 2):
The issue here is that passing

What is the fix?
From what it looks like, Dask actually doesn't have this functionality either (but Spark does), so the

Other considerations
The

    from fugue.collections.partition import PartitionSpec
    from fugue_spark import SparkExecutionEngine

    p = PartitionSpec(num=3)
    SparkExecutionEngine().save_df(sdf, path, "parquet", "overwrite", p, True)
    print(p)

and more info can be found here. It will be quite hard and probably unnecessary to support the use case where the
We will try to solve it in #296
Hi,
When using the Dask execution engine, we can use parameter
Hey, @LaurentErreca it seems I have no access to your fork. But it is exciting to see you are working on it. I am looking forward to it!!
Oh I can see it now, never mind let me take a look. I will get back to you tomorrow.
Hi!
I would be happy to discuss that with you!
Cheers,
Laurent
On Tue, Mar 1, 2022, 9:53 AM Han Wang wrote:
Oh I can see it now, never mind let me take a look.
…utionEngine and DaskExecutionEngine (#306)
* Work in progress to fix issue 285 reported here #285
* Use option partition_on in Dask execution engine to write hive partitioned dataset
* Add handling for spark array type (#307)
* adding ecosystem to README
* adding ecosystem to README
* merge conflict
* Fugue plugin (#311)
* plugin
* update
* upgrading black version
* fixing black version
* Work in progress to fix issue 285 reported here #285
* Use option partition_on in Dask execution engine to write hive partitioned dataset
* Handle hive partitioning with Duckdb execution engine
* Clean code with pylint
* Use ArrowDataFrame(df.as_arrow()) instead of ArrowDataFrame(df.native.arrow())
Co-authored-by: WangCHX
Co-authored-by: Kevin Kho
Co-authored-by: Han Wang
I failed to execute the following command with NativeExecutionEngine:
SAVE PREPARTITION BY PRODUCT OVERWRITE PARQUET '{{output_path}}'
I received the following message:
partition_spec is not respected in NativeExecutionEngine.save_df
This should be possible with pandas, because to_parquet with engine='pyarrow' accepts a partition_cols argument that does the job.
I have quickly tested this:
In file native_execution_engine.py, in function save_df (line 369):
This worked for me.
Cheers,
Laurent
Environment :