
Convert to Delta table support #473

Merged (4 commits into main, Dec 1, 2023)
Conversation

@gruuya (Contributor) commented Nov 27, 2023

Convert a Parquet table, as specified by a particular path, into a Delta table. The syntax closely follows that of Databricks.

  • The present implementation supports only in-place conversion, meaning the Parquet table must be stored in a UUID-named directory inside the object store root (i.e. bucket + any prefix).
  • Note that delta-rs doesn't support appending to or overwriting existing tables for now.
  • This also lacks support for partitioned Parquet tables. Those could in principle be supported by extending the proposed syntax to allow passing the partitioning scheme, but since we don't yet support partitioned tables in general (see Add support for partitioned tables #476), I have opted out of that.
  • Finally, one other missing piece is on-the-fly casting/coercion to the subset of Arrow types that Delta supports. In other words, if a Parquet file contains timestamp columns with second, millisecond or nanosecond resolution, this will error out.

Closes #469.
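The in-place requirement above can be illustrated with a small layout check: the Parquet files must live in a UUID-named directory directly under the object store root. This is a sketch with a hypothetical helper name, not Seafowl's actual code:

```rust
/// Hypothetical helper: returns true if `name` looks like a canonical
/// hyphenated UUID, e.g. "0192f1a2-3b4c-5d6e-7f80-91a2b3c4d5e6",
/// which is the directory layout the in-place conversion expects.
fn is_uuid_dir(name: &str) -> bool {
    let bytes = name.as_bytes();
    if bytes.len() != 36 {
        return false;
    }
    name.char_indices().all(|(i, c)| match i {
        // Hyphens sit at fixed positions in a canonical UUID string
        8 | 13 | 18 | 23 => c == '-',
        // Everything else must be a hex digit
        _ => c.is_ascii_hexdigit(),
    })
}

fn main() {
    assert!(is_uuid_dir("0192f1a2-3b4c-5d6e-7f80-91a2b3c4d5e6"));
    assert!(!is_uuid_dir("my-parquet-table"));
    println!("layout check ok");
}
```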

@mildbyte (Contributor) left a comment


(just some initial thoughts, not a review)

Comment on lines +246 to +249
let fields: Vec<(String, String)> = schema.fields()
.iter()
.map(|f| (f.name().clone(), field_to_json(f).to_string()))
.collect();
Contributor


Wait, it's already implemented in

seafowl/src/schema.rs

Lines 37 to 43 in 3d428c9

pub fn to_column_names_types(&self) -> Vec<(String, String)> {
    self.arrow_schema
        .fields()
        .iter()
        .map(|f| (f.name().clone(), field_to_json(f).to_string()))
        .collect()
}
Why the duplication? You could pull that out into a function that schema::Schema and this both call, if you want to avoid the schema::Schema wrapper struct.

Contributor Author


I started writing a comment, but then it got a bit out of hand, so I figured it warrants an issue of its own: #475
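The deduplication suggested above could roughly take this shape: extract the field-to-(name, type) mapping into a free function that both schema::Schema and the conversion code call. This sketch uses simplified stand-ins for Arrow's Field and the real field_to_json, purely for illustration:

```rust
// Simplified stand-in for arrow's Field type used in the real code.
#[derive(Clone)]
struct Field {
    name: String,
    data_type: String,
}

// Stand-in for the real `field_to_json`, which serializes an Arrow
// field to its JSON representation.
fn field_to_json(f: &Field) -> String {
    format!("{{\"name\":\"{}\",\"type\":\"{}\"}}", f.name, f.data_type)
}

/// Shared helper: map fields to (column name, serialized type) pairs,
/// so neither call site duplicates the iterator chain.
fn to_column_names_types(fields: &[Field]) -> Vec<(String, String)> {
    fields
        .iter()
        .map(|f| (f.name.clone(), field_to_json(f)))
        .collect()
}

fn main() {
    let fields = vec![Field { name: "ts".into(), data_type: "timestamp".into() }];
    let pairs = to_column_names_types(&fields);
    assert_eq!(pairs[0].0, "ts");
    println!("{}", pairs[0].1);
}
```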

src/context.rs Outdated
Comment on lines 732 to 744
self.internal_object_store
.copy_in_prefix(&path, &table_prefix)
.await?;

// Now convert them to a Delta table
ConvertToDeltaBuilder::new()
.with_log_store(table_log_store)
.with_table_name(&*table_name)
.with_comment(format!(
"Converted by Seafowl {}",
env!("CARGO_PKG_VERSION")
))
.await?
Contributor


I think we do want to support completely zero-copy conversion, but that means one of the following:

  • the source files have to be in a UUID-named directory inside of a prefix;
  • we have to track the full path for each table in the catalog (some intersection here with being able to persist external tables, including external Delta tables; probably an extension of #472, where we assume all tables have the same prefix); or
  • we do a destructive convert-to-delta and move all Parquet files to a UUID-named directory.

Contributor Author


we have to track the full path for each table in the catalog

Yeah, this makes the most sense to me going forward, but we'll need to think about the new catalog schema for that.

There's also a possibility of optionally outsourcing this info (completely or partially) to a 3rd party data catalog service, e.g. see here for how delta-rs envisions that: https://github.com/delta-io/delta-rs/blob/main/crates/deltalake-core/src/data_catalog/glue/mod.rs

For now though, I'm going to go with zero-copy, and assume the files are stored at exactly the right location (bucket + optional prefix + new table UUID).
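The "exactly the right location" assumption above can be sketched as a simple path construction: object store root (bucket + optional prefix) joined with the new table's UUID. Function and variable names here are illustrative, not Seafowl's actual API:

```rust
/// Hypothetical helper: build the location where the zero-copy conversion
/// expects to find the Parquet files (bucket + optional prefix + table UUID).
fn expected_table_location(bucket: &str, prefix: Option<&str>, table_uuid: &str) -> String {
    match prefix {
        // Trim stray slashes so "seafowl/" and "seafowl" behave the same
        Some(p) => format!("{}/{}/{}", bucket, p.trim_matches('/'), table_uuid),
        None => format!("{}/{}", bucket, table_uuid),
    }
}

fn main() {
    let loc = expected_table_location(
        "my-bucket",
        Some("seafowl/"),
        "0192f1a2-3b4c-5d6e-7f80-91a2b3c4d5e6",
    );
    assert_eq!(loc, "my-bucket/seafowl/0192f1a2-3b4c-5d6e-7f80-91a2b3c4d5e6");
    println!("{}", loc);
}
```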

Contributor


Makes sense for now. Note that the Parquet writer then has to know to generate a UUID-named directory to put the table into.

@gruuya gruuya force-pushed the object-store-root-in-sub-folder branch from c28c067 to ae09ec3 Compare November 28, 2023 09:27
Base automatically changed from object-store-root-in-sub-folder to main November 28, 2023 10:23
@gruuya gruuya linked an issue Nov 29, 2023 that may be closed by this pull request
@gruuya gruuya marked this pull request as ready for review November 29, 2023 14:00
@@ -272,20 +273,22 @@ pub async fn plan_to_object_store(
.collect()
}

pub(super) enum CreateDeltaTableDetails {
WithSchema(Schema),
Contributor


minor: Maaaaybe this should be called EmptyTable instead of WithSchema

Contributor Author


Makes sense; I'm not too happy with the enum/variant names anyway.

// COPY some values multiple times to test converting a flat table with more than one Parquet file
context
.plan_query(&format!(
"COPY (VALUES (1, 'one'), (2, 'two')) TO '{}/file_1.parquet'",
Contributor


I didn't realize we / DataFusion could do that 😅

Contributor Author


Yup, ever since #462

@gruuya gruuya merged commit a6e4bec into main Dec 1, 2023
1 check passed
@gruuya gruuya deleted the convert-to-delta-stmt branch December 1, 2023 16:57
Successfully merging this pull request may close these issues.

Implement conversion of parquet to Delta tables