Temporary table file format to parquet #409

Open
johnnyC9000 opened this issue Jul 15, 2024 · 0 comments
Labels
enhancement New feature or request

Describe the feature

Update the glue__create_tmp_table_as macro so that the temporary table is written in Parquet format instead of a flat-file format.
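
To make the request concrete, here is a minimal Python sketch of the kind of statement the macro could emit. The helper name `build_tmp_table_sql`, the relation name, and the query are hypothetical and are not taken from the adapter's code; the only part that relies on Spark itself is the `CREATE TABLE ... USING parquet AS SELECT` form of Spark SQL.

```python
# Hypothetical helper: illustrates the shape of the CTAS statement only.
# It is NOT the adapter's actual macro, which is written in Jinja.
ALLOWED_FORMATS = {"parquet", "orc", "csv", "json"}


def build_tmp_table_sql(relation: str, select_sql: str, file_format: str = "parquet") -> str:
    """Return a Spark SQL CTAS statement that writes the temp table in `file_format`."""
    fmt = file_format.lower()
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported file format: {file_format!r}")
    # `USING <format>` tells Spark which data source to use for the new table.
    return f"create table {relation} using {fmt} as {select_sql.strip()}"


# Relation and query below are made-up placeholders.
statement = build_tmp_table_sql(
    "my_schema.my_model__dbt_tmp",
    "select id, event_date from my_schema.events",
)
print(statement)
```

Defaulting `file_format` to `"parquet"` mirrors the proposal; exposing it as a parameter is just one possible design and would let the temporary table follow the base table's format instead.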

Describe alternatives you've considered

I have researched ways to reduce the number of files created by an insert overwrite incremental operation, but none have been successful except the change suggested above. Using Adaptive Query Execution (AQE) to coalesce files with a configuration option such as spark.sql.adaptive.coalescePartitions.minPartitionSize or spark.sql.adaptive.coalescePartitions.minPartitionNum does not work, because the final insert operation does not use shuffle partitions.

Additional context

In internal testing, creating the temporary table as Parquet reduced the number of files created in a new partition by a factor of about 4. Execution plans in the Spark UI showed fewer tasks generated by the scan of the temporary table during the insert into the base table. Each task also packed in more rows than when the temporary table was flat-file based.
Flat-file based: 896 tasks with ~125k records each
Parquet based: 224 tasks with ~468k records each
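
As a quick sanity check on the figures above, the arithmetic can be sketched in Python. The per-task record counts are the rounded values quoted in this issue, so the totals are approximate.

```python
# Figures quoted in the issue (per-task record counts are rounded).
flat_tasks, flat_rows_per_task = 896, 125_000
parquet_tasks, parquet_rows_per_task = 224, 468_000

# Reduction in scan tasks when the temp table is Parquet.
task_ratio = flat_tasks / parquet_tasks

# Increase in rows handled per task.
rows_ratio = parquet_rows_per_task / flat_rows_per_task

# Approximate total rows scanned in each run.
flat_total = flat_tasks * flat_rows_per_task
parquet_total = parquet_tasks * parquet_rows_per_task

print(f"task reduction: {task_ratio:.1f}x")
print(f"rows per task increase: {rows_ratio:.2f}x")
print(f"approx. total rows: flat={flat_total:,} parquet={parquet_total:,}")
```

The task count drops by exactly 4x, which lines up with the roughly 4x drop in output files, and the two totals agree to within about 7%, which is consistent with the per-task counts being rounded.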

Who will this benefit?

This should benefit anyone who uses insert overwrite incremental models.

Are you interested in contributing this feature?

I need to determine if I can submit a PR.

johnnyC9000 added the "enhancement" (New feature or request) label on Jul 15, 2024
johnnyC9000 changed the title from "Temporary table file format to match base table" to "Temporary table file format to parquet" on Jul 15, 2024