Temporary table file format to parquet #409

johnnyC9000 · 2024-07-15T18:01:17Z

Describe the feature

Update the glue__create_tmp_table_as macro so that the temporary table is created in parquet format instead of a flat-file format.
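
To make the request concrete, here is a minimal sketch of the kind of statement the macro would need to emit. It is not the actual dbt-glue macro body: the helper name render_tmp_table_ctas, its parameters, and the list of allowed formats are hypothetical and exist only for illustration. The only thing taken as given is that Spark SQL accepts a USING clause in CREATE TABLE ... AS SELECT.

```python
# Hypothetical helper, not part of dbt-glue: renders a Spark SQL CTAS
# statement whose storage format is set explicitly with a USING clause.
ALLOWED_FORMATS = {"parquet", "orc", "csv", "json"}  # illustrative subset


def render_tmp_table_ctas(relation: str, select_sql: str, file_format: str = "parquet") -> str:
    """Return a CREATE TABLE ... USING <format> AS <select> statement."""
    fmt = file_format.lower()
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported file format: {file_format!r}")
    # Strip surrounding whitespace so the SELECT starts cleanly on its own line.
    return f"create table {relation} using {fmt} as\n{select_sql.strip()}"


if __name__ == "__main__":
    # Table and column names below are made up for the example.
    print(render_tmp_table_ctas("my_schema.my_model__tmp", "select id, ds from source_table"))
```

Defaulting the format to parquet while leaving it overridable would keep the current behavior reachable for anyone who depends on it.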

Describe alternatives you've considered

I have researched ways to reduce the number of files created by an insert overwrite incremental operation, but none have been successful except the change suggested above. Using AQE to coalesce files with a configuration option such as spark.sql.adaptive.coalescePartitions.minPartitionSize or spark.sql.adaptive.coalescePartitions.minPartitionNum does not work, because the final insert operation does not use shuffle partitions.

Additional context

Internal testing decreased the number of files created in a new partition by about 4x when the temporary table was created with parquet. Examining execution plans in the Spark UI showed fewer tasks generated on the scan of the temporary table during the insert into the base table. Each task also processed more rows than when the temporary table was flat-file based.
Flat file based: 896 tasks with ~125k records each
Parquet based: 224 tasks with ~468k records each
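
As a quick sanity check on these figures, the sketch below computes the task ratio and the approximate total row counts implied by each run. The ScanStats class is a made-up name for this example, and the per-task record counts are the rounded values quoted above, so the totals are approximations only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanStats:
    """Task count and approximate records per task for one scan (rounded figures)."""
    tasks: int
    records_per_task: int

    @property
    def total_records(self) -> int:
        return self.tasks * self.records_per_task


flat_file = ScanStats(tasks=896, records_per_task=125_000)
parquet = ScanStats(tasks=224, records_per_task=468_000)

# How many times fewer tasks the parquet-backed scan used.
task_ratio = flat_file.tasks / parquet.tasks
# How many times more records each parquet-backed task handled.
density_ratio = parquet.records_per_task / flat_file.records_per_task

if __name__ == "__main__":
    print(f"task ratio: {task_ratio:.2f}x")
    print(f"records-per-task ratio: {density_ratio:.2f}x")
    print(f"approx. total records: {flat_file.total_records:,} vs {parquet.total_records:,}")
```

The task count drops by exactly 4x, in line with the roughly 4x reduction in files, while the implied totals (about 112M and 105M rows) agree to within the rounding of the per-task figures.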

Who will this benefit?

This should benefit anyone who uses insert overwrite incrementals.

Are you interested in contributing this feature?

I need to determine if I can submit a PR.

johnnyC9000 added the enhancement New feature or request label Jul 15, 2024

johnnyC9000 changed the title ~~Temporary table file format to match base table~~ Temporary table file format to parquet Jul 15, 2024
