You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Update the glue__create_tmp_table_as macro so that the file format uses parquet instead of flat file.
Describe alternatives you've considered
I have researched methods to decrease the amount of files created as the result of an insert overwrite incremental operation but none have been successful, except with the change suggested above. Use of the AQE to coalesce files with a configuration option such as spark.sql.adaptive.coalescePartitions.minPartitionSize or spark.sql.adaptive.coalescePartitions.minPartitionNum will not work as the final insert operation does not use shuffle partitions.
Additional context
Internal testing decreased the number of files created in a new partition by about 4x when the temporary table was created with parquet. Examining execution plans in SparkUI showed a lesser number of tasks generated on the scan from the temporary table during the insert into the base table. The tasks also packed in more rows versus when the temporary table was flatfile based.
Flat file based: 896 tasks w/ ~125k records each
Parquet based: 224 tasks w/ ~468k records each
Who will this benefit?
This should benefit anyone that utilizes insert overwrite incrementals.
Are you interested in contributing this feature?
I need to determine if I can submit a PR.
The text was updated successfully, but these errors were encountered:
Describe the feature
Update the
glue__create_tmp_table_as
macro so that the file format uses parquet instead of flat file.Describe alternatives you've considered
I have researched methods to decrease the amount of files created as the result of an
insert overwrite
incremental operation but none have been successful, except with the change suggested above. Use of the AQE to coalesce files with a configuration option such asspark.sql.adaptive.coalescePartitions.minPartitionSize
orspark.sql.adaptive.coalescePartitions.minPartitionNum
will not work as the final insert operation does not use shuffle partitions.Additional context
Internal testing decreased the number of files created in a new partition by about 4x when the temporary table was created with parquet. Examining execution plans in SparkUI showed a lesser number of tasks generated on the scan from the temporary table during the insert into the base table. The tasks also packed in more rows versus when the temporary table was flatfile based.
Flat file based: 896 tasks w/ ~125k records each
Parquet based: 224 tasks w/ ~468k records each
Who will this benefit?
This should benefit anyone that utilizes
insert overwrite
incrementals.Are you interested in contributing this feature?
I need to determine if I can submit a PR.
The text was updated successfully, but these errors were encountered: