Search before asking
I searched in the issues and found nothing similar.
Motivation
Partition pushdown is a performance optimization technique that allows the query engine to filter out unnecessary data early in the query processing pipeline. By pushing down partition filters, we can significantly reduce the amount of data transferred and processed, leading to improved query performance and resource efficiency.
Consider a scenario where a user queries a large dataset partitioned by region and date. Without partition pushdown, the entire dataset needs to be scanned, which is inefficient. With partition pushdown, only the relevant partitions (e.g., data for a specific region and date range) are scanned, resulting in faster query execution and reduced resource usage.
wuchong changed the title [Feature] Support partition pushdown for Flink connector → [Feature] Support partition pushdown in Flink connector on Dec 16, 2024
Currently, flussAdmin.listPartitionInfos only returns the partition values and is missing the partition keys, while Flink requires a map containing both partition keys and values. So this must be blocked by #195.
@Alibaba-HZY you can get the partition keys from Table#getDescriptor(). You can implement partition pushdown for a single partition key first, so it is not blocked by #195.
For pushdown with multiple partition keys, yes, we need to extend Admin#listPartitionInfos to return a map of partition keys and values. That can be done in/after #195.
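To make the single-key approach concrete, here is a minimal standalone sketch of the shape such a pushdown could take. The class and method names (PartitionPruner, toPartitionSpecs) are illustrative only, not actual Fluss or Flink APIs; applyPartitions merely mirrors the contract of Flink's SupportsPartitionPushDown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: how a table source could accept pushed-down partitions
// when the table has exactly one partition key. PartitionPruner is an
// illustrative name, not a Fluss or Flink class.
public class PartitionPruner {
    private final String partitionKey;            // e.g. "ds", taken from the table descriptor
    private List<Map<String, String>> remaining = new ArrayList<>();

    public PartitionPruner(String partitionKey) {
        this.partitionKey = partitionKey;
    }

    // With a single partition key, the plain partition values returned by
    // Admin#listPartitionInfos can be zipped with that key directly,
    // producing the List<Map<key, value>> shape Flink expects.
    public List<Map<String, String>> toPartitionSpecs(List<String> partitionValues) {
        List<Map<String, String>> specs = new ArrayList<>();
        for (String value : partitionValues) {
            specs.add(Collections.singletonMap(partitionKey, value));
        }
        return specs;
    }

    // Mirrors SupportsPartitionPushDown#applyPartitions: the planner hands
    // back the partitions that survived pruning; only these are read.
    public void applyPartitions(List<Map<String, String>> remainingPartitions) {
        this.remaining = remainingPartitions;
    }

    public List<Map<String, String>> remainingPartitions() {
        return remaining;
    }
}
```

The enumerator would then discover buckets only for remainingPartitions() instead of all partitions.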
After the discussion with @luoyuxia:
In stream mode, we cannot determine the correct set of partitions. For example, suppose a table has three partitions ds=11, ds=12, ds=13 and the SQL is select * from table where ds > 10. applyPartitions receives ds=11, ds=12, ds=13, and the sourceEnumerator will be limited to those; but the user might later write ds=14, and the sourceEnumerator will never pick it up.
In batch mode, only tables with datalake enabled or point queries on the primary key are supported, and #40 is not closed.
So I think the partitionPushDown will not take effect.
@Alibaba-HZY yes. For batch mode, we have to wait for #40, and for streaming mode, we need to leverage SupportsFilterPushDown instead of SupportsPartitionPushDown. For example, if there is a where ds > 10, then you don't need to read the partitions from 0 to 10.
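The difference can be sketched as follows: with filter pushdown the source retains the predicate itself rather than a fixed partition list, so a partition created after planning (e.g. ds=14) still matches at discovery time. PartitionFilter below is an illustrative name, not a Fluss or Flink class, and the predicate stands in for a translated Flink filter expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch: the source keeps the pushed-down predicate (ds > 10)
// and re-applies it on every partition-discovery cycle of the enumerator,
// instead of being pinned to the partition list that existed at plan time.
public class PartitionFilter {
    private final Predicate<Integer> dsPredicate;

    public PartitionFilter(Predicate<Integer> dsPredicate) {
        this.dsPredicate = dsPredicate;
    }

    // Called each time the enumerator lists partitions; newly created
    // partitions (e.g. ds=14) are evaluated like any other.
    public List<Integer> selectPartitions(List<Integer> discovered) {
        List<Integer> selected = new ArrayList<>();
        for (Integer ds : discovered) {
            if (dsPredicate.test(ds)) {
                selected.add(ds);
            }
        }
        return selected;
    }
}
```

Under SupportsPartitionPushDown the enumerator would instead hold the frozen list {11, 12, 13} and silently skip ds=14.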
Solution
FlinkTableSource to implement SupportsPartitionPushDown and push down partitions.
FlinkSourceEnumerator only discovers buckets for the specific partitions to read.
Anything else?
No response
Willingness to contribute