
[Feature] Support partition pushdown in Flink connector #196

Open · 1 of 2 tasks
Labels: component=flink, feature (New feature or request)

wuchong (Member) opened this issue Dec 16, 2024 · 6 comments

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Partition pushdown is a performance optimization technique that allows the query engine to filter out unnecessary data early in the query processing pipeline. By pushing down partition filters, we can significantly reduce the amount of data transferred and processed, leading to improved query performance and resource efficiency.

Consider a scenario where a user queries a large dataset partitioned by region and date. Without partition pushdown, the entire dataset needs to be scanned, which is inefficient. With partition pushdown, only the relevant partitions (e.g., data for a specific region and date range) are scanned, resulting in faster query execution and reduced resource usage.

Solution

Anything else?

No response

Willingness to contribute

  • I'm willing to submit a PR!
wuchong added the feature (New feature or request) and component=flink labels on Dec 16, 2024
wuchong added this to the v0.6 milestone on Dec 16, 2024
wuchong changed the title from "[Feature] Support partition pushdown for Flink connector" to "[Feature] Support partition pushdown in Flink connector" on Dec 16, 2024
Alibaba-HZY commented

I would like to contribute to this issue. Could you assign it to me?

Alibaba-HZY commented

Currently, flussAdmin.listPartitionInfos only returns the partition values and is missing the partition keys, but Flink requires a map of partition keys to values. So this must be blocked by #195.
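For reference, this is the shape of the Flink ability interface the connector would implement; the signatures below mirror Flink's org.apache.flink.table.connector.source.abilities.SupportsPartitionPushDown, and the example partition map in the comments is only illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Flink's partition pushdown ability: the planner asks the source for all of its
// partitions as key->value maps (e.g. {"ds" -> "2024-12-16"}), prunes them with the
// query's partition predicates, and hands the surviving partitions back to the source.
public interface SupportsPartitionPushDown {

    // All partitions known to the source, each as a partition-key -> value map.
    Optional<List<Map<String, String>>> listPartitions();

    // The partitions that remain after the planner's pruning.
    void applyPartitions(List<Map<String, String>> remainingPartitions);
}
```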

wuchong (Member, Author) commented Dec 16, 2024

@Alibaba-HZY you can get the partition keys from Table#getDescriptor(). You can implement partition pushdown for a single partition key first, so it is not blocked by #195. See the sketch below.

For pushdown with multiple partition keys, yes, we need to extend Admin#listPartitionInfos to return a map of partition keys and values. That can be done in or after #195.
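A minimal sketch of the single-partition-key approach, assuming the key name is read from the table descriptor and combined with the value-only result of listPartitionInfos. The helper method, its inputs, and the accessor names mentioned in the comments are assumptions for illustration, not verified Fluss API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class SinglePartitionKeyExample {

    // partitionKey: e.g. "ds", assumed to come from Table#getDescriptor() (e.g. a
    //   getPartitionKeys() accessor), taking the first and only key for now.
    // partitionValues: e.g. ["2024-12-15", "2024-12-16"], the value-only listing
    //   currently returned by Admin#listPartitionInfos.
    static List<Map<String, String>> toFlinkPartitions(
            String partitionKey, List<String> partitionValues) {
        List<Map<String, String>> partitions = new ArrayList<>();
        for (String value : partitionValues) {
            // Build the key -> value map that SupportsPartitionPushDown#listPartitions expects.
            partitions.add(Collections.singletonMap(partitionKey, value));
        }
        return partitions;
    }
}
```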

Alibaba-HZY commented

@wuchong is partition pushdown only executed in batch mode? If so, batch mode currently only supports datalake-enabled tables or point queries on primary keys.

Alibaba-HZY commented

After discussing with @luoyuxia:

In stream mode, we cannot determine the correct set of partitions up front. For example, if a table has three partitions ds=11, ds=12, ds=13 and the SQL is select * from table where ds > 10, then applyPartitions receives ds=11, ds=12, ds=13 and the SourceEnumerator only reads those; but the user might later write ds=14, and the SourceEnumerator will never pick it up.

In batch mode, we only support datalake-enabled tables or point queries on primary keys, and #40 is not closed yet.

So I think partition pushdown will not take effect.

wuchong (Member, Author) commented Dec 20, 2024

@Alibaba-HZY yes. For batch mode, we have to wait for #40, and for streaming mode, we need to leverage SupportsFilterPushDown instead of SupportsPartitionPushDown. For example, if there is a where ds > 10 filter, then you don't need to read partitions 0 to 10.
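A rough sketch of how the streaming path could adopt Flink's SupportsFilterPushDown: the source accepts predicates on partition columns (such as ds > 10) and evaluates them against partitions as they are discovered, while returning all other predicates to the planner. FlussFilterPushDownSketch and isPartitionColumnFilter are hypothetical names; only the Flink interface and Result.of are existing API.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.table.connector.source.abilities.SupportsFilterPushDown;
import org.apache.flink.table.expressions.ResolvedExpression;

class FlussFilterPushDownSketch implements SupportsFilterPushDown {

    // Filters on partition columns that the source evaluates itself, e.g. "ds > 10",
    // so partitions 0..10 are never subscribed and a partition like ds=14 that
    // appears later can still be matched when it is discovered.
    private final List<ResolvedExpression> acceptedPartitionFilters = new ArrayList<>();

    @Override
    public Result applyFilters(List<ResolvedExpression> filters) {
        List<ResolvedExpression> remaining = new ArrayList<>();
        for (ResolvedExpression filter : filters) {
            if (isPartitionColumnFilter(filter)) { // hypothetical helper
                acceptedPartitionFilters.add(filter);
            } else {
                remaining.add(filter);
            }
        }
        // Accepted filters are applied by the source; remaining ones stay in the plan.
        return Result.of(acceptedPartitionFilters, remaining);
    }

    private boolean isPartitionColumnFilter(ResolvedExpression filter) {
        // Stubbed for illustration: a real implementation would check whether the
        // expression references only the table's partition columns.
        return false;
    }
}
```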
