Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add how argument to join_where to support different join types #19962

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

adamreeve
Copy link
Contributor

Fixes #18669

This adds a new how argument to join_where that supports a subset of the join types of the join method.

I've opened this as a draft PR initially to get feedback on the approach before completing the implementation. For now the only extra types of joins that I've implemented are left joins with extra non-equi predicates, and left IE joins without any extra predicates.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Nov 25, 2024
Copy link

codecov bot commented Nov 25, 2024

Codecov Report

Attention: Patch coverage is 86.25000% with 55 lines in your changes missing coverage. Please review.

Project coverage is 79.47%. Comparing base (3fad074) to head (324fd31).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-plan/src/plans/conversion/join.rs 79.48% 16 Missing ⚠️
...s/polars-ops/src/frame/join/dispatch_left_right.rs 88.98% 13 Missing ⚠️
crates/polars-mem-engine/src/executors/join.rs 65.62% 11 Missing ⚠️
crates/polars-plan/src/plans/visitor/hash.rs 66.66% 7 Missing ⚠️
crates/polars-lazy/src/frame/mod.rs 63.63% 4 Missing ⚠️
crates/polars-ops/src/frame/join/iejoin/mod.rs 95.00% 3 Missing ⚠️
crates/polars-plan/src/plans/ir/mod.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main   #19962    +/-   ##
========================================
  Coverage   79.47%   79.47%            
========================================
  Files        1555     1555            
  Lines      216318   216670   +352     
  Branches     2456     2456            
========================================
+ Hits       171919   172207   +288     
- Misses      43841    43905    +64     
  Partials      558      558            

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the use case of the extra join type. But why are there extra predicates passed to the physical joins? Can you elaborate more on that use case.

Ideally I apply them as filters on the logical plan.

@@ -497,6 +498,7 @@ pub(crate) fn into_py(py: Python<'_>, plan: &IR) -> PyResult<PyObject> {
)
.to_object(py)
},
// TODO: Add extra_predicates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should already be added an raise NotImplemented.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GPU executor should be able to handle these transparently, because it has support for mixed equality + arbitrary expression joins, so it would be good to just pass them through. I think this is a Vec<JoinPredicate> and a JoinPredicate contains two ExprIRs and an operator, so I think (untested...)

Add

#[pyo3(get)]
extra_predicates: Vec<(PyExprIR, PyExprIR, PyOperator)>,

To the Join struct in this file and

extra_predicates: extra_predicates.iter().map(|jp| (jp.left_on, jp.right_on, jp.op).into()).collect(),

here.

if !extra_predicates.is_empty() {
// TODO: How to handle this? Can we just add them back to predicates?
// Do we need to convert any IEJoins back to the non-IE join type specified to join_where?
panic!("Cannot convert IR back to LP with extra predicates");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not panic. IR cannot be completely mapped back to DSL, it is a best effort. If we can we must recreate the predicates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this was just a placeholder to remind myself to fix this later, I guess it should have been a todo rather than panic

@adamreeve
Copy link
Contributor Author

I understand the use case of the extra join type. But why are there extra predicates passed to the physical joins? Can you elaborate more on that use case.

Ideally I apply them as filters on the logical plan.

I don't think applying them as filters would really work very well. To implement a left join, if you did a normal left join followed by filters for any non-equi predicates, the filter operation would need to convert RHS values to null rather than remove rows, but in some cases you would remove rows if they correspond to a LHS row that has other matches where the extra predicates are true. So you'd need to do something like add a temporary row index to know when result rows came from the same LHS input row. Or alternatively you could do an inner join with a row index added to the LHS, followed by filters, then add back LHS rows that aren't in the result. I think the approach I'm suggesting here is a lot simpler.

I explained my reasoning for this approach a bit more in #18669 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support additional join types with join_where
3 participants