feat: Add how argument to join_where to support different join types #19962
… single-row DataFrame
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff            @@
##             main   #19962     +/-   ##
=========================================
  Coverage   79.47%   79.47%
=========================================
  Files        1555     1555
  Lines      216318   216670    +352
  Branches     2456     2456
=========================================
+ Hits       171919   172207    +288
- Misses      43841    43905     +64
  Partials      558      558
03b7033 to a0cfd67
a0cfd67 to 324fd31
I understand the use case for the extra join type. But why are extra predicates passed to the physical joins? Can you elaborate on that use case?
Ideally, I would apply them as filters on the logical plan.
@@ -497,6 +498,7 @@ pub(crate) fn into_py(py: Python<'_>, plan: &IR) -> PyResult<PyObject> {
)
.to_object(py)
},
// TODO: Add extra_predicates
This should already be added and raise NotImplemented.
The GPU executor should be able to handle these transparently, because it has support for mixed equality + arbitrary expression joins, so it would be good to just pass them through. I think this is a Vec<JoinPredicate>, and a JoinPredicate contains two ExprIRs and an operator, so I think (untested...): add
#[pyo3(get)]
extra_predicates: Vec<(PyExprIR, PyExprIR, PyOperator)>,
to the Join struct in this file and
extra_predicates: extra_predicates.iter().map(|jp| (jp.left_on, jp.right_on, jp.op).into()).collect(),
here.
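A minimal, standalone sketch of the flattening suggested above. The types ExprIR, Operator and JoinPredicate below are hypothetical stand-ins defined only for this example (the field names left_on, op and right_on follow the snippet in the comment, everything else is an assumption), and flatten_predicates is an illustrative helper, not a function from the codebase.

```rust
// Hypothetical stand-ins for the real plan types; the shapes here are
// assumptions made only so the sketch compiles on its own.
#[derive(Clone, Debug, PartialEq)]
struct ExprIR(String);

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Operator {
    Lt,
    LtEq,
    Gt,
    GtEq,
}

#[allow(dead_code)]
#[derive(Clone, Debug)]
struct JoinPredicate {
    left_on: ExprIR,
    op: Operator,
    right_on: ExprIR,
}

/// Flatten each predicate into a (left, right, operator) tuple, the shape
/// proposed for exposing extra predicates on the Python-visible Join node.
fn flatten_predicates(preds: &[JoinPredicate]) -> Vec<(ExprIR, ExprIR, Operator)> {
    preds
        .iter()
        // `iter()` yields references, so the expressions are cloned here
        // rather than moved out of the borrowed predicate.
        .map(|jp| (jp.left_on.clone(), jp.right_on.clone(), jp.op))
        .collect()
}
```

In this sketch the fields are cloned because they cannot be moved out of a borrowed predicate. Whether the real conversion needs a clone or a dedicated wrapper conversion is not stated in the thread.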
if !extra_predicates.is_empty() {
// TODO: How to handle this? Can we just add them back to predicates?
// Do we need to convert any IEJoins back to the non-IE join type specified to join_where?
panic!("Cannot convert IR back to LP with extra predicates");
We should not panic. IR cannot be completely mapped back to DSL; it is a best effort. If we can, we must recreate the predicates.
Yeah, this was just a placeholder to remind myself to fix this later. I guess it should have been a todo rather than a panic.
I don't think applying them as filters would work very well. To implement a left join as a normal left join followed by filters for any non-equi predicates, the filter operation would need to convert RHS values to null rather than remove rows. But in some cases it would have to remove rows instead, namely when they correspond to an LHS row that has other matches where the extra predicates are true. So you'd need to do something like add a temporary row index to know when result rows came from the same LHS input row. Alternatively, you could do an inner join with a row index added to the LHS, followed by filters, then add back the LHS rows that aren't in the result. I think the approach I'm suggesting here is a lot simpler. I explained my reasoning for this approach a bit more in #18669 (comment).
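To make the difference concrete, here is a small standalone sketch in plain Rust (no Polars types). It compares a nested-loop left join that evaluates the extra predicate as part of the join condition with an equi left join followed by a filter. The Row alias and both function names are made up for this illustration, and the nested loop is only a reference for the semantics, not how the PR implements the join.

```rust
/// A row is a (key, value) pair; the key is used for the equality predicate.
type Row = (u32, i64);

/// Left join where `extra` is evaluated as part of the join condition.
/// A left row with no right row satisfying both predicates is kept once,
/// paired with `None`.
fn left_join_with_extra(
    left: &[Row],
    right: &[Row],
    extra: impl Fn(&Row, &Row) -> bool,
) -> Vec<(Row, Option<Row>)> {
    let mut out = Vec::new();
    for l in left {
        let mut matched = false;
        for r in right {
            if l.0 == r.0 && extra(l, r) {
                out.push((*l, Some(*r)));
                matched = true;
            }
        }
        if !matched {
            out.push((*l, None));
        }
    }
    out
}

/// Naive alternative: equi left join first, then drop joined rows that
/// fail `extra`. Left rows whose equi matches all fail `extra` disappear.
fn left_join_then_filter(
    left: &[Row],
    right: &[Row],
    extra: impl Fn(&Row, &Row) -> bool,
) -> Vec<(Row, Option<Row>)> {
    left_join_with_extra(left, right, |_, _| true)
        .into_iter()
        .filter(|(l, r)| match r {
            Some(r) => extra(l, r),
            None => true,
        })
        .collect()
}
```

With left rows (1, 10) and (2, 20), right rows (1, 5), (1, 15) and (2, 25), and the extra predicate "left value > right value", the first function keeps (2, 20) paired with None, while the second drops that left row entirely. Repairing the second result would need the row-index bookkeeping described above.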
Fixes #18669
This adds a new how argument to join_where that supports a subset of the join types of the join method.
I've opened this as a draft PR initially to get feedback on the approach before completing the implementation. For now, the only extra join types that I've implemented are left joins with extra non-equi predicates, and left IE joins without any extra predicates.