-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support additional join types with join_where #18669
Comments
For equality predicates this is fairly trivial. Choose the equality predicates as outer join and add the remaining as filters. For inequality I am not entirely sure how to interpret that yet. 🤔 I think the first goal is to support the single inequality predicate case, as that's low hanging fruit atm! |
I think interpretation should be the same as with ON statement in SQL: You return all the pair of rows that match all the condiitions in on (condition is calculated in terms of result) But for left/join/outer join you'll return also rows from left/right/both table that don't have a pair |
Yep I've made a start on this, I should hopefully have something to show soon |
I've started looking into this and have pushed a commit with my idea of what the Python API and behaviour should look like: adamreeve@7b35d09 Actually implementing this looks like it could be a fair bit of work though. For a join on equality conditions only, we could trivially convert this to an inner, left, right, full, semi or outer join. And for a join using one or two inequalities, we could fairly easily update the way join results are materialised in the IEJoin implementation to handle the different join types. But any joins that end up with extra conditions in It looks like rather than adding an extra post-processing node after the join, the best way forward would be to add support for extra non-equality predicates to all of the join implementations in order to handle this correctly and efficiently. Consider using the existing approach of a post-processing filter for left joins for example. If we do a left join or a left IE-join and then want to apply extra join predicates in a post-processing step, we could replace values in RHS table columns with nulls where any of the new conditions are false, but would end up with extra result rows for one LHS table row if it matches with multiple RHS rows and some are filtered out in post-processing. We could probably work around that by adding a temporary row index column, but this would get pretty clunky. Semi and anti joins would be more of a problem, we'd need to do an inner join first to keep enough information to apply the extra predicates, making this a lot less efficient than it should be. |
@adamreeve would it be more sensical to first implement a proper vectorized nested loop join? E.g. instead of a cross join + full materialization followed by filters we do a cartesian product on small batches + filters. This would be hit for predicates that cannot be used by IEJoin. I feel that we can reuse some of that logic here. |
Yeah that would make a lot of sense, we'd need that to handle left/full/semi/anti joins that don't use any equality or inequality conditions too as those couldn't be easily handled with a cross join and filter. We could use that for all non-inner For reference, this issue is related: #19654 |
Description
The
join_where
method added in #18365 behaves like an inner join, where only rows from the left and right DataFrames that match the condition are returned.It would be useful to support other join types like left, full, anti and semi.
I'm not sure how feasible it is to implement this, it seems like it would be straightforward when the join resolves to a single equi join, or single iejoin, but handling additional filter conditions could be tricky.
The text was updated successfully, but these errors were encountered: