Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support additional join types with join_where #18669

Open
adamreeve opened this issue Sep 10, 2024 · 6 comments · May be fixed by #19962
Open

Support additional join types with join_where #18669

adamreeve opened this issue Sep 10, 2024 · 6 comments · May be fixed by #19962
Labels
enhancement New feature or an improvement of an existing feature

Comments

@adamreeve
Copy link
Contributor

Description

The join_where method added in #18365 behaves like an inner join, where only rows from the left and right DataFrames that match the condition are returned.

It would be useful to support other join types like left, full, anti and semi.

I'm not sure how feasible it is to implement this, it seems like it would be straightforward when the join resolves to a single equi join, or single iejoin, but handling additional filter conditions could be tricky.

@adamreeve adamreeve added the enhancement New feature or an improvement of an existing feature label Sep 10, 2024
@ritchie46
Copy link
Member

For equality predicates this is fairly trivial. Choose the equality predicates as outer join and add the remaining as filters. For inequality I am not entirely sure how to interpret that yet. 🤔

I think the first goal is to support the single inequality predicate case, as that's low hanging fruit atm!

@AlexeyDmitriev
Copy link

AlexeyDmitriev commented Sep 11, 2024

I think interpretation should be the same as with ON statement in SQL:

You return all the pair of rows that match all the condiitions in on (condition is calculated in terms of result)

But for left/join/outer join you'll return also rows from left/right/both table that don't have a pair

@adamreeve
Copy link
Contributor Author

I think the first goal is to support the single inequality predicate case, as that's low hanging fruit atm!

Yep I've made a start on this, I should hopefully have something to show soon

@adamreeve
Copy link
Contributor Author

I've started looking into this and have pushed a commit with my idea of what the Python API and behaviour should look like: adamreeve@7b35d09

Actually implementing this looks like it could be a fair bit of work though. For a join on equality conditions only, we could trivially convert this to an inner, left, right, full, semi or outer join. And for a join using one or two inequalities, we could fairly easily update the way join results are materialised in the IEJoin implementation to handle the different join types. But any joins that end up with extra conditions in remaining_preds in resolve_join_where would be trickier to handle.

It looks like rather than adding an extra post-processing node after the join, the best way forward would be to add support for extra non-equality predicates to all of the join implementations in order to handle this correctly and efficiently.

Consider using the existing approach of a post-processing filter for left joins for example. If we do a left join or a left IE-join and then want to apply extra join predicates in a post-processing step, we could replace values in RHS table columns with nulls where any of the new conditions are false, but would end up with extra result rows for one LHS table row if it matches with multiple RHS rows and some are filtered out in post-processing. We could probably work around that by adding a temporary row index column, but this would get pretty clunky. Semi and anti joins would be more of a problem, we'd need to do an inner join first to keep enough information to apply the extra predicates, making this a lot less efficient than it should be.

@ritchie46
Copy link
Member

@adamreeve would it be more sensical to first implement a proper vectorized nested loop join? E.g. instead of a cross join + full materialization followed by filters we do a cartesian product on small batches + filters. This would be hit for predicates that cannot be used by IEJoin.

I feel that we can reuse some of that logic here.

@adamreeve
Copy link
Contributor Author

adamreeve commented Dec 5, 2024

Yeah that would make a lot of sense, we'd need that to handle left/full/semi/anti joins that don't use any equality or inequality conditions too as those couldn't be easily handled with a cross join and filter.

We could use that for all non-inner join_where operations initially and then do something like #19962 to optimise some scenarios.

For reference, this issue is related: #19654

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants