feat: Add strict parameter to pl.concat(how='horizontal') #20019
…d added a corresponding unit test
Heya, thank you for the PR. |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##             main   #20019   +/-   ##
=======================================
  Coverage   79.52%   79.53%
=======================================
  Files        1563     1563
  Lines      217104   217121    +17
  Branches     2464     2464
=======================================
+ Hits       172659   172690    +31
+ Misses      43885    43871    -14
  Partials      560      560
☔ View full report in Codecov by Sentry. |
py-polars/polars/functions/eager.py
@@ -231,6 +240,14 @@ def concat(
                )
            )
        elif how == "horizontal":
            if strict:
                nrows = first.select(F.len()).collect()[0, 0]
This should be implemented on the Rust side because the collect here could trigger a massive computation if the query plan is complex, and the result is then thrown away. The check should be performed when the concatenation operation is actually applied.
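To make the cost concrete, here is a minimal sketch in plain Python. It does not use Polars: `CountingPlan` is a made-up stand-in for a lazy query plan that only counts how often it is executed, and both concat helpers are hypothetical. It compares a height check done up front (as in the diff above) with a check done while the concatenation itself runs.

```python
class CountingPlan:
    """Toy stand-in for a lazy query plan (hypothetical, not a Polars type)."""

    def __init__(self, rows):
        self._rows = rows
        self.executions = 0

    def collect(self):
        # Every collect() re-runs the whole (possibly expensive) plan.
        self.executions += 1
        return list(self._rows)


def concat_with_early_check(plans):
    """Check heights up front: costs one extra collect per input."""
    heights = [len(p.collect()) for p in plans]  # results discarded after len()
    if len(set(heights)) > 1:
        raise ValueError(f"height mismatch: {heights}")
    return [p.collect() for p in plans]


def concat_with_check_at_execution(plans):
    """Check heights while the concatenation runs: one collect per input."""
    frames = [p.collect() for p in plans]
    heights = [len(f) for f in frames]
    if len(set(heights)) > 1:
        raise ValueError(f"height mismatch: {heights}")
    return frames


early = [CountingPlan([1, 2, 3]), CountingPlan([4, 5, 6])]
concat_with_early_check(early)

late = [CountingPlan([1, 2, 3]), CountingPlan([4, 5, 6])]
concat_with_check_at_execution(late)

print([p.executions for p in early], [p.executions for p in late])
# [2, 2] [1, 1]
```

In this toy model the up-front check doubles the number of plan executions, which is the overhead the review comment is pointing at.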
Understood. When I first thought about it, I did not take into account how I would compare the number of rows on LazyFrames.
@mcrumiller @coastalwhite Unfortunately, my machine is not capable of running |
@nimit I'm not a repo member, I just lurk here a lot, but I can try to help you get things working--what's the issue with running |
Thanks for your help! |
… single element and ddof=1 and there are nulls elsewhere in the Series (pola-rs#20077)
@nimit it's failing on the new streaming engine:
~/projects/polars/py-polars$ export POLARS_AUTO_NEW_STREAMING=1
~/projects/polars/py-polars$ pytest /home/mcrumiller/projects/polars/py-polars/tests/unit/functions/test_concat.py
===================== test session starts =====================
platform linux -- Python 3.12.6, pytest-8.3.2, pluggy-1.5.0
codspeed: 3.0.0 (disabled, mode: walltime, timer_resolution: 1.0ns)
rootdir: /home/mcrumiller/projects/polars/py-polars
configfile: pyproject.toml
plugins: cov-6.0.0, codspeed-3.0.0, hypothesis-6.119.4, xdist-3.6.1
collected 4 items / 2 deselected / 2 selected
tests/unit/functions/test_concat.py F.                   [100%]
========================== FAILURES ===========================
_______________ test_concat_horizontally_strict _______________
tests/unit/functions/test_concat.py:32: in test_concat_horizontally_strict
    with pytest.raises(pl.exceptions.ShapeError):
E   Failed: DID NOT RAISE <class 'polars.exceptions.ShapeError'>
=================== short test summary info ===================
FAILED tests/unit/functions/test_concat.py::test_concat_horizontally_strict - Failed: DID NOT RAISE <class 'polars.exceptions.ShapeError'>
========== 1 failed, 1 passed, 2 deselected in 0.13s ==========
I'll look into it. |
@nimit can you set this PR to draft until we can get this working? |
I'm not familiar at all with the new streaming engine. After taking a look, it looks like there is a parameter in there called We need to figure out how to propagate this parameter. Two places to start are |
I believe this should work?
And in
|
@nimit yep! That does it. The linter may complain about |
@mcrumiller any idea about the error? It is again failing using the streaming engine |
I believe that if one of the frames is length-1 there is some broadcasting happening. If we try lengths 3 and 2, it does raise:
df1 = pl.LazyFrame({"a": [0, 1, 2], "b": [1, 2, 3]})
df2 = pl.LazyFrame({"c": [11, 22], "d": [33, 44]})
df = pl.concat([df1.lazy(), df2.lazy()], how="horizontal", strict=True).collect()
# polars.exceptions.ShapeError: zip node received non-equal length inputs
If I set df2 to length 1, we get broadcasting:
df1 = pl.LazyFrame({"a": [0, 1, 2], "b": [1, 2, 3]})
df2 = pl.LazyFrame({"c": [11], "d": [33]})
df = pl.concat([df1.lazy(), df2.lazy()], how="horizontal", strict=True).collect()
# shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c   ┆ d   │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ 0   ┆ 1   ┆ 11  ┆ 33  │
│ 1   ┆ 2   ┆ 11  ┆ 33  │
│ 2   ┆ 3   ┆ 11  ┆ 33  │
└─────┴─────┴─────┴─────┘
Need to understand how the broadcasting here works. |
Is the broadcasting expected behavior though? Maybe it wouldn't broadcast if we declare df2 as, Also, the second variant (without broadcasting), generates an error but it doesn't say anything about
|
polars/crates/polars-stream/src/nodes/zip.rs Line 129 in 5f3e8a6
So negating I think the intent here is, when faced with a length-1 df concatenated with other dfs, then either we can 1) broadcast, or 2) fill nulls. Our
I wonder if the intent is to have some sort of |
This pull request adds the broadcasting/null extend. |
@nimit why don't you:
|
…lementation and added a more robust set of tests on concat
I'm at work so can't rust but I can help looking into the failures tonight. |
I think I figured out the intention behind the negation of |
That was my guess in my prior message. We may have to make this a new parameter in the streaming, since |
@mcrumiller I cannot think of a very low impact way of dealing with this. |
We can't. The entire point of streaming is that it applies to operations where the data "streams" in. You don't know that the heights are different until one df ends and the other doesn't. This has to be a parameter. |
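A rough illustration of that point, in plain Python rather than the actual polars-stream code: `strict_streaming_zip` is a hypothetical helper that zips row iterators and can only report a length mismatch once one input runs dry, by which time earlier rows have already been handed downstream.

```python
from itertools import zip_longest

_END = object()  # sentinel marking an exhausted input


def strict_streaming_zip(*streams):
    """Yield rows from several streams; raise once one ends before the others."""
    for i, row in enumerate(zip_longest(*streams, fillvalue=_END)):
        if any(value is _END for value in row):
            raise ValueError(f"inputs have different lengths (detected at row {i})")
        yield row


emitted = []
error = None
try:
    for row in strict_streaming_zip(iter([0, 1, 2]), iter([10, 20])):
        emitted.append(row)  # rows reach the consumer before the mismatch is known
except ValueError as exc:
    error = str(exc)

print(emitted)
# [(0, 10), (1, 20)]
print(error)
# inputs have different lengths (detected at row 2)
```

Two rows are emitted before the error surfaces, so the strictness decision has to live inside the node doing the zipping; it cannot be a cheap pre-check on the plan.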
Oh right... That must also mean that all multi-df operations would use Zip (right?). |
Looking at the Zip node definition, it's this:
Zip {
    inputs: Vec<PhysNodeKey>,
    /// If true shorter inputs are extended with nulls to the longest input,
    /// if false all inputs must be the same length, or have length 1 in
    /// which case they are broadcast.
    null_extend: bool,
},
This assumes two possible behaviors when confronted with a 1-row frame against a multi-row frame: broadcast or null-fill. When I think the way to go is to replace this parameter with an Enum called something like Let's see if we can get Ritchie's input here. |
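One way to picture the three-way choice is the sketch below, written in Python for brevity even though the real node is Rust. The enum name `ZipBehavior`, its variants, and `zip_columns` are all hypothetical and chosen only for illustration; they are not taken from the Polars codebase. The first two variants mirror the documented `null_extend` true/false behavior, and the third is the strict mode this PR is after.

```python
from enum import Enum


class ZipBehavior(Enum):  # hypothetical name, not from the Polars codebase
    BROADCAST = "broadcast"      # length-1 inputs are repeated; other mismatches raise
    NULL_EXTEND = "null_extend"  # shorter inputs are padded with None
    STRICT = "strict"            # all inputs must have exactly the same length


def zip_columns(columns, behavior):
    """Align a non-empty list of columns (plain lists) according to `behavior`."""
    target = max(len(col) for col in columns)
    out = []
    for col in columns:
        if len(col) == target:
            out.append(list(col))
        elif behavior is ZipBehavior.NULL_EXTEND:
            out.append(list(col) + [None] * (target - len(col)))
        elif behavior is ZipBehavior.BROADCAST and len(col) == 1:
            out.append(list(col) * target)
        else:
            raise ValueError("zip received non-equal length inputs")
    return out


cols = [[0, 1, 2], [11]]
print(zip_columns(cols, ZipBehavior.BROADCAST))
# [[0, 1, 2], [11, 11, 11]]
print(zip_columns(cols, ZipBehavior.NULL_EXTEND))
# [[0, 1, 2], [11, None, None]]
try:
    zip_columns(cols, ZipBehavior.STRICT)
except ValueError as exc:
    print(exc)
# zip received non-equal length inputs
```

With a single boolean there is no way to express the third row of that table, which is why a length-1 frame still broadcasts when `strict=True` is passed through as a negated `null_extend`.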
@orlp might be able to produce some input next week |
PR that closes #19133
Changed the Python package so that, when how='horizontal', the number of rows in the first element is checked against every other element, for both lazy and eager frames.
strict defaults to False.
Also added unit tests for the changes for cases: