fix(rust): Validate column names in unique() for empty DataFrames
#20411
Ensures that column names in the subset parameter are validated even when the dataframe is empty, maintaining consistent behavior with non-empty dataframes.
Add test cases to verify that unique() properly handles invalid column names in subset parameter for both empty and non-empty dataframes.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #20411 +/- ##
==========================================
+ Coverage 78.97% 79.02% +0.04%
==========================================
Files 1562 1562
Lines 220103 220171 +68
Branches 2486 2486
==========================================
+ Hits 173821 173981 +160
+ Misses 45709 45617 -92
Partials 573 573
These errors should be raised during conversion to IR, not at the implementation level.
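To illustrate the point about raising the error during planning rather than at execution, here is a minimal std-only sketch. `Frame`, `UniquePlan`, `ColumnNotFound`, and `plan_unique` are hypothetical stand-ins invented for this example, not Polars types or functions. The check consults only the column names, so its outcome cannot depend on how many rows the frame holds.

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-ins; none of these are Polars types.
#[derive(Debug, PartialEq)]
struct ColumnNotFound(String);

struct Frame {
    columns: Vec<String>,
    height: usize,
}

// A resolved "unique" step whose subset was validated when the plan was built.
#[derive(Debug, PartialEq)]
struct UniquePlan {
    subset: Vec<String>,
}

// Plan-time check: only the column names are consulted, never the row count.
fn plan_unique(frame: &Frame, subset: &[&str]) -> Result<UniquePlan, ColumnNotFound> {
    let known: HashSet<&str> = frame.columns.iter().map(String::as_str).collect();
    for name in subset {
        if !known.contains(name) {
            return Err(ColumnNotFound(name.to_string()));
        }
    }
    Ok(UniquePlan {
        subset: subset.iter().map(|s| s.to_string()).collect(),
    })
}

fn main() {
    let empty = Frame { columns: vec!["a".to_string()], height: 0 };
    let full = Frame { columns: vec!["a".to_string()], height: 3 };
    // Both frames share a schema, so both fail the same way for an unknown column.
    for frame in [&empty, &full] {
        println!("height={} -> {:?}", frame.height, plan_unique(frame, &["x"]));
    }
}
```

Because the validation happens before any executor runs, an executor that takes a shortcut on empty input (an assumption made here for illustration) would never see an invalid subset.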
@ritchie46 thanks for having a look. I am new to contributing to
let cols = expand_selectors(s, input_schema.as_ref(), &[])?;

// Checking if subset columns exist in the dataframe
cols.iter()
A bit nitpicky, but I think this can be written a little less verbosely.
for c in &cols {
    let _ = input_schema.try_get(c)?;
}
Haha no problem. My understanding was that if the subset column(s) do not exist, we need to raise polars.exceptions.ColumnNotFoundError. This approach would raise polars.exceptions.SchemaFieldNotFoundError. At a high level both seem appropriate to me, but if SchemaFieldNotFoundError is correct, I can update this piece to be less verbose, as you suggested.
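For reference, the difference between the two error kinds can be sketched with std-only Rust. `SketchError`, `try_get`, and `check_subset` below are hypothetical names made up for this illustration and do not reproduce the Polars definitions. The sketch assumes a schema lookup that reports a missing name as `SchemaFieldNotFound`, and shows one possible way a caller could remap it to `ColumnNotFound` if that were the desired error. Whether that remapping is wanted is exactly the open question in this thread.

```rust
// Hypothetical error enum with the two variants under discussion; not the Polars definition.
#[derive(Debug, PartialEq)]
enum SketchError {
    ColumnNotFound(String),
    SchemaFieldNotFound(String),
}

// Hypothetical schema lookup: a missing name is reported as SchemaFieldNotFound.
fn try_get<'a>(schema: &'a [(&'a str, &'a str)], name: &str) -> Result<&'a str, SketchError> {
    schema
        .iter()
        .find(|entry| entry.0 == name)
        .map(|entry| entry.1)
        .ok_or_else(|| SketchError::SchemaFieldNotFound(name.to_string()))
}

// Validate each subset column, remapping the lookup error to ColumnNotFound.
fn check_subset(schema: &[(&str, &str)], subset: &[&str]) -> Result<(), SketchError> {
    for name in subset {
        try_get(schema, name).map_err(|e| match e {
            SketchError::SchemaFieldNotFound(n) => SketchError::ColumnNotFound(n),
            other => other,
        })?;
    }
    Ok(())
}

fn main() {
    let schema = [("a", "i64"), ("b", "str")];
    println!("{:?}", try_get(&schema, "missing"));
    println!("{:?}", check_subset(&schema, &["a", "missing"]));
}
```

Dropping the `map_err` call gives the shorter loop, at the cost of surfacing the lookup's own error kind to the caller.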
Something like this:
let cols = expand_selectors(s, input_schema.as_ref(), &[])?;
// Checking if subset columns exist in the dataframe
for col in cols.iter() {
    let _ = input_schema.try_get(col)?;
}
Ok::<_, PolarsError>(cols)
This PR addresses an issue where the unique() function in Polars does not raise a ColumnNotFoundError when called on an empty DataFrame with an unknown subset of column names. The changes ensure that column names in the subset are validated before proceeding, thereby raising the appropriate exception.
Changes Made:
Rust:
UniqueExec executor to check that the subset of column names provided exists in an empty DataFrame.
Python Tests:
test_unique_with_bad_subset in test_unique.py to handle scenarios where subset column name(s) do not exist.
ColumnNotFoundError with appropriate message.
Linked Issue:
Closes #20209
Checklist:
main branch.
pytest for Python tests.