fix(rust): Validate column names in `unique()` for empty DataFrames #20411

Biswas-N · 2024-12-23T06:08:04Z

This PR addresses an issue where unique() in Polars does not raise a ColumnNotFoundError when it is called on an empty DataFrame with a subset that names columns that do not exist. The change validates the column names in the subset before any further processing, so the appropriate exception is raised.

Changes Made:

Rust:
- Added validation logic in the UniqueExec executor to check that the column names in the subset exist, even when the DataFrame is empty.
Python Tests:
- Introduced a new test, test_unique_with_bad_subset, in test_unique.py to cover subsets whose column name(s) do not exist.
- Ensured that an invalid subset raises a ColumnNotFoundError with an appropriate message.
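The Rust-side idea can be pictured with a small, self-contained sketch. It does not use Polars types: `ToyFrame`, `ToyError`, and `unique_checked` are hypothetical stand-ins, and the early return for empty input is an assumption about why the missing column previously went unnoticed.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a DataFrame: column names plus row-major data.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyFrame {
    pub columns: Vec<String>,
    pub rows: Vec<Vec<i64>>,
}

/// Hypothetical error type (Polars has its own `PolarsError`).
#[derive(Debug, PartialEq)]
pub enum ToyError {
    ColumnNotFound(String),
}

/// Keep the first row for each distinct combination of `subset` values.
/// The subset is validated before the empty-frame shortcut, so an unknown
/// column is reported even when there are no rows.
pub fn unique_checked(df: &ToyFrame, subset: &[&str]) -> Result<ToyFrame, ToyError> {
    // Resolve every subset name to a column index, failing on the first miss.
    let mut idx = Vec::with_capacity(subset.len());
    for name in subset {
        match df.columns.iter().position(|c| c.as_str() == *name) {
            Some(i) => idx.push(i),
            None => return Err(ToyError::ColumnNotFound((*name).to_string())),
        }
    }

    // Assumed shortcut for empty input: nothing to deduplicate.
    if df.rows.is_empty() {
        return Ok(df.clone());
    }

    // `insert` returns false for a key that was already seen.
    let mut seen = BTreeSet::new();
    let rows = df
        .rows
        .iter()
        .filter(|row| seen.insert(idx.iter().map(|&i| row[i]).collect::<Vec<i64>>()))
        .cloned()
        .collect();
    Ok(ToyFrame { columns: df.columns.clone(), rows })
}
```

If the shortcut ran before the name resolution, an empty frame would return `Ok` for any subset, which is the inconsistency described in the linked issue.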

Linked Issue:

Closes #20209

Checklist:

Changes rebased against the latest main branch.
All new and existing tests pass.
Verified using pytest for Python tests.
Code adheres to the repository's contribution guidelines.

Ensures that column names in the subset parameter are validated even when the dataframe is empty, maintaining consistent behavior with non-empty dataframes.

Add test cases to verify that unique() properly handles invalid column names in subset parameter for both empty and non-empty dataframes.

codecov · 2024-12-23T06:59:59Z

Codecov Report

Attention: Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.02%. Comparing base (62ebbe5) to head (c982905).
Report is 10 commits behind head on main.

Files with missing lines	Patch %	Lines
...ates/polars-plan/src/plans/conversion/dsl_to_ir.rs	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #20411      +/-   ##
==========================================
+ Coverage   78.97%   79.02%   +0.04%     
==========================================
  Files        1562     1562              
  Lines      220103   220171      +68     
  Branches     2486     2486              
==========================================
+ Hits       173821   173981     +160     
+ Misses      45709    45617      -92     
  Partials      573      573


ritchie46 · 2024-12-23T08:43:43Z

These errors should be raised during conversion to IR, not at the implementation level.
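As a rough illustration of what raising the error during conversion means, here is a minimal sketch with a toy plan. `PlanDsl`, `PlanIr`, and `to_ir` are hypothetical names and are not the Polars types; the point is only that the check needs the input schema and never looks at the data, so empty and non-empty inputs behave the same.

```rust
/// Hypothetical user-facing plan: column names are not yet checked.
pub enum PlanDsl {
    Scan { schema: Vec<String> },
    Unique { input: Box<PlanDsl>, subset: Vec<String> },
}

/// Hypothetical resolved plan: every column reference has been validated.
#[derive(Debug, PartialEq)]
pub enum PlanIr {
    Scan { schema: Vec<String> },
    Unique { input: Box<PlanIr>, subset: Vec<String> },
}

impl PlanIr {
    /// Output column names of this node (`Unique` does not change them).
    pub fn schema(&self) -> &[String] {
        match self {
            PlanIr::Scan { schema } => schema,
            PlanIr::Unique { input, .. } => input.schema(),
        }
    }
}

/// Convert the toy DSL into the toy IR, rejecting unknown subset columns.
pub fn to_ir(dsl: PlanDsl) -> Result<PlanIr, String> {
    match dsl {
        PlanDsl::Scan { schema } => Ok(PlanIr::Scan { schema }),
        PlanDsl::Unique { input, subset } => {
            // Convert the input first so its schema is known.
            let input = to_ir(*input)?;
            for name in &subset {
                if !input.schema().contains(name) {
                    return Err(format!("column not found: {}", name));
                }
            }
            Ok(PlanIr::Unique { input: Box::new(input), subset })
        }
    }
}
```

With this shape, an executor for `Unique` can assume its subset is valid, because an invalid plan never reaches execution.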

Biswas-N · 2024-12-23T13:57:54Z

@ritchie46 thanks for having a look. I am new to contributing to pola-rs. Could you point me at some code that raises errors during conversion to IR? It would help me understand the pola-rs way of doing things.

ritchie46 · 2024-12-24T09:45:46Z

crates/polars-plan/src/plans/conversion/dsl_to_ir.rs

+                    let cols = expand_selectors(s, input_schema.as_ref(), &[])?;
+
+                    // Checking if subset columns exist in the dataframe
+                    cols.iter()


A bit nitpicky, but I think this can be written a little less verbosely.

for c in &cols { let _ = input_schema.try_get(c)?; }

Haha, no problem. My understanding was that if the subset column(s) do not exist, we need to raise polars.exceptions.ColumnNotFoundError. This approach would raise polars.exceptions.SchemaFieldNotFoundError. At a high level both seem appropriate to me, but if SchemaFieldNotFoundError is correct, I can make this piece less verbose, as you suggested.

Something like this:

let cols = expand_selectors(s, input_schema.as_ref(), &[])?;
// Checking if subset columns exist in the dataframe
for col in cols.iter() {
    let _ = input_schema.try_get(col)?;
}
Ok::<_, PolarsError>(cols)
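The difference between the two error kinds discussed above can be shown with a toy schema. Everything here (`SketchError`, `SketchSchema`, `check_subset`) is hypothetical and only mirrors the names in the thread. Converting the error is one possible option, not necessarily what this PR ended up doing.

```rust
/// Hypothetical error type with the two variants discussed in the thread.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    SchemaFieldNotFound(String),
    ColumnNotFound(String),
}

/// Hypothetical schema: (column name, dtype name) pairs.
pub struct SketchSchema {
    pub fields: Vec<(String, String)>,
}

impl SketchSchema {
    /// Look up a field's dtype; a miss is reported as `SchemaFieldNotFound`.
    pub fn try_get(&self, name: &str) -> Result<&str, SketchError> {
        self.fields
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .map(|(_, dtype)| dtype.as_str())
            .ok_or_else(|| SketchError::SchemaFieldNotFound(name.to_string()))
    }
}

/// Validate a subset, reporting a miss as `ColumnNotFound` instead.
pub fn check_subset(schema: &SketchSchema, cols: &[String]) -> Result<(), SketchError> {
    for c in cols {
        schema.try_get(c).map_err(|e| match e {
            // Re-label the schema-level error as a column-level one.
            SketchError::SchemaFieldNotFound(n) => SketchError::ColumnNotFound(n),
            other => other,
        })?;
    }
    Ok(())
}
```

The plain `try_get(c)?` loop is shorter; the `map_err` step is only needed if callers are expected to see the column-level error.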

Biswas-N added 2 commits December 22, 2024 23:51

fix: validate column names in unique() for empty dataframes

0461cd9

Ensures that column names in the subset parameter are validated even when the dataframe is empty, maintaining consistent behavior with non-empty dataframes.

test: add test cases for invalid subset in unique()

362471a

Add test cases to verify that unique() properly handles invalid column names in subset parameter for both empty and non-empty dataframes.

github-actions bot added the fix (Bug fix) and rust (Related to Rust Polars) labels Dec 23, 2024

refactor: Improve type annotations for in test_unique.py

df1f50b

Biswas-N marked this pull request as ready for review December 23, 2024 06:33

Biswas-N requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli, reswqa and orlp as code owners December 23, 2024 06:33

fix: validate subset column names in unique() during conversion to IR

c982905

ritchie46 reviewed Dec 24, 2024


Biswas-N requested a review from ritchie46 December 24, 2024 14:49
