Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust): Validate column names in unique() for empty DataFrames #20411

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

Biswas-N
Copy link

This PR addresses an issue where the unique() function in Polars does not raise a ColumnNotFoundError when called on an empty DataFrame with an unknown subset of column names. The changes ensure that column names in the subset are validated before proceeding, thereby raising the appropriate exception.

Changes Made:

  1. Rust:

    • Added validation logic in UniqueExec executor to check the subset of column names provided exists in an empty DataFrame.
  2. Python Tests:

    • Introduced a new test method, test_unique_with_bad_subset in test_unique.py to handle scenarios where Subset column name(s) do not exist.
    • Ensured invalid subset(s) raise a ColumnNotFoundError with appropriate message.

Linked Issue:

Closes #20209

Checklist:

  • Changes rebased against the latest main branch.
  • All new and existing tests pass.
  • Verified using pytest for Python tests.
  • Code adheres to the repository's contribution guidelines.

Ensures that column names in the subset parameter are validated even
when the dataframe is empty, maintaining consistent behavior with
non-empty dataframes.
Add test cases to verify that unique() properly handles invalid column
names in subset parameter for both empty and non-empty dataframes.
@github-actions github-actions bot added fix Bug fix rust Related to Rust Polars labels Dec 23, 2024
@Biswas-N Biswas-N marked this pull request as ready for review December 23, 2024 06:33
Copy link

codecov bot commented Dec 23, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 79.02%. Comparing base (62ebbe5) to head (c982905).
Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
...ates/polars-plan/src/plans/conversion/dsl_to_ir.rs 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20411      +/-   ##
==========================================
+ Coverage   78.97%   79.02%   +0.04%     
==========================================
  Files        1562     1562              
  Lines      220103   220171      +68     
  Branches     2486     2486              
==========================================
+ Hits       173821   173981     +160     
+ Misses      45709    45617      -92     
  Partials      573      573              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ritchie46
Copy link
Member

These error should be raised during conversion to IR, not at the implementation level.

@Biswas-N
Copy link
Author

@ritchie46 thanks for having a look. I am new to contributing to pola-rs, could you help me by pointing at some code that raises issues during conversion to IR. It could help me understand pola-rs way of doing things.

let cols = expand_selectors(s, input_schema.as_ref(), &[])?;

// Checking if subset columns exist in the dataframe
cols.iter()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit nitpicky, but I think this can be written a little bit less verbose.

for c in &cols {
   let _ = input_schema.try_get(name)?
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha no problem. My understanding was, if the subset column(s) does not exist we need to raise polars.exceptions.ColumnNotFoundError. This approach would raise polars.exceptions.SchemaFieldNotFoundError. On a high level look both seem appropriate to me, but if SchemaFieldNotFoundError is correct, I can update this piece to less verbose like you suggested.

Something like this:

let cols = expand_selectors(s, input_schema.as_ref(), &[])?;

// Checking if subset columns exist in the dataframe
for col in cols.iter() {
    let _ = input_schema.try_get(col)?;
}

Ok::<_, PolarsError>(cols)

@Biswas-N Biswas-N requested a review from ritchie46 December 24, 2024 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.unique() should raise if any subset column doesn't exist on empty frame.
2 participants