Pyarrow lazyframes not collected properly in latest version #20370

Open
2 tasks done
Alex23rodriguez opened this issue Dec 19, 2024 · 3 comments
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@Alex23rodriguez

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pyarrow.dataset as ds


def scan_parquet(path: str):
    dset = ds.dataset(path)
    return pl.scan_pyarrow_dataset(dset) # this can also read partitioned folders, which is why i'm using it instead of pl.scan_parquet


def get_len(file: str, lazy=True):
    if lazy:
        df = scan_parquet(file)
    else:
        df = pl.read_parquet(file)

    return df.select(pl.len()) # notice how we select only the length


if __name__ == "__main__":
    s3path = "s3://some_bucket/path/to/some.parquet"

    # eager version works fine
    df = get_len(s3path, lazy=False)
    print(df.columns)
    print(df["len"]) 

    # lazy version
    lf = get_len(s3path, lazy=True)
    print(lf.collect_schema().names()) # correctly assumes it should have a column named len
    print(lf.collect()["len"]) # crashes because `collect()` ignores the `pl.len()` and just returns the whole parquet

Log output

Auto-selected credential provider: CredentialProviderAWS
Async thread count: 4
[FetchedCredentialsCache]: Call update_func: current_time = 1734616224, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = (never expires)
async download_chunk_size: 67108864
[FetchedCredentialsCache]: Using cached credentials
[FetchedCredentialsCache]: Using cached credentials: current_time = 1734616226, expiry = (never expires)
POLARS PREFETCH_SIZE: 22
querying metadata of 1/1 files...
reading of 1/1 file...
POLARS ROW_GROUP PREFETCH_SIZE: 128
[FetchedCredentialsCache]: Using cached credentials: current_time = 1734616226, expiry = (never expires)
parquet scan with parallel = Columns
['len']
shape: (1,)
Series: 'len' [u32]
[
        12405
]
['len']
Traceback (most recent call last):
  File "/Users/me/Documents/Misc/polars_bugs/main.py", line 29, in <module>
    print(lf.collect()["len"])
          ~~~~~~~~~~~~^^^^^^^
  File "/Users/me/Documents/Misc/polars_bugs/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py", line 1376, in __getitem__
    return get_df_item_by_key(self, key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/Documents/Misc/polars_bugs/.venv/lib/python3.12/site-packages/polars/_utils/getitem.py", line 163, in get_df_item_by_key
    return df.get_column(key)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/me/Documents/Misc/polars_bugs/.venv/lib/python3.12/site-packages/polars/dataframe/frame.py", line 8209, in get_column
    return wrap_s(self._df.get_column(name))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ColumnNotFoundError: "len" not found

Issue description

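In the latest version of Polars (1.17.1), a LazyFrame created with pl.scan_pyarrow_dataset from a cloud-hosted pyarrow dataset is no longer collected as expected: collect() ignores the select(pl.len()) projection and returns the whole Parquet file, even though collect_schema() reports a single len column.

I believe this might be related to issue #20279, but I'm not certain. In any case, the example above is simple and reproducible.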

Expected behavior

Calling df.select(pl.len()).collect() on a LazyFrame read through pyarrow should return only the row count in a len column, not the whole DataFrame.

Installed versions

--------Version info---------
Polars:              1.17.1
Index type:          UInt32
Platform:            macOS-15.2-arm64-arm-64bit
Python:              3.12.2 (main, Sep  6 2024, 17:50:00) [Clang 15.0.0 (clang-1500.3.9.4)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
boto3                1.35.84
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         <not installed>
numpy                <not installed>
openpyxl             <not installed>
pandas               <not installed>
pyarrow              18.1.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

Alex23rodriguez added the bug, needs triage, and python labels on Dec 19, 2024
@ritchie46
Member

ritchie46 commented Dec 21, 2024

> this can also read partitioned folders, which is why i'm using it instead of pl.scan_parquet

Aside from the bug, I am fairly certain we also support partitioned folders.

On the bug, I wonder if it still reproduces on the next release.

@Alex23rodriguez
Author

> I am fairly certain we also support partitioned folders

I apologize! I realized I was using pl.scan_pyarrow_dataset(dset) because it successfully recognized my AWS credentials, whereas pl.scan_parquet didn't (at the time).

This seems to be better now so I might be able to just do this :) Thank you!

@ritchie46
Member

> This seems to be better now so I might be able to just do this :)

Yes, that should be much faster. :)
