ComputeError when reading a CSV with [square brackets] in file path #19801

Open
2 tasks done
aofarrel opened this issue Nov 15, 2024 · 2 comments
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@aofarrel

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
tsv_1 = "./inputs/Merker_2022 (runindexed)/data.tsv"
tsv_2 = "./inputs/Merker_2022 [runindexed]/data.tsv"

# proves that the files are accessible and identical
print("Contents of tsv_1:")
print(open(tsv_1, "r").read())
print("Contents of tsv_2:")
print(open(tsv_2, "r").read())

print(pl.read_csv(tsv_1, separator='\t', ignore_errors=True))  # this works
print(pl.read_csv(tsv_2, separator='\t', ignore_errors=True))  # this throws the error

Log output

Contents of tsv_1:
run_index	geoloc_name	date_collection
ERR108514	Russia: Samara	-
ERR108499	Russia: Samara	-
ERR234597	Russia: Samara	-
ERR133815	Russia: Samara	-
ERR067584	Russia: Samara	-
ERR133837	Russia: Samara	-
ERR067723	Russia: Samara	-
SRR1163081	Belarus	2011
SRR1162980	Belarus	2010
SRR1163178	Belarus	2009
SRR1162977	Belarus	2010
Contents of tsv_2:
run_index	geoloc_name	date_collection
ERR108514	Russia: Samara	-
ERR108499	Russia: Samara	-
ERR234597	Russia: Samara	-
ERR133815	Russia: Samara	-
ERR067584	Russia: Samara	-
ERR133837	Russia: Samara	-
ERR067723	Russia: Samara	-
SRR1163081	Belarus	2011
SRR1162980	Belarus	2010
SRR1163178	Belarus	2009
SRR1162977	Belarus	2010
shape: (11, 3)
┌────────────┬────────────────┬─────────────────┐
│ run_index  ┆ geoloc_name    ┆ date_collection │
│ ---        ┆ ---            ┆ ---             │
│ str        ┆ str            ┆ str             │
╞════════════╪════════════════╪═════════════════╡
│ ERR108514  ┆ Russia: Samara ┆ -               │
│ ERR108499  ┆ Russia: Samara ┆ -               │
│ ERR234597  ┆ Russia: Samara ┆ -               │
│ ERR133815  ┆ Russia: Samara ┆ -               │
│ ERR067584  ┆ Russia: Samara ┆ -               │
│ …          ┆ …              ┆ …               │
│ ERR067723  ┆ Russia: Samara ┆ -               │
│ SRR1163081 ┆ Belarus        ┆ 2011            │
│ SRR1162980 ┆ Belarus        ┆ 2010            │
│ SRR1163178 ┆ Belarus        ┆ 2009            │
│ SRR1162977 ┆ Belarus        ┆ 2010            │
└────────────┴────────────────┴─────────────────┘
Traceback (most recent call last):
  File "/Users/aofarrel/github/ranchero/bug.py", line 12, in <module>
    print(pl.read_csv(tsv_2, separator='\t', ignore_errors=True))  # this throws the error
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/io/csv/functions.py", line 508, in read_csv
    df = _read_csv_impl(
         ^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/io/csv/functions.py", line 641, in _read_csv_impl
    return scan.collect()
           ^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2021, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: expected at least 1 source

Issue description

In my case, data.tsv is a valid three-column TSV file that is identical at both paths, but this seems to happen with basically any TSV/CSV as long as there are square brackets somewhere in the path. The file is attached (extension changed to .txt to keep GitHub happy).

data.tsv.txt

Expected behavior

The behavior of

print(pl.read_csv("./inputs/Merker_2022 (runindexed)/data.tsv", separator='\t', ignore_errors=True))

and

print(pl.read_csv("./inputs/Merker_2022 [runindexed]/data.tsv", separator='\t', ignore_errors=True))

should be identical, just as the two paths behave identically when opened with Python's built-in open(). If Polars can't accept square brackets in a path, it should raise an error saying so when brackets are present, or at least raise a file-not-found error.

Installed versions

--------Version info---------
Polars:              1.13.1
Index type:          UInt32
Platform:            macOS-13.6.7-x86_64-i386-64bit
Python:              3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.8.2
nest_asyncio         <not installed>
numpy                1.25.2
openpyxl             <not installed>
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
aofarrel added the bug, needs triage, and python labels on Nov 15, 2024
@cmdlineluser
Contributor

It's due to [] being glob characters and glob=True being the default.

pl.DataFrame({"x": [1]}).write_csv("[foo].csv")

pl.read_csv("[foo].csv", glob=False)
# shape: (1, 1)
# ┌─────┐
# │ x   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 1   │
# └─────┘
pl.read_csv("[foo].csv")
# ComputeError: expected at least 1 source

I'm not sure whether some sort of "Hint: did you mean glob=False?" message could be added for the case where glob characters are present but no files are matched.
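
To illustrate why the bracketed path matches nothing, here is a minimal sketch using Python's standard-library fnmatch and glob modules. It only demonstrates generic glob semantics. Polars does its own path expansion, so whether it accepts the escaped form produced by glob.escape is an assumption that this thread does not confirm. The file name is the one from the example above.

```python
import fnmatch
import glob

name = "[foo].csv"

# As a glob pattern, "[foo]" is a character class that matches exactly ONE
# character ("f" or "o"), so the pattern does not match its own literal name.
matches_itself = fnmatch.fnmatchcase(name, name)

# The same pattern does match a one-letter file name drawn from the class.
matches_f = fnmatch.fnmatchcase("f.csv", name)

# glob.escape wraps the special characters so they are treated literally.
escaped = glob.escape(name)
matches_escaped = fnmatch.fnmatchcase(name, escaped)

print(matches_itself, matches_f, escaped, matches_escaped)
```

The workaround actually shown in this thread remains passing glob=False to pl.read_csv.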

@aofarrel
Author

Oh, that explains it. Yeah, I think that kind of hint would work, or at least changing the error to a more straightforward file-not-found.
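
Until such a hint exists, a user-side guard could look roughly like the sketch below. The helper names (has_glob_chars, choose_glob_flag, read_csv_literal_first) are hypothetical and not part of Polars. The only Polars call used is pl.read_csv with the glob parameter shown earlier in this thread, and the set of glob characters is an assumption based on common glob syntax.

```python
import os

# Assumed set of characters that start a glob construct.
GLOB_CHARS = set("*?[")


def has_glob_chars(path: str) -> bool:
    """Return True if the path contains a character that globbing treats specially."""
    return any(ch in GLOB_CHARS for ch in path)


def choose_glob_flag(path: str) -> bool:
    """Pick the value for read_csv's glob argument.

    If the path contains glob characters but also exists literally on disk,
    prefer the literal file and disable globbing.
    """
    if has_glob_chars(path) and os.path.exists(path):
        return False
    return True


def read_csv_literal_first(path: str, **kwargs):
    """Hypothetical wrapper: read a literal path if it exists, else allow globbing."""
    import polars as pl  # imported lazily so the helpers above work without Polars

    return pl.read_csv(path, glob=choose_glob_flag(path), **kwargs)
```

One trade-off of this design is that a pattern which also happens to be the name of an existing file is always read as that single file.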
