Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polars dropping empty columns in read_xlsx #20376

Closed
2 tasks done
cmgoold opened this issue Dec 19, 2024 · 7 comments · Fixed by #20430
Closed
2 tasks done

Polars dropping empty columns in read_xlsx #20376

cmgoold opened this issue Dec 19, 2024 · 7 comments · Fixed by #20430
Labels
A-io-spreadsheet Area: reading/writing Excel/ODS files enhancement New feature or an improvement of an existing feature needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@cmgoold
Copy link

cmgoold commented Dec 19, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Using test.xlsx, which is attached and looks like:

Screenshot 2024-12-19 at 17 10 29

the result of read_excel is missing the first empty column:

In [1]: pl.read_excel("test.xlsx")
Out[1]: 
shape: (1, 2)
┌──────────┬──────────┐
│ Header 1Header 2 │
│ ------      │
│ strstr      │
╞══════════╪══════════╡
│ ab        │
└──────────┴──────────┘

Pandas retains the empty column/leftwise columns, which I think is the desired behaviour, since it preserves the column ordering.

In [2]: pd.read_excel("test.xlsx")
Out[2]: 
   Unnamed: 0 Unnamed: 1 Unnamed: 2
0         NaN   Header 1   Header 2
1         NaN         a           b

From what I can see, there is no available option to achieve the same behaviour in Polars.

Test data:
test.xlsx

Log output

No response

Issue description

Polars is dropping empty columns left-wise of the existing data in an Excel file. The usual behaviour of, e.g. Pandas, is to preserve the column ordering, such that we can ensure data appearing in column B, for instance, will be column B in Polars by default.

Expected behavior

Three columns in the DataFrame, the first one nulls.

Installed versions

--------Version info---------
Polars:              1.17.1
Index type:          UInt32
Platform:            macOS-14.7.1-arm64-arm-64bit
Python:              3.10.15 (main, Sep  7 2024, 00:20:06) [Clang 15.0.0 (clang-1500.3.9.4)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
boto3                1.24.59
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            0.12.0
fsspec               2023.1.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.9.0
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.3
pyarrow              18.0.0
pydantic             2.9.2
pyiceberg            <not installed>
sqlalchemy           1.4.54
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           3.2.0
@cmgoold cmgoold added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Dec 19, 2024
@cmdlineluser
Copy link
Contributor

Just some notes from trying to debug the issue:

It looks like the default engine may not yet allow empty columns: ToucanToco/fastexcel#303

import fastexcel

fastexcel.read_excel("test.xlsx").load_sheet_by_idx(0).to_polars()

# shape: (1, 2)
# ┌──────────┬──────────┐
# │ Header 1 ┆ Header 2 │
# │ ---      ┆ ---      │
# │ str      ┆ str      │
# ╞══════════╪══════════╡
# │ a        ┆ b        │
# └──────────┴──────────┘

However, it seems Polars also checks for and drops empty columns?

If DataFrame contains columns/rows that contain only nulls, drop them.

Out of interest, using the xlsx2csv engine directly was the only way I could get the desired behaviour.

import io
import polars as pl
from xlsx2csv import Xlsx2csv

buffer = io.StringIO()
Xlsx2csv("test.xlsx").convert(buffer)

pl.read_csv(buffer.getvalue().encode(), try_parse_dates=True)

# shape: (1, 3)
# ┌──────┬──────────┬──────────┐
# │      ┆ Header 1 ┆ Header 2 │
# │ ---  ┆ ---      ┆ ---      │
# │ str  ┆ str      ┆ str      │
# ╞══════╪══════════╪══════════╡
# │ null ┆ a        ┆ b        │
# └──────┴──────────┴──────────┘

@cmgoold
Copy link
Author

cmgoold commented Dec 20, 2024

Just to add on to this. I think read_csv in the above 'working' version using the string IO stream still skips the first empty row, even when has_header=False:

shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_1column_2column_3 │
│ ---------      │
│ strstrstr      │
╞══════════╪══════════╪══════════╡
│ nullHeader 1Header 2 │
│ nullab        │
└──────────┴──────────┴──────────┘

I think the desired behaviour should retain the empty row at index 0, not discard it.

@cmgoold
Copy link
Author

cmgoold commented Dec 20, 2024

Also, I have not contributed to Polars before, but if this is a decent first issue to tackle, I would be willing to take it.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Dec 22, 2024

Excel tables frequently start at arbitrary locations on the sheet - if your column doesn't have any data AND doesn't have a header, I think it's extremely reasonable that most engines will not consider it to be column.

We could consider influencing this with a new drop_empty_cols parameter, given that we already have a drop_empty_rows parameter, but I do not consider this to be a bug - indeed, I'd argue that the "xlsx2csv" engine is doing the wrong thing here by including it ;)

(I'll re-tag this as a feature request for "drop_empty_cols").

@alexander-beedie alexander-beedie added enhancement New feature or an improvement of an existing feature A-io-spreadsheet Area: reading/writing Excel/ODS files and removed bug Something isn't working labels Dec 22, 2024
@cmgoold
Copy link
Author

cmgoold commented Dec 22, 2024

@alexander-beedie I think there's arguments on both sides here. I understand your point. I didn't know whether to tag this as a bug or a feature myself, as I didn't know the design decisions behind not providing a drop_empty_cols parameter.

Precisely because Excel tables frequently start at arbitrary locations, there is a practical need to preserve this information in data pipelines. My motivation for this issue, for instance, comes from a business use case that, currently, Pandas and other dataframe libraries provide by default. It would be nice for Polars to do the same, since it's such a great library!

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Dec 22, 2024

I don't believe we should identify an empty column with no header and no values in it as a data-containing column by default, but I have no objections to implementing a related parameter that allows opt-in to this behaviour. Indeed, I can probably add it for you quite quickly 👍

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Dec 23, 2024

Precisely because Excel tables frequently start at arbitrary locations, there is a practical need to preserve this information in data pipelines. My motivation for this issue, for instance, comes from a business use case that, currently, Pandas and other dataframe libraries provide by default. It would be nice for Polars to do the same, since it's such a great library!

Have just committed a PR adding the new "drop_empty_cols" parameter on our side, but it will need a calamine/fastexcel update for it to influence that engine (which is our default). So, if you need this functionality you'll have to use one of the other two slower engines for now, setting "drop_empty_cols=False" and taking the performance hit...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-io-spreadsheet Area: reading/writing Excel/ODS files enhancement New feature or an improvement of an existing feature needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants