Polars dropping empty columns in `read_xlsx` #20376

cmgoold · 2024-12-19T17:17:43Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Using test.xlsx, which is attached and looks like:

the result of read_excel is missing the first empty column:

In [1]: pl.read_excel("test.xlsx")
Out[1]: 
shape: (1, 2)
┌──────────┬──────────┐
│ Header 1 ┆ Header 2 │
│ ---      ┆ ---      │
│ str      ┆ str      │
╞══════════╪══════════╡
│ a        ┆ b        │
└──────────┴──────────┘

Pandas retains the empty leading column, which I think is the desired behaviour, since it preserves the column positions.

In [2]: pd.read_excel("test.xlsx")
Out[2]: 
   Unnamed: 0 Unnamed: 1 Unnamed: 2
0         NaN   Header 1   Header 2
1         NaN         a           b

From what I can see, there is no available option to achieve the same behaviour in Polars.

Test data:
test.xlsx
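
Since the screenshot of the workbook is not reproduced here, the following is a minimal sketch for recreating a comparable file. It assumes the layout implied by the Pandas output above (an empty row 1 and an empty column A, with the table occupying B2:C3). The helper name `make_test_workbook` is hypothetical, and the real attachment may differ.

```python
from openpyxl import Workbook


def make_test_workbook(path: str = "test.xlsx") -> None:
    """Write a sheet whose only data sits in B2:C3 (assumed layout)."""
    wb = Workbook()
    ws = wb.active
    # Row 1 and column A are deliberately left untouched (empty).
    ws["B2"], ws["C2"] = "Header 1", "Header 2"
    ws["B3"], ws["C3"] = "a", "b"
    wb.save(path)


# Example: make_test_workbook("test.xlsx")
```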

Log output

No response

Issue description

Polars drops empty columns to the left of the existing data in an Excel file. The usual behaviour of other libraries, e.g. Pandas, is to preserve the column positions, so that data appearing in column B of the sheet, for instance, ends up in the second column of the DataFrame by default.

Expected behavior

Three columns in the DataFrame, the first containing only nulls.

Installed versions

--------Version info---------
Polars:              1.17.1
Index type:          UInt32
Platform:            macOS-14.7.1-arm64-arm-64bit
Python:              3.10.15 (main, Sep  7 2024, 00:20:06) [Clang 15.0.0 (clang-1500.3.9.4)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
boto3                1.24.59
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            0.12.0
fsspec               2023.1.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.9.0
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.3
pyarrow              18.0.0
pydantic             2.9.2
pyiceberg            <not installed>
sqlalchemy           1.4.54
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           3.2.0


cmdlineluser · 2024-12-20T09:35:15Z

Just some notes from trying to debug the issue:

It looks like the default engine may not yet allow empty columns: ToucanToco/fastexcel#303

import fastexcel

fastexcel.read_excel("test.xlsx").load_sheet_by_idx(0).to_polars()

# shape: (1, 2)
# ┌──────────┬──────────┐
# │ Header 1 ┆ Header 2 │
# │ ---      ┆ ---      │
# │ str      ┆ str      │
# ╞══════════╪══════════╡
# │ a        ┆ b        │
# └──────────┴──────────┘

However, it seems Polars also checks for and drops empty columns?

polars/py-polars/polars/io/spreadsheet/functions.py

Line 776 in ff00869

If DataFrame contains columns/rows that contain only nulls, drop them.

Out of interest, using the xlsx2csv engine directly was the only way I could get the desired behaviour.

import io
import polars as pl
from xlsx2csv import Xlsx2csv

buffer = io.StringIO()
Xlsx2csv("test.xlsx").convert(buffer)

pl.read_csv(buffer.getvalue().encode(), try_parse_dates=True)

# shape: (1, 3)
# ┌──────┬──────────┬──────────┐
# │      ┆ Header 1 ┆ Header 2 │
# │ ---  ┆ ---      ┆ ---      │
# │ str  ┆ str      ┆ str      │
# ╞══════╪══════════╪══════════╡
# │ null ┆ a        ┆ b        │
# └──────┴──────────┴──────────┘

cmgoold · 2024-12-20T11:06:13Z

Just to add on to this. I think read_csv in the above 'working' version using the string IO stream still skips the first empty row, even when has_header=False:

shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_1 ┆ column_2 ┆ column_3 │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ str      ┆ str      │
╞══════════╪══════════╪══════════╡
│ null     ┆ Header 1 ┆ Header 2 │
│ null     ┆ a        ┆ b        │
└──────────┴──────────┴──────────┘

I think the desired behaviour should retain the empty row at index 0, not discard it.

cmgoold · 2024-12-20T14:33:46Z

Also, I have not contributed to Polars before, but if this is a decent first issue to tackle, I would be willing to take it.

alexander-beedie · 2024-12-22T12:13:51Z

Excel tables frequently start at arbitrary locations on the sheet - if your column doesn't have any data AND doesn't have a header, I think it's extremely reasonable that most engines will not consider it to be a column.

We could consider influencing this with a new drop_empty_cols parameter, given that we already have a drop_empty_rows parameter, but I do not consider this to be a bug - indeed, I'd argue that the "xlsx2csv" engine is doing the wrong thing here by including it ;)

(I'll re-tag this as a feature request for "drop_empty_cols").

cmgoold · 2024-12-22T13:02:10Z

@alexander-beedie I think there are arguments on both sides here. I understand your point. I didn't know whether to tag this as a bug or a feature myself, as I didn't know the design decisions behind not providing a drop_empty_cols parameter.

Precisely because Excel tables frequently start at arbitrary locations, there is a practical need to preserve this information in data pipelines. My motivation for this issue, for instance, comes from a business use case that, currently, Pandas and other dataframe libraries provide by default. It would be nice for Polars to do the same, since it's such a great library!

alexander-beedie · 2024-12-22T13:07:39Z

I don't believe we should identify an empty column with no header and no values in it as a data-containing column by default, but I have no objections to implementing a related parameter that allows opt-in to this behaviour. Indeed, I can probably add it for you quite quickly 👍

alexander-beedie · 2024-12-23T23:03:27Z

> Precisely because Excel tables frequently start at arbitrary locations, there is a practical need to preserve this information in data pipelines. My motivation for this issue, for instance, comes from a business use case that, currently, Pandas and other dataframe libraries provide by default. It would be nice for Polars to do the same, since it's such a great library!

Have just committed a PR adding the new "drop_empty_cols" parameter on our side, but it will need a calamine/fastexcel update for it to influence that engine (which is our default). So, if you need this functionality you'll have to use one of the other two slower engines for now, setting "drop_empty_cols=False" and taking the performance hit...

cmgoold added the bug, needs triage, and python labels · Dec 19, 2024

alexander-beedie added the enhancement and A-io-spreadsheet labels and removed the bug label · Dec 22, 2024

alexander-beedie mentioned this issue Dec 23, 2024

feat(python): Add "drop_empty_cols" parameter for read_excel and read_ods #20430

Merged

ritchie46 closed this as completed in #20430 Dec 24, 2024
