-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Polars dropping empty columns in read_xlsx
#20376
Comments
Just some notes from trying to debug the issue: It looks like the default engine may not yet allow empty columns: ToucanToco/fastexcel#303 import fastexcel
fastexcel.read_excel("test.xlsx").load_sheet_by_idx(0).to_polars()
# shape: (1, 2)
# ┌──────────┬──────────┐
# │ Header 1 ┆ Header 2 │
# │ --- ┆ --- │
# │ str ┆ str │
# ╞══════════╪══════════╡
# │ a ┆ b │
# └──────────┴──────────┘ However, it seems Polars also checks for and drops empty columns?
Out of interest, using the xlsx2csv engine directly was the only way I could get the desired behaviour. import io
import polars as pl
from xlsx2csv import Xlsx2csv
buffer = io.StringIO()
Xlsx2csv("test.xlsx").convert(buffer)
pl.read_csv(buffer.getvalue().encode(), try_parse_dates=True)
# shape: (1, 3)
# ┌──────┬──────────┬──────────┐
# │ ┆ Header 1 ┆ Header 2 │
# │ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str │
# ╞══════╪══════════╪══════════╡
# │ null ┆ a ┆ b │
# └──────┴──────────┴──────────┘ |
Just to add on to this. I think shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_1 ┆ column_2 ┆ column_3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ null ┆ Header 1 ┆ Header 2 │
│ null ┆ a ┆ b │
└──────────┴──────────┴──────────┘ I think the desired behaviour should retain the empty row at index |
Also, I have not contributed to Polars before, but if this is a decent first issue to tackle, I would be willing to take it. |
Excel tables frequently start at arbitrary locations on the sheet - if your column doesn't have any data AND doesn't have a header, I think it's extremely reasonable that most engines will not consider it to be column. We could consider influencing this with a new (I'll re-tag this as a feature request for "drop_empty_cols"). |
@alexander-beedie I think there's arguments on both sides here. I understand your point. I didn't know whether to tag this as a Precisely because Excel tables frequently start at arbitrary locations, there is a practical need to preserve this information in data pipelines. My motivation for this issue, for instance, comes from a business use case that, currently, Pandas and other dataframe libraries provide by default. It would be nice for Polars to do the same, since it's such a great library! |
I don't believe we should identify an empty column with no header and no values in it as a data-containing column by default, but I have no objections to implementing a related parameter that allows opt-in to this behaviour. Indeed, I can probably add it for you quite quickly 👍 |
Have just committed a PR adding the new "drop_empty_cols" parameter on our side, but it will need a calamine/fastexcel update for it to influence that engine (which is our default). So, if you need this functionality you'll have to use one of the other two slower engines for now, setting "drop_empty_cols=False" and taking the performance hit... |
Checks
Reproducible example
Using
test.xlsx
, which is attached and looks like:the result of
read_excel
is missing the first empty column:Pandas retains the empty column/leftwise columns, which I think is the desired behaviour, since it preserves the column ordering.
From what I can see, there is no available option to achieve the same behaviour in Polars.
Test data:
test.xlsx
Log output
No response
Issue description
Polars is dropping empty columns left-wise of the existing data in an Excel file. The usual behaviour of, e.g. Pandas, is to preserve the column ordering, such that we can ensure data appearing in column B, for instance, will be column B in Polars by default.
Expected behavior
Three columns in the
DataFrame
, the first one nulls.Installed versions
The text was updated successfully, but these errors were encountered: