Incompatibility between pandas and rust #20366

moghadas76 · 2024-12-19T13:00:45Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df.to_csv(...)  # df is a pandas DataFrame
pl.read_csv(...)

Log output

File ~/miniconda3/envs/env/lib/python3.11/site-packages/polars/io/csv/functions.py:672, in _read_csv_impl(source, has_header, columns, separator, comment_prefix, quote_char, skip_rows, schema, schema_overrides, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines, decimal_comma, glob)
    668         raise ValueError(msg)
    670 projection, columns = parse_columns_arg(columns)
--> 672 pydf = PyDataFrame.read_csv(
    673     source,
    674     infer_schema_length,
    675     batch_size,
    676     has_header,
    677     ignore_errors,
    678     n_rows,
    679     skip_rows,
    680     projection,
    681     separator,
    682     rechunk,
    683     columns,
    684     encoding,
    685     n_threads,
    686     path,
    687     dtype_list,
    688     dtype_slice,
    689     low_memory,
    690     comment_prefix,
    691     quote_char,
    692     processed_null_values,
    693     missing_utf8_is_empty_string,
    694     try_parse_dates,
    695     skip_rows_after_header,
    696     parse_row_index_args(row_index_name, row_index_offset),
    697     eol_char=eol_char,
    698     raise_if_empty=raise_if_empty,
    699     truncate_ragged_lines=truncate_ragged_lines,
    700     decimal_comma=decimal_comma,
    701     schema=schema,
    702 )
    703 return wrap_df(pydf)

ComputeError: could not parse `A` as dtype `i64` at column 'properties_local_ref' (column number 36)

The current offset in the file is 7327452 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `schema_overrides` argument
- setting `ignore_errors` to `True`,
- adding `A` to the `null_values` list.

Original error: remaining bytes non-empty

Issue description

File ~/miniconda3/envs/env/lib/python3.11/site-packages/polars/io/csv/functions.py:672, in _read_csv_impl(source, has_header, columns, separator, comment_prefix, quote_char, skip_rows, schema, schema_overrides, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines, decimal_comma, glob)
raise ValueError(msg)
projection, columns = parse_columns_arg(columns)
pydf = PyDataFrame.read_csv(
source,
infer_schema_length,
batch_size,
has_header,
ignore_errors,
n_rows,
skip_rows,
projection,
separator,
rechunk,
columns,
encoding,
n_threads,
path,
dtype_list,
dtype_slice,
low_memory,
comment_prefix,
quote_char,
processed_null_values,
missing_utf8_is_empty_string,
try_parse_dates,
skip_rows_after_header,
parse_row_index_args(row_index_name, row_index_offset),
eol_char=eol_char,
raise_if_empty=raise_if_empty,
truncate_ragged_lines=truncate_ragged_lines,
decimal_comma=decimal_comma,
schema=schema,
)
return wrap_df(pydf)

ComputeError: could not parse `A` as dtype `i64` at column 'properties_local_ref' (column number 36)

The current offset in the file is 7327452 bytes.

You might want to try:

- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `schema_overrides` argument
- setting `ignore_errors` to `True`,
- adding `A` to the `null_values` list.

Original error: remaining bytes non-empty

Expected behavior

Consistent loading: a CSV file written by pandas should load in Polars without errors.

Installed versions

--------Version info---------
Polars:              1.14.0
Index type:          UInt32
Platform:            Linux-6.8.0-49-generic-x86_64-with-glibc2.35
Python:              3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
LTS CPU:             False

orlp · 2024-12-19T13:01:44Z

Without a dataset that reproduces the issue, we can't investigate it or fix any potential bugs.

moghadas76 · 2024-12-19T13:04:34Z

I want to contribute to the project. Could you tell me what the procedure is?

orlp · 2024-12-19T13:27:12Z

@moghadas76 As I said, the first step is to provide a minimal reproducible example of the problem. That includes any data necessary to reproduce it.

hutch3232 · 2024-12-20T14:32:14Z

What's the dtype of `properties_local_ref` when it's in a `pd.DataFrame`? The error is pretty clear: the parser inferred that the column was an integer column, but then it encountered an "A", and that's why it errored.
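To illustrate the point about inference, here is a toy sketch in plain Python of what happens when a column type is guessed from only the first rows of a file. This is not Polars' actual implementation; the function `infer_int_columns`, the sample sizes, and the data are all hypothetical and only meant to show why a late "A" can contradict an early guess.

```python
import csv
import io


def infer_int_columns(text: str, sample_rows: int) -> dict[str, bool]:
    """Toy inference: a column counts as integer if every sampled value parses as one."""
    reader = csv.DictReader(io.StringIO(text))
    is_int: dict[str, bool] = {}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break  # rows past the sample are never looked at
        for name, value in row.items():
            looks_int = value.lstrip("-").isdigit()
            is_int[name] = is_int.get(name, True) and looks_int
    return is_int


# Hypothetical data: 150 numeric values followed by a single "A".
rows = ["id,properties_local_ref"]
rows += [f"{i},{i % 7}" for i in range(150)]
rows.append("150,A")
text = "\n".join(rows)

# A short sample never sees the "A", so the column is guessed to be integer.
short_sample = infer_int_columns(text, sample_rows=100)
# A sample covering the whole file sees the "A" and rejects the integer guess.
full_sample = infer_int_columns(text, sample_rows=10_000)

print(short_sample["properties_local_ref"], full_sample["properties_local_ref"])
```

With the short sample the guess is "integer", and a reader that commits to that guess would fail once it reaches the last row, which matches the shape of the error in the log.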

It gave suggestions on how to work around that.

moghadas76 added the bug, needs triage, and python labels Dec 19, 2024

orlp added the needs repro label Dec 19, 2024

rodrigogiraoserrao changed the title ~~Inompatibility between pandas and rust~~ Incompatibility between pandas and rust Dec 19, 2024
